Peter Heywood, Research Software Engineer
The University of Sheffield
2024-01-23
Celeritas is a new Monte Carlo transport code designed for high-performance simulation of high-energy physics detectors.
The Celeritas project implements HEP detector physics on GPU accelerator hardware with the ultimate goal of supporting the massive computational requirements of the HL-LHC upgrade.
`ncu` (NVIDIA Nsight Compute)

`CELER_ENABLE_PROFILE` / `CELER_ENABLE_PROFILING`:

- `CELER_ENABLE_PROFILE=1` (added in PR #827) enables the profiling regions
- `nsys` can then be used to only profile selected regions
- `ncu`
- Thrust methods are called per step which allocate and deallocate temporary memory
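The profiling regions are scoped ranges (the `ScopedProfiling` objects visible in the code further down). As a rough illustration of the underlying technique, not the Celeritas implementation, an RAII NVTX range can be written as:

```cuda
// Minimal RAII NVTX range: pushes a named range on construction and pops it
// on destruction, so the region shows up in nsys/ncu timelines.
// Sketch only; Celeritas' ScopedProfiling adds its own configuration on top.
#include <nvtx3/nvToolsExt.h>

class ScopedRange
{
  public:
    explicit ScopedRange(char const* label) { nvtxRangePushA(label); }
    ~ScopedRange() { nvtxRangePop(); }

    // Non-copyable: the range is tied to this scope
    ScopedRange(ScopedRange const&) = delete;
    ScopedRange& operator=(ScopedRange const&) = delete;
};

void do_step()
{
    ScopedRange profile_this{"remove-if-alive"};
    // ... kernel launches and Thrust calls for this region ...
}
```

Ranges like this are what allow `nsys` to restrict capture to selected regions rather than the whole run.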
Thrust library calls

CCCL (CUDA C++ Core Libraries): github.com/NVIDIA/cccl
- Thrust "allows you to implement high performance parallel applications with minimal programming effort through a high-level interface"
- CUB "provides state-of-the-art, reusable software components for every layer of the CUDA programming model"
- libcu++: "The C++ Standard Library for Your Entire System"
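To illustrate that high-level interface, a standalone Thrust example (generic, not Celeritas code):

```cuda
#include <thrust/device_vector.h>
#include <thrust/reduce.h>
#include <thrust/sequence.h>

int main()
{
    // Fill a device vector with 0..9 and reduce it on the GPU.
    // Thrust chooses the kernels and manages any temporary storage.
    thrust::device_vector<int> values(10);
    thrust::sequence(values.begin(), values.end());
    int sum = thrust::reduce(values.begin(), values.end(), 0);
    return sum == 45 ? 0 : 1;
}
```

That convenience is also where the per-step temporary allocations mentioned above come from: algorithms such as `remove_if` and `exclusive_scan` allocate and free scratch device memory internally on every call.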
`remove_if_alive`:

```cuda
size_type remove_if_alive(
    StateCollection<TrackSlotId, Ownership::reference, MemSpace::device> const&
        vacancies,
    StreamId stream_id)
{
    ScopedProfiling profile_this{"remove-if-alive"};
    auto start = device_pointer_cast(vacancies.data());
    auto end = thrust::remove_if(thrust_execute_on(stream_id),
                                 start,
                                 start + vacancies.size(),
                                 IsEqual{occupied()});
    CELER_DEVICE_CHECK_ERROR();
    // New size of the vacancy vector
    return end - start;
}
```
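`IsEqual{occupied()}` is the predicate that marks occupied slots for removal; its definition lives in Celeritas' detail headers. For readers unfamiliar with Thrust, a hypothetical functor of the shape `thrust::remove_if` expects (illustrative names only) looks like:

```cuda
// Hypothetical predicate sketch: the real IsEqual and occupied() are
// Celeritas internals and are not reproduced here.
template<class T>
struct IsEqualTo
{
    T value;  // e.g. the sentinel marking an occupied track slot

    __host__ __device__ bool operator()(T const& x) const
    {
        return x == value;
    }
};
```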
`exclusive_scan_counts`:

```cuda
size_type exclusive_scan_counts(
    StateCollection<size_type, Ownership::reference, MemSpace::device> const&
        counts,
    StreamId stream_id)
{
    ScopedProfiling profile_this{"exclusive-scan-conts"};
    // Exclusive scan:
    auto data = device_pointer_cast(counts.data());
    auto stop = thrust::exclusive_scan(thrust_execute_on(stream_id),
                                       data,
                                       data + counts.size(),
                                       data,
                                       size_type(0));
    CELER_DEVICE_CHECK_ERROR();
    // Copy the last element (accumulated total) back to host
    return *(stop - 1);
}
```
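Both wrappers above let Thrust allocate and free their temporary device storage on every call. One of the alternatives listed below is to call CUB directly and manage that storage yourself; a generic sketch of the two-phase `cub::DeviceScan::ExclusiveScan` pattern (illustrative names, not Celeritas code):

```cuda
#include <cub/cub.cuh>
#include <cuda_runtime.h>

// Two-phase CUB call: query the temporary-storage size, allocate it once
// (or grow it), then reuse it every step instead of allocating per call.
void scan_counts(unsigned int const* d_counts,
                 unsigned int* d_offsets,
                 int num_items,
                 void*& d_temp,
                 size_t& temp_bytes,
                 cudaStream_t stream)
{
    size_t required = 0;
    // First call with a null temporary buffer only computes `required`
    cub::DeviceScan::ExclusiveScan(
        nullptr, required, d_counts, d_offsets, cub::Sum{}, 0u, num_items, stream);
    if (required > temp_bytes)
    {
        if (d_temp)
            cudaFreeAsync(d_temp, stream);
        cudaMallocAsync(&d_temp, required, stream);
        temp_bytes = required;
    }
    // Second call performs the scan using the caller-owned buffer
    cub::DeviceScan::ExclusiveScan(
        d_temp, temp_bytes, d_counts, d_offsets, cub::Sum{}, 0u, num_items, stream);
}
```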
Options to avoid the per-step temporary allocations:

- Replace `thrust::remove_if` with `cub::copy_if`
- Replace `thrust::exclusive_scan` with `cub::DeviceScan::ExclusiveScan`
- `cudaMallocAsync` (stream-ordered, pooled allocation)
- A custom `thrust::cuda::allocator` (see the sketch after this list):
  - replaces the N allocations per step with between 1 and N, and the N deallocations with 1
  - ~10s of lines of code
  - ~10 µs per step improvement (~0.2%)
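A minimal sketch of the custom-allocator idea, assuming Thrust's documented support for passing an allocator into its CUDA execution policy (this follows the pattern of Thrust's `custom_temporary_allocation` example and is not the Celeritas implementation):

```cuda
#include <cstddef>
#include <map>
#include <thrust/device_vector.h>
#include <thrust/sort.h>
#include <thrust/system/cuda/execution_policy.h>
#include <thrust/system/cuda/memory.h>

// Very small caching allocator: freed blocks are kept and reused, so
// repeated Thrust calls stop hitting cudaMalloc/cudaFree every step.
// (Sketch only: blocks are never returned to the driver.)
struct CachedAllocator
{
    using value_type = char;

    std::multimap<std::ptrdiff_t, char*> free_blocks;

    char* allocate(std::ptrdiff_t num_bytes)
    {
        auto it = free_blocks.find(num_bytes);
        if (it != free_blocks.end())
        {
            char* ptr = it->second;
            free_blocks.erase(it);
            return ptr;
        }
        return thrust::cuda::malloc<char>(num_bytes).get();
    }

    void deallocate(char* ptr, std::size_t num_bytes)
    {
        // Return the block to the cache instead of freeing it
        free_blocks.insert({static_cast<std::ptrdiff_t>(num_bytes), ptr});
    }
};

int main()
{
    thrust::device_vector<int> keys(1 << 20, 1);
    CachedAllocator alloc;

    // Temporary storage for each call is served from `alloc`; only the
    // first iteration pays for a real device allocation.
    for (int step = 0; step < 10; ++step)
    {
        thrust::sort(thrust::cuda::par(alloc), keys.begin(), keys.end());
    }
    return 0;
}
```

In a multi-stream setting the allocator-carrying policy can also be bound to a stream with `.on(stream)`.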

- 1-10 µs of overhead per kernel launch
- 6087 kernel launches in 438 ms of TestEM3 with a smaller input file

CUDA Graphs have been designed to allow work to be defined as graphs rather than single operations, "providing a mechanism to launch multiple GPU operations through a single CPU operation, and hence reduce overheads".
Simple kernel, ~6 µs per call:

```cuda
#define N 200000
__global__ void shortKernel(float * out_d, float * in_d){
  int idx=blockIdx.x*blockDim.x+threadIdx.x;
  if(idx<N) out_d[idx]=1.23*in_d[idx];
}
```
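For comparison, the baseline version launches and synchronizes each kernel individually, so the launch overhead is paid on every call (a sketch following the same pattern and using the same `NSTEP`, `NKERNEL`, `blocks`, `threads`, `stream`, and device buffers as the graph version below):

```cuda
// Baseline: every one of the NSTEP*NKERNEL launches pays the full
// launch + synchronization overhead on the CPU.
for (int istep = 0; istep < NSTEP; istep++) {
  for (int ikrnl = 0; ikrnl < NKERNEL; ikrnl++) {
    shortKernel<<<blocks, threads, 0, stream>>>(out_d, in_d);
    cudaStreamSynchronize(stream);
  }
}
```

The graph version below captures the inner loop once and then replays it with a single launch per step.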


Stream capture and graph launch:

```cuda
  bool graphCreated=false;
  cudaGraph_t graph;
  cudaGraphExec_t instance;
  for(int istep=0; istep<NSTEP; istep++){
    if(!graphCreated){
      cudaStreamBeginCapture(stream, cudaStreamCaptureModeGlobal);
      for(int ikrnl=0; ikrnl<NKERNEL; ikrnl++){
        shortKernel<<<blocks, threads, 0, stream>>>(out_d, in_d);
      }
      cudaStreamEndCapture(stream, &graph);
      cudaGraphInstantiate(&instance, graph, NULL, NULL, 0);
      graphCreated=true;
    }
    cudaGraphLaunch(instance, stream);
    cudaStreamSynchronize(stream);
  }
```

5-7 µs per kernel without overheads

| Method | Elapsed Time (ms) | Per Kernel w/ Overheads (µs) |
|---|---|---|
| Synchronized | 231 | 11.55 | 
| Streams | 141 | 7.05 | 
| Graph | 124 | 6.20 | 

`along-step-uniform-msc::propagate`: `IsAlongStepActionEqual` with `UniformFieldPropagatorFactory`. The full demangled kernel name:

```
void celeritas::detail::<unnamed>::launch_action_impl<
  celeritas::ConditionalTrackExecutor<celeritas::detail::IsAlongStepActionEqual,
  celeritas::detail::PropagationApplier<celeritas::detail::UniformFieldPropagatorFactory, void>>,
  celeritas::detail::PropagationApplier<celeritas::detail::UniformFieldPropagatorFactory, void>,
  (bool)>(celeritas::Range<celeritas::OpaqueId<celeritas::Thread_, unsigned int>>, T1)
```
`along-step-uniform-msc::propagate` kernel (profiler screenshots).

Sorting consistently improves performance for the cms2018 problem but decreases it for tilecal:
… the radix sort kernel is a significant overhead in that case …
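The sort in question reorders track slots by a key (such as the next action) before the kernels run. A generic Thrust sketch of that key-value sort (illustrative names, not the Celeritas track-sorting code):

```cuda
#include <thrust/device_vector.h>
#include <thrust/execution_policy.h>
#include <thrust/sequence.h>
#include <thrust/sort.h>

// Reorder track slots so that tracks due to run the same action are
// contiguous; this sort_by_key is what shows up as the radix sort kernel.
void sort_tracks_by_action(thrust::device_vector<unsigned int>& action_ids,
                           thrust::device_vector<unsigned int>& track_slots)
{
    thrust::sort_by_key(thrust::device,
                        action_ids.begin(),
                        action_ids.end(),
                        track_slots.begin());
}

int main()
{
    // Toy data: 8 tracks with arbitrary action ids
    unsigned int const host_actions[] = {3, 1, 2, 1, 3, 0, 2, 0};
    thrust::device_vector<unsigned int> action_ids(host_actions, host_actions + 8);
    thrust::device_vector<unsigned int> track_slots(8);
    thrust::sequence(track_slots.begin(), track_slots.end());

    sort_tracks_by_action(action_ids, track_slots);
    return 0;
}
```

Whether the reordering pays for itself depends on the problem, which is exactly the trade-off reported above for cms2018 versus tilecal.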


`ncu` Source counts for a shorter invocation of the same `along-step-uniform-msc::propagate` kernel. Unfortunately the profile used for this talk was incomplete and collected without lineinfo.