Progress report: Profiling Celeritas on Nvidia GPUs

Peter Heywood, Research Software Engineer

The University of Sheffield

2024-01-23

Recap

June 2023 Workshop

Celeritas

Celeritas is a new Monte Carlo transport code designed for high-performance simulation of high-energy physics detectors.

The Celeritas project implements HEP detector physics on GPU accelerator hardware with the ultimate goal of supporting the massive computational requirements of the HL-LHC upgrade.

GPUs

  • Development machine:
    • NVIDIA Titan V (SM 70, 250W)
    • Intel i7-6850K
  • TUoS Stanage HPC (Tier 3):
    • NVIDIA H100 PCI-e (SM 90, 350W)
    • NVIDIA A100 SXM4 (SM 80, 500W)
    • AMD EPYC 7413
    • Exclusive reservation to use ncu

Titan Xp & Titan V GPUs

NVTX Annotations

Timeline without NVTX

Timeline showing a 60 ms portion of TestEM3 without CELER_ENABLE_PROFILING

CELER_ENABLE_PROFILING

Timeline showing a 60 ms portion of TestEM3 with CELER_ENABLE_PROFILING=1 in PR #827

Nsight use of NVTX

  • Can instruct nsys to profile only selected regions
nsys profile -c nvtx -p celer-sim@celeritas ... 
  • Can profile only specific kernels in ncu, by NVTX range
    • Requires a recent enough ncu
ncu --set=full --nvtx --nvtx-include "celeritas@celer-sim/step/*" \
    --launch-skip 1000 --launch-count 50 \
    -k regex:"launch_action_impl|_kernel_agent" \
    ...
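The NVTX ranges enabled by CELER_ENABLE_PROFILING can be reproduced in user code with the NVTX v3 C API. A minimal RAII sketch of such an annotation (the class name and usage are illustrative, not Celeritas's actual ScopedProfiling implementation):

```cuda
#include <nvtx3/nvToolsExt.h>  // NVTX v3 header-only API

// Illustrative RAII wrapper: push a named range on construction,
// pop it on destruction, so the region appears in nsys/ncu timelines.
class ScopedNvtxRange {
  public:
    explicit ScopedNvtxRange(char const* name) { nvtxRangePushA(name); }
    ~ScopedNvtxRange() { nvtxRangePop(); }
};

void step() {
    ScopedNvtxRange range{"step"};
    // ... kernel launches here show up nested under the "step" range
}
```

Ranges annotated this way are what the `--nvtx-include` and `-c nvtx` filters above match against.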

Current Optimisation Candidates

1: Reduce Allocations/Deallocations

Allocations and Deallocations

  • Allocations of GPU memory are relatively expensive
    • Though an individual allocation is cheap compared to a typical kernel
  • Celeritas is an iterative process
    • Time spent on allocations/deallocations inside the loop adds up
    • Thrust methods called every step allocate and free temporary memory


cudaMalloc(void** devPtr, size_t size)

cudaFree(void* devPtr)
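A common mitigation for per-step allocation cost is a grow-only scratch buffer that calls cudaMalloc/cudaFree only when a new high-water mark is reached, so steady-state steps allocate nothing. A hedged sketch (the struct and its names are illustrative, not Celeritas code):

```cuda
#include <cuda_runtime.h>
#include <cstddef>

// Illustrative grow-only device scratch buffer: reallocate only when the
// requested size exceeds the current capacity, so repeated steps of the
// same size make no cudaMalloc/cudaFree calls at all.
struct ScratchBuffer {
    void*  ptr      = nullptr;
    size_t capacity = 0;

    void* require(size_t bytes) {
        if (bytes > capacity) {
            cudaFree(ptr);            // no-op when ptr is nullptr
            cudaMalloc(&ptr, bytes);  // grow to the new high-water mark
            capacity = bytes;
        }
        return ptr;
    }
    ~ScratchBuffer() { cudaFree(ptr); }
};
```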

Per-step Allocation and Deallocations

Nsys timeline showing the per-step allocations and deallocations within Thrust library calls

CCCL: Thrust, CUB & libcu++

  • CCCL (CUDA C++ Core Libraries) - github.com/NVIDIA/cccl

    • Thrust

    Thrust allows you to implement high performance parallel applications with minimal programming effort through a high-level interface

    • CUB

    CUB provides state-of-the-art, reusable software components for every layer of the CUDA programming model:

    • libcu++

    The C++ Standard Library for Your Entire System

remove_if_alive

size_type remove_if_alive(
    StateCollection<TrackSlotId, Ownership::reference, MemSpace::device> const&
        vacancies,
    StreamId stream_id)
{
    ScopedProfiling profile_this{"remove-if-alive"};
    auto start = device_pointer_cast(vacancies.data());
    auto end = thrust::remove_if(thrust_execute_on(stream_id),
                                 start,
                                 start + vacancies.size(),
                                 IsEqual{occupied()});
    CELER_DEVICE_CHECK_ERROR();
    // New size of the vacancy vector
    return end - start;
}

remove_if_alive

Timeline view of remove_if_alive

exclusive_scan_counts

size_type exclusive_scan_counts(
    StateCollection<size_type, Ownership::reference, MemSpace::device> const&
        counts,
    StreamId stream_id)
{
    ScopedProfiling profile_this{"exclusive-scan-counts"};
    // Exclusive scan:
    auto data = device_pointer_cast(counts.data());
    auto stop = thrust::exclusive_scan(thrust_execute_on(stream_id),
                                       data,
                                       data + counts.size(),
                                       data,
                                       size_type(0));
    CELER_DEVICE_CHECK_ERROR();
    // Copy the last element (accumulated total) back to host
    return *(stop - 1);
}

exclusive_scan_counts

Timeline view of exclusive_scan_counts

Thrust -> CUB

  • Replace thrust::remove_if with cub::DeviceSelect::If
    • Reallocate only when the required temporary memory grows
    • The benefit of this is mostly masked when cudaMallocAsync is in use
  • Replace thrust::exclusive_scan with cub::DeviceScan::ExclusiveScan
    • Reallocate only when the required temporary memory grows
  • Alternatively, a custom Thrust allocator could be used
    • By extending thrust::cuda::allocator
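CUB's device-wide algorithms use a two-phase calling convention that makes the temporary allocation explicit, so the buffer can be persisted and grown only when needed. A hedged sketch of the scan replacement (the wrapper function and its parameters are illustrative, not Celeritas's actual code):

```cuda
#include <cub/device/device_scan.cuh>
#include <cuda_runtime.h>

// Illustrative two-phase CUB call: the first ExclusiveScan call (with a
// null temp pointer) only reports the required temporary-storage size; the
// second performs the in-place scan. Keeping d_temp alive between steps
// replaces the per-step malloc/free pair Thrust made internally.
void scan_counts(unsigned* d_counts, int n, cudaStream_t stream,
                 void*& d_temp, size_t& temp_capacity) {
    size_t temp_bytes = 0;
    // Phase 1: query required temporary storage
    cub::DeviceScan::ExclusiveScan(nullptr, temp_bytes, d_counts, d_counts,
                                   cub::Sum{}, 0u, n, stream);
    if (temp_bytes > temp_capacity) {
        cudaFree(d_temp);
        cudaMalloc(&d_temp, temp_bytes);
        temp_capacity = temp_bytes;
    }
    // Phase 2: run the in-place exclusive scan
    cub::DeviceScan::ExclusiveScan(d_temp, temp_bytes, d_counts, d_counts,
                                   cub::Sum{}, 0u, n, stream);
}
```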

Thrust -> CUB

  • Small absolute performance improvement.
    • Replaces N allocations with between 1 and N
    • Replaces N deallocations with 1
    • ~10s of lines of code
    • For TestEM3 13 TeV
      • ~10 µs per step improvement
      • < 0.2% improvement

2: CUDA Graph API

Kernel Launch Overheads

Launch overhead costs in smaller models (Simple CMS)

Kernel Launch Overheads

Launch overhead costs in smaller models (Simple CMS)

Kernel Launch Overheads

  • Overhead costs associated with submitting work to the GPU
    • I.e. gaps between kernels
  • Order of 1-10 µs per kernel launch
  • When many short running kernels are launched, this adds up.
  • 6087 kernel launches in 438ms of TestEM3 with a smaller input file.

CUDA Graph API

CUDA Graphs have been designed to allow work to be defined as graphs rather than single operations.

providing a mechanism to launch multiple GPU operations through a single CPU operation, and hence reduce overheads

developer.nvidia.com/blog/cuda-graphs

Simple example of Short Kernels

Simple kernel, ~6µs per call

#define N 200000

__global__ void shortKernel(float * out_d, float * in_d){
  int idx=blockIdx.x*blockDim.x+threadIdx.x;
  if(idx<N) out_d[idx]=1.23*in_d[idx];
}


  #define NSTEP 1000
  #define NKERNEL 20

  for(int istep=0; istep<NSTEP; istep++){
    for(int ikrnl=0; ikrnl<NKERNEL; ikrnl++){
      shortKernel<<<blocks, threads, 0, stream>>>(out_d, in_d);
      cudaStreamSynchronize(stream);
    }
  }

Short Kernels with stream sync

Example of many short kernel launches with a stream sync

Short Kernels with one sync per iteration

Example of many short kernel launches without stream synchronisation

Short Kernels using the CUDA Graph API

Example of many short kernel launches using a CUDA Graph

Short Kernels using the CUDA Graph API

  bool graphCreated=false;
  cudaGraph_t graph;
  cudaGraphExec_t instance;
  for(int istep=0; istep<NSTEP; istep++){
    if(!graphCreated){
      cudaStreamBeginCapture(stream, cudaStreamCaptureModeGlobal);
      for(int ikrnl=0; ikrnl<NKERNEL; ikrnl++){
        shortKernel<<<blocks, threads, 0, stream>>>(out_d, in_d);
      }
      cudaStreamEndCapture(stream, &graph);
      cudaGraphInstantiate(&instance, graph, NULL, NULL, 0);
      graphCreated=true;
    }
    cudaGraphLaunch(instance, stream);
    cudaStreamSynchronize(stream);
  }

CUDA Graph Performance

  • NVIDIA 3060ti
  • CUDA 12.3
  • 1000 “iterations” of 20 kernel launches
  • ~5-7µs per kernel without overheads
Method        Elapsed Time (ms)   Per Kernel w/ Overheads (µs)
Synchronized  231                 11.55
Streams       141                 7.05
Graph         124                 6.20

Limitations

  • Absolute performance improvement will be small
    • More significant for simple geometries / fewer tracks
    • More improvement for more events & steps
  • May require:
    • Significant changes to host code and kernels
    • Multiple graphs per step
    • CUDA graph features from recent CUDA versions
      • May not yet be supported by ROCm (AMD)

3: Memory Access Pattern

Longest running kernels for TestEM3

Longest duration kernel in TestEM3 with an increased number of tracks
along-step-uniform-msc::propagate
IsAlongStepActionEqual with UniformFieldPropagatorFactory
void celeritas::detail::<unnamed>::launch_action_impl<
  celeritas::ConditionalTrackExecutor<celeritas::detail::IsAlongStepActionEqual,
  celeritas::detail::PropagationApplier<celeritas::detail::UniformFieldPropagatorFactory, void>>,
  celeritas::detail::PropagationApplier<celeritas::detail::UniformFieldPropagatorFactory, void>,
  (bool)>(celeritas::Range<celeritas::OpaqueId<celeritas::Thread_, unsigned int>>, T1)

along-step-uniform-msc::propagate

Nsight Compute “Speed-Of-Light” for a shorter invocation of the same along-step-uniform-msc::propagate kernel

along-step-uniform-msc::propagate

Nsight Compute Memory Analysis for a shorter invocation of the same along-step-uniform-msc::propagate kernel

along-step-uniform-msc::propagate

Nsight Compute Memory Analysis for a shorter invocation of the same along-step-uniform-msc::propagate kernel

GPU memory access pattern

  • GPUs perform best when memory accesses are coalesced
    • Neighbouring threads read neighbouring elements of memory
  • Reads from global memory are serviced 128 bytes at a time
    • So for FP64 data the best case is 2 transactions per warp-wide request
    • Worst case: one transaction per thread, i.e. 32 per warp
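The difference can be illustrated with a structure-of-arrays versus array-of-structures read (the layout and names here are illustrative):

```cuda
// Illustrative coalesced vs scattered access. With the SoA layout, thread i
// reads x[i], so a warp touches only a couple of 128-byte lines per request;
// with the AoS layout the same logical read is strided by the struct size
// and needs many more transactions.
struct TrackAoS { double x, y, z, energy; };  // hypothetical 32-byte layout

__global__ void read_soa(double const* x, double* out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = x[i];  // neighbouring threads -> neighbouring doubles
}

__global__ void read_aos(TrackAoS const* tracks, double* out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = tracks[i].x;  // 32-byte stride between threads
}
```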

Coalesced vs Scattered Access Pattern

Sorting Tracks in Celeritas

  • Sort tracks so neighbouring threads access neighbouring memory
  • But sorting is expensive - the cost may outweigh the benefits
  • Celeritas developers have considered this

    Sorting consistently improves the performance for the cms2018 problem but decreases the performance with the tilecal.

    … the radix sort kernel is a significant overhead in that case …

  • Can find a compromise
    • Sort infrequently
    • Different sort order
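One way to sketch such a reordering is a key-value sort of track slot indices, keyed on a per-track property such as the action ID (a hedged illustration, not Celeritas's actual sorting code):

```cuda
#include <thrust/sort.h>
#include <thrust/system/cuda/execution_policy.h>

// Illustrative track reordering: sort slot indices by a per-track key
// (e.g. action or particle type) so threads in a warp process similar
// tracks and touch nearby memory. The key choice and how often to sort
// are the compromises discussed above.
void sort_track_slots(unsigned* d_keys, unsigned* d_slots, int n,
                      cudaStream_t stream) {
    thrust::sort_by_key(thrust::cuda::par.on(stream),
                        d_keys, d_keys + n,  // e.g. per-track action id
                        d_slots);            // permuted track slot indices
}
```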

Grace-Hopper

Grace Hopper

  • Grace Hopper GPUs are now being delivered and installed
    • Including at least one Tier 2 HPC
  • Don't expect significant benefits compared to the SXM H100
    • Celeritas performs limited host-device communication, so the faster CPU-GPU link should matter little

Thank you

Additional Slides

CUDA Graph API

  • The CUDA Graph API is an alternative way of launching work
  • Defines a DAG of kernels and their dependencies
  • Replays sequences of kernels with less CPU overhead
  • Can capture the imperative launch sequence, then replay it
cudaStreamBeginCapture(...);
// normal invocation of kernels
my_kernel_a<<<...>>>(...);
my_kernel_a<<<...>>>(...);
// ...
cudaStreamEndCapture(...);
cudaGraphInstantiate(...);
// ...
// Replay the sequence of kernels
cudaGraphLaunch(instance, stream);

CUDA Graph Capture costs

Timeline showing the first call which includes stream capture takes longer

along-step-uniform-msc::propagate

ncu Source-page counts for a shorter invocation of the same along-step-uniform-msc::propagate kernel. Unfortunately the profile used for this talk was incomplete, as it was captured without lineinfo