Progress report: Profiling Celeritas on Nvidia GPUs

Peter Heywood, Research Software Engineer

The University of Sheffield

2024-01-23

Recap

June 2023 Workshop

Celeritas

Celeritas is a new Monte Carlo transport code designed for high-performance simulation of high-energy physics detectors.

The Celeritas project implements HEP detector physics on GPU accelerator hardware with the ultimate goal of supporting the massive computational requirements of the HL-LHC upgrade.

GPUs

  • Development machine:
    • NVIDIA Titan V (SM 70, 250W)
    • Intel i7-6850K
  • TUoS Stanage HPC (Tier 3):
    • NVIDIA H100 PCI-e (SM 90, 350W)
    • NVIDIA A100 SXM4 (SM 80, 500W)
    • AMD EPYC 7413
    • Exclusive reservation to use ncu

Titan Xp & Titan V GPUs

NVTX Annotations

Timeline without NVTX

Timeline showing a 60 ms portion of TestEM3 without CELER_ENABLE_PROFILING

CELER_ENABLE_PROFILING

Timeline showing a 60 ms portion of TestEM3 with CELER_ENABLE_PROFILING=1 in PR #827

Nsight use of NVTX

  • Can instruct nsys to profile only selected regions
nsys profile -c nvtx -p celer-sim@celeritas ... 
  • Can profile only specific kernels in ncu, by NVTX range
    • Requires a recent enough ncu
ncu --set=full --nvtx --nvtx-include "celeritas@celer-sim/step/*" \
    --launch-skip 1000 --launch-count 50 \
    -k regex:"launch_action_impl|_kernel_agent" \
    ...
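The NVTX ranges enabled by CELER_ENABLE_PROFILING can be reproduced in user code with the NVTX v3 C API. A minimal RAII sketch of such an annotation (the class name and usage are illustrative, not Celeritas's actual ScopedProfiling implementation):

```cuda
#include <nvtx3/nvToolsExt.h>  // NVTX v3 header-only API

// Illustrative RAII wrapper: push a named range on construction,
// pop it on destruction, so the region appears in nsys/ncu timelines.
class ScopedNvtxRange {
  public:
    explicit ScopedNvtxRange(char const* name) { nvtxRangePushA(name); }
    ~ScopedNvtxRange() { nvtxRangePop(); }
};

void step() {
    ScopedNvtxRange range{"step"};
    // ... kernel launches here show up nested under the "step" range
}
```

Ranges annotated this way are what the `--nvtx-include` and `-c nvtx` filters above match against.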

Current Optimisation Candidates

1: Reduce Allocations/Deallocations

Allocations and Deallocations

  • Allocations of GPU memory are relatively expensive
    • Though an individual allocation is cheap compared to a typical kernel
  • Celeritas is an iterative process
    • Time spent on allocations/deallocations inside the loop adds up
    • Thrust methods called every step allocate and free temporary memory


cudaMalloc(void** devPtr, size_t size)

cudaFree(void* devPtr)
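A common mitigation for per-step allocation cost is a grow-only scratch buffer that calls cudaMalloc/cudaFree only when a new high-water mark is reached, so steady-state steps allocate nothing. A hedged sketch (the struct and its names are illustrative, not Celeritas code):

```cuda
#include <cuda_runtime.h>
#include <cstddef>

// Illustrative grow-only device scratch buffer: reallocate only when the
// requested size exceeds the current capacity, so repeated steps of the
// same size make no cudaMalloc/cudaFree calls at all.
struct ScratchBuffer {
    void*  ptr      = nullptr;
    size_t capacity = 0;

    void* require(size_t bytes) {
        if (bytes > capacity) {
            cudaFree(ptr);            // no-op when ptr is nullptr
            cudaMalloc(&ptr, bytes);  // grow to the new high-water mark
            capacity = bytes;
        }
        return ptr;
    }
    ~ScratchBuffer() { cudaFree(ptr); }
};
```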

Per-step Allocation and Deallocations

Nsys timeline showing the per-step allocations and deallocations within Thrust library calls

CCCL: Thrust, CUB & libcu++

  • CCCL (CUDA C++ Core Libraries) - github.com/NVIDIA/cccl

    • Thrust

    Thrust allows you to implement high performance parallel applications with minimal programming effort through a high-level interface

    • CUB

    CUB provides state-of-the-art, reusable software components for every layer of the CUDA programming model:

    • libcu++

    The C++ Standard Library for Your Entire System

remove_if_alive

size_type remove_if_alive(
    StateCollection<TrackSlotId, Ownership::reference, MemSpace::device> const&
        vacancies,
    StreamId stream_id)
{
    ScopedProfiling profile_this{"remove-if-alive"};
    auto start = device_pointer_cast(vacancies.data());
    auto end = thrust::remove_if(thrust_execute_on(stream_id),
                                 start,
                                 start + vacancies.size(),
                                 IsEqual{occupied()});
    CELER_DEVICE_CHECK_ERROR();
    // New size of the vacancy vector
    return end - start;
}

remove_if_alive

Timeline view of remove_if_alive

exclusive_scan_counts

size_type exclusive_scan_counts(
    StateCollection<size_type, Ownership::reference, MemSpace::device> const&
        counts,
    StreamId stream_id)
{
    ScopedProfiling profile_this{"exclusive-scan-counts"};
    // Exclusive scan:
    auto data = device_pointer_cast(counts.data());
    auto stop = thrust::exclusive_scan(thrust_execute_on(stream_id),
                                       data,
                                       data + counts.size(),
                                       data,
                                       size_type(0));
    CELER_DEVICE_CHECK_ERROR();
    // Copy the last element (accumulated total) back to host
    return *(stop - 1);
}

exclusive_scan_counts

Timeline view of exclusive_scan_counts

Thrust -> CUB

  • Replace thrust::remove_if with cub::DeviceSelect::If
    • Reallocate only when the required temporary memory grows
    • The benefit of this is mostly masked when cudaMallocAsync is in use
  • Replace thrust::exclusive_scan with cub::DeviceScan::ExclusiveScan
    • Reallocate only when the required temporary memory grows
  • Alternatively, a custom Thrust allocator could be used
    • By extending thrust::cuda::allocator
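CUB's device-wide algorithms use a two-phase calling convention that makes the temporary allocation explicit, so the buffer can be persisted and grown only when needed. A hedged sketch of the scan replacement (the wrapper function and its parameters are illustrative, not Celeritas's actual code):

```cuda
#include <cub/device/device_scan.cuh>
#include <cuda_runtime.h>

// Illustrative two-phase CUB call: the first ExclusiveScan call (with a
// null temp pointer) only reports the required temporary-storage size; the
// second performs the in-place scan. Keeping d_temp alive between steps
// replaces the per-step malloc/free pair Thrust made internally.
void scan_counts(unsigned* d_counts, int n, cudaStream_t stream,
                 void*& d_temp, size_t& temp_capacity) {
    size_t temp_bytes = 0;
    // Phase 1: query required temporary storage
    cub::DeviceScan::ExclusiveScan(nullptr, temp_bytes, d_counts, d_counts,
                                   cub::Sum{}, 0u, n, stream);
    if (temp_bytes > temp_capacity) {
        cudaFree(d_temp);
        cudaMalloc(&d_temp, temp_bytes);
        temp_capacity = temp_bytes;
    }
    // Phase 2: run the in-place exclusive scan
    cub::DeviceScan::ExclusiveScan(d_temp, temp_bytes, d_counts, d_counts,
                                   cub::Sum{}, 0u, n, stream);
}
```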

Thrust -> CUB

  • Small absolute performance improvement.
    • Replaces N allocations with between 1 and N
    • Replaces N deallocations with 1
    • ~10s of lines of code
    • For TestEM3 13 TeV
      • ~10 µs per step improvement
      • < 0.2% improvement

2: CUDA Graph API

Kernel Launch Overheads

Launch overhead costs in smaller models (Simple CMS)

Kernel Launch Overheads

Launch overhead costs in smaller models (Simple CMS)

Kernel Launch Overheads

  • Overhead costs associated with submitting work to the GPU
    • I.e. gaps between kernels
  • Order of 1-10 µs per kernel launch
  • When many short running kernels are launched, this adds up.
  • 6087 kernel launches in 438ms of TestEM3 with a smaller input file.

CUDA Graph API

CUDA Graphs have been designed to allow work to be defined as graphs rather than single operations.

providing a mechanism to launch multiple GPU operations through a single CPU operation, and hence reduce overheads

developer.nvidia.com/blog/cuda-graphs

Simple example of Short Kernels

Simple kernel, ~6µs per call

#define N 200000

__global__ void shortKernel(float * out_d, float * in_d){
  int idx=blockIdx.x*blockDim.x+threadIdx.x;
  if(idx<N) out_d[idx]=1.23*in_d[idx];
}


  #define NSTEP 1000
  #define NKERNEL 20

  for(int istep=0; istep<NSTEP; istep++){
    for(int ikrnl=0; ikrnl<NKERNEL; ikrnl++){
      shortKernel<<<blocks, threads, 0, stream>>>(out_d, in_d);
      cudaStreamSynchronize(stream);
    }
  }

Short Kernels with stream sync

Example of many short kernel launches with a stream sync

Short Kernels with one sync per iteration

Example of many short kernel launches without stream synchronisation

Short Kernels using the CUDA Graph API

Example of many short kernel launches using a CUDA Graph

Short Kernels using the CUDA Graph API

  bool graphCreated=false;
  cudaGraph_t graph;
  cudaGraphExec_t instance;
  for(int istep=0; istep<NSTEP; istep++){
    if(!graphCreated){
      cudaStreamBeginCapture(stream, cudaStreamCaptureModeGlobal);
      for(int ikrnl=0; ikrnl<NKERNEL; ikrnl++){
        shortKernel<<<blocks, threads, 0, stream>>>(out_d, in_d);
      }
      cudaStreamEndCapture(stream, &graph);
      cudaGraphInstantiate(&instance, graph, NULL, NULL, 0);
      graphCreated=true;
    }
    cudaGraphLaunch(instance, stream);
    cudaStreamSynchronize(stream);
  }

CUDA Graph Performance

  • NVIDIA 3060ti
  • CUDA 12.3
  • 1000 “iterations” of 20 kernel launches
  • ~5-7µs per kernel without overheads
Method        Elapsed Time (ms)   Per Kernel w/ Overheads (µs)
Synchronized  231                 11.55
Streams       141                 7.05
Graph         124                 6.20

Limitations

  • Absolute performance improvement will be small
    • More significant for simple geometries / fewer tracks
    • More improvement for more events & steps
  • May require:
    • Significant changes to host code and kernels
    • Multiple graphs per step
    • CUDA graph features from recent CUDA versions
      • May not yet be supported by ROCm (AMD)

3: Memory Access Pattern

Longest running kernels for TestEM3

Longest duration kernel in TestEM3 with an increased number of tracks
along-step-uniform-msc::propagate
IsAlongStepActionEqual with UniformFieldPropagatorFactory
void celeritas::detail::<unnamed>::launch_action_impl<
  celeritas::ConditionalTrackExecutor<celeritas::detail::IsAlongStepActionEqual,
  celeritas::detail::PropagationApplier<celeritas::detail::UniformFieldPropagatorFactory, void>>,
  celeritas::detail::PropagationApplier<celeritas::detail::UniformFieldPropagatorFactory, void>,
  (bool)>(celeritas::Range<celeritas::OpaqueId<celeritas::Thread_, unsigned int>>, T1)

along-step-uniform-msc::propagate

Nsight Compute “Speed-Of-Light” for a shorter invocation of the same along-step-uniform-msc::propagate kernel

along-step-uniform-msc::propagate

Nsight Compute Memory Analysis for a shorter invocation of the same along-step-uniform-msc::propagate kernel

along-step-uniform-msc::propagate

Nsight Compute Memory Analysis for a shorter invocation of the same along-step-uniform-msc::propagate kernel

GPU memory access pattern

  • GPUs perform best when memory accesses are coalesced
    • Neighbouring threads read neighbouring elements of memory
  • Reads from global memory are serviced 128 bytes at a time
    • So for FP64 data the best case is 2 transactions per warp-wide request
    • Worst case: one transaction per thread, i.e. 32 per warp
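The difference can be illustrated with a structure-of-arrays versus array-of-structures read (the layout and names here are illustrative):

```cuda
// Illustrative coalesced vs scattered access. With the SoA layout, thread i
// reads x[i], so a warp touches only a couple of 128-byte lines per request;
// with the AoS layout the same logical read is strided by the struct size
// and needs many more transactions.
struct TrackAoS { double x, y, z, energy; };  // hypothetical 32-byte layout

__global__ void read_soa(double const* x, double* out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = x[i];  // neighbouring threads -> neighbouring doubles
}

__global__ void read_aos(TrackAoS const* tracks, double* out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = tracks[i].x;  // 32-byte stride between threads
}
```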

Coalesced vs Scattered Access Pattern

Sorting Tracks in Celeritas

  • Sort tracks so neighbouring threads access neighbouring memory
  • But sorting is expensive - the cost may outweigh the benefits
  • Celeritas developers have considered this

    Sorting consistently improves the performance for the cms2018 problem but decreases the performance with the tilecal.

    … the radix sort kernel is a significant overhead in that case …

  • Can find a compromise
    • Sort infrequently
    • Different sort order
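One way to sketch such a reordering is a key-value sort of track slot indices, keyed on a per-track property such as the action ID (a hedged illustration, not Celeritas's actual sorting code):

```cuda
#include <thrust/sort.h>
#include <thrust/system/cuda/execution_policy.h>

// Illustrative track reordering: sort slot indices by a per-track key
// (e.g. action or particle type) so threads in a warp process similar
// tracks and touch nearby memory. The key choice and how often to sort
// are the compromises discussed above.
void sort_track_slots(unsigned* d_keys, unsigned* d_slots, int n,
                      cudaStream_t stream) {
    thrust::sort_by_key(thrust::cuda::par.on(stream),
                        d_keys, d_keys + n,  // e.g. per-track action id
                        d_slots);            // permuted track slot indices
}
```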

Grace-Hopper

Grace Hopper

  • Grace Hopper GPUs are now being delivered and installed
    • Including at least one Tier 2 HPC
  • Don't expect significant benefits compared to the SXM H100
    • Celeritas performs limited host-device communication, so the faster CPU-GPU link should matter little

Thank you

Additional Slides

CUDA Graph API

  • The CUDA Graph API is an alternative way of launching work
  • Defines a DAG of kernels and their dependencies
  • Replays sequences of kernels with less CPU overhead
  • Can capture the imperative launch sequence, then replay it
cudaStreamBeginCapture(...);
// normal invocation of kernels
my_kernel_a<<<...>>>(...);
my_kernel_a<<<...>>>(...);
// ...
cudaStreamEndCapture(...);
cudaGraphInstantiate(...);
// ...
// Replay the sequence of kernels
cudaGraphLaunch(instance, stream);

CUDA Graph Capture costs

Timeline showing the first call which includes stream capture takes longer

along-step-uniform-msc::propagate

ncu Source-page counts for a shorter invocation of the same along-step-uniform-msc::propagate kernel. Unfortunately the profile used for this talk was incomplete, as it was captured without lineinfo