GPU Profiling with Celeritas

Peter Heywood, Research Software Engineer

The University of Sheffield

2023-06-22

Increase Science Throughput

  • Ever-increasing demand for simulation throughput
  1. Buy more / “better” hardware
  2. Improve Software
    • Improve implementations
    • Improve algorithms (i.e. work efficiency)

Must understand software performance in order to improve it

Profile

Celeritas

Celeritas is a new Monte Carlo transport code designed for high-performance simulation of high-energy physics detectors.

The Celeritas project implements HEP detector physics on GPU accelerator hardware with the ultimate goal of supporting the massive computational requirements of the HL-LHC upgrade.

Profiling Tools

  • CPU-only profilers
    • gprof, perf, Kcachegrind, VTune, …
  • AMD Profiling tools
    • roctracer
    • rocsys
    • rocprofv2

Hardware

  • Development machine:
    • NVIDIA Titan V (SM 70, 250W)
    • NVIDIA Titan RTX (SM 75, 280W)
      • 16x fewer FP64 units
    • Intel i7-6850K
  • HPC:
    • NVIDIA H100 PCI-e (SM 90, 350W)
    • AMD EPYC 7413

Titan Xp & Titan V GPUs

Inputs / Configuration

  • Inputs should ideally be:
    • Representative of real-world use
    • Large enough to fully utilise hardware
    • Small enough to generate usable profile data
  • Optimised build
    • -DCMAKE_BUILD_TYPE=Release, -O3
    • -DCMAKE_BUILD_TYPE=RelWithDebInfo, -O2 -g
  • Celeritas c8db3fce, v0.3.0
  • MPI disabled via export CELER_DISABLE_PARALLEL=1

Profile Scenario: Simple CMS

  • celer-sim test case
  • Simple geometry
  • Short running
  • simple-cms.gdml
  • gamma-3evt-15prim.hepmc3
  • ctest -R app/celer-sim:device

github.com/celeritas-project/benchmarks/geant4-validation-app/simple_cms_evd.png

Profile Scenario: TestEm3

  • celer-g4
  • testem3-flat.gdml
  • testem3.1k.hepmc3
/control/verbose 0
/tracking/verbose 0
/run/verbose 0
/event/verbose 0

/celer/outputFile testem3-1k.out.json
/celer/maxNumTracks 524288
/celer/maxNumEvents 2048
/celer/maxInitializers 4194304
/celer/secondaryStackFactor 3

/celerg4/geometryFile /celeritas/test/celeritas/data/testem3-flat.gdml
/celerg4/eventFile /benchmarks/testem3.1k.hepmc3

github.com/celeritas-project/benchmarks/geant4-validation-app/testem3_evd.png

Nsight Systems

Nsight Systems

  • System-wide performance analysis
  • CPU + GPU
  • Visualise a timeline of events
  • CUDA API information, kernel block sizes, etc.
  • Pascal GPUs or newer (SM 60+)
nsys profile -o timeline ./celer-g4 input.mac
nsys-ui timeline.nsys-rep
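
For longer runs the full timeline quickly becomes unwieldy. One common option, sketched below with a hypothetical step_kernel rather than Celeritas code, is to bracket the region of interest with the CUDA profiler API and pass --capture-range=cudaProfilerApi to nsys so that only that region is captured.

#include <cuda_profiler_api.h>
#include <cuda_runtime.h>

__global__ void step_kernel(float* data, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        data[i] += 1.0f;
    }
}

int main() {
    int const n = 1 << 20;
    float* data = nullptr;
    cudaMalloc(&data, n * sizeof(float));

    cudaProfilerStart();  // nsys begins capturing here
    for (int step = 0; step < 100; ++step) {
        step_kernel<<<(n + 255) / 256, 256>>>(data, n);
    }
    cudaDeviceSynchronize();
    cudaProfilerStop();   // nsys stops capturing here

    cudaFree(data);
    return 0;
}

The behaviour when the bracketed range ends differs between nsys versions; check nsys profile --help for the relevant capture-range options.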

Nsight Systems: Simple CMS

Timeline: Simple CMS (Titan V)

Timeline view for Simple CMS

Timeline: Simple CMS (Titan V)

Timeline view for Simple CMS

Timeline: Simple CMS (Titan V)

Timeline view for Simple CMS

Nsys Table: Simple CMS (Titan V)

  • Longest kernel: 88 µs
  • Launch latency: 5.2 µs
  • Threads: 16 × 256 (see the scale estimate below)
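
For scale, assuming the Titan V's 80 SMs and a maximum of 2048 resident threads per SM:

16 × 256 = 4,096 threads launched
4,096 / (80 × 2,048) ≈ 2.5% of the GPU's resident-thread capacity

Launches of this size cannot fully utilise the device, and fixed costs such as the 5.2 µs launch latency are no longer negligible next to an 88 µs kernel.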

Nsight Systems: TestEm3

Timeline: TestEm3 (Titan V)

Timeline view for TestEm3

Timeline: TestEm3 (Titan V) Profiling Overheads

Timeline view for TestEm3

Timeline: TestEm3 (Titan V) GPU Init

Timeline view for TestEm3

Timeline: TestEm3 (Titan V) GPU region

Timeline view for TestEm3

Timeline: TestEm3 (Titan V) GPU region

Timeline view for TestEm3

Timeline: TestEm3 (Titan V) GPU region

Timeline view for TestEm3

Timeline: TestEm3 (Titan V) GPU region

Timeline view for TestEm3

Timeline: H100, Titan V, Titan RTX

Timeline view for Simple CMS

Code Annotation

Code Annotation


#include <chrono>
#include <thread>

void some_function() {
    for (int i = 0; i < 6; ++i) {
        std::this_thread::sleep_for(std::chrono::milliseconds{100});
    }
}

Example without NVTX annotation, from FLAME GPU

Code Annotation

#include <chrono>
#include <thread>
#include <nvtx3/nvToolsExt.h>

void some_function() {
    nvtxRangePush(__FUNCTION__);     // open a range named after the function
    for (int i = 0; i < 6; ++i) {
        nvtxRangePush("inner");      // nested range for each iteration
        std::this_thread::sleep_for(std::chrono::milliseconds{100});
        nvtxRangePop();
    }
    nvtxRangePop();
}

Example with NVTX annotation, from FLAME GPU
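
The push/pop calls must stay balanced on every code path (early returns, exceptions). The NVTX v3 C++ API offers RAII ranges that pop automatically; a minimal sketch, assuming the nvtx3/nvtx3.hpp C++ header (from the NVTX repository, bundled with recent CUDA toolkits) is available:

#include <chrono>
#include <thread>
#include <nvtx3/nvtx3.hpp>

void some_function() {
    NVTX3_FUNC_RANGE();  // range named after the function, closed on return
    for (int i = 0; i < 6; ++i) {
        nvtx3::scoped_range inner{"inner"};  // closed at the end of each iteration
        std::this_thread::sleep_for(std::chrono::milliseconds{100});
    }
}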

Nsight Compute

Nsight Compute

  • Detailed GPU performance metrics
  • Compile with -lineinfo for line-level profiling
  • Capture full metrics via --set=full
  • Replays GPU kernels many times, significantly increasing runtime
  • Reduce the number of captured kernels via filtering (-s, -c, etc.)
  • SM 70+ (Volta)
# All metrics, skip 64 kernels, capture 128.
ncu --set=full -s 64 -c 128 -o metrics celer-g4 input.mac
ncu-ui metrics.ncu-rep
  • May require --target-processes
  • NVIDIA profiler counters require root access, or disabling a security mitigation, since driver 418.43 (2019-02-22). See ERR_NVGPUCTRPERM.

Nsight Compute: TestEm3 (Titan V)

TestEm3 kernel: Summary

Nsight Compute summary for the 2719th kernel invocation (Titan V)

TestEm3 2685th kernel: Speed of Light

Nsight Compute Speed of Light section for the 2685th (slow) kernel invocation

TestEm3 2719th kernel: Speed of Light

Nsight Compute Speed of Light section for the 2719th kernel invocation

TestEm3 2719th kernel: Roofline

Nsight Compute Roofline section for the 2719th kernel invocation

TestEm3 2719th kernel: Compute

Nsight Compute Compute section for the 2719th kernel invocation

TestEm3 2719th kernel: Memory

Nsight Compute Memory section for the 2719th kernel invocation

TestEm3 2719th kernel: Warp State

Nsight Compute Warp State section for the 2719th kernel invocation

TestEm3 2719th kernel: Occupancy

Nsight Compute Occupancy section for the 2719th kernel invocation

TestEm3 2719th kernel: Source Counters

Nsight Compute Source Counters section for the 2719th kernel invocation

  • Must be compiled with -lineinfo

TestEm3 2719th kernel: Source

Nsight Compute Source section for the 2719th kernel invocation

  • Must be compiled with -lineinfo

Profile your code

Additional Slides

TestEm3 H100

TestEm3 Titan V

TestEm3 Titan RTX

TestEm3 Titan RTX 2719th kernel: Speed of Light

Nsight Compute Speed of Light section for the 2719th kernel invocation (Titan RTX)

Spack Compute Capability

  • Spack only accepts a single cuda_arch value
  • Requires a full dependency rebuild (~90 mins)
Arch     Variants
Volta    +cuda cuda_arch=70 cxxstd=17
Ampere   +cuda cuda_arch=80 cxxstd=17
Hopper   +cuda cuda_arch=90 cxxstd=17

Dockerfile nsys

nvidia/cuda:11.8.0-devel-ubuntu22.04 does not include nsys

Nsys 2022.4.2 (CUDA 11.8.0):

# Install nsys for profiling. ncu is included
RUN if [ "$DOCKERFILE_DISTRO" = "ubuntu" ] ; then \
  apt-get -yqq update \
  && apt-get -yqq install --no-install-recommends nsight-systems-2022.4.2 \
  && rm -rf /var/lib/apt/lists/* ; \ 
fi
  • Note: nsys and ncu will be removed from the -ci containers, which are kept smaller for bandwidth reasons.

Docker to Apptainer / Singularity

  • apptainer/singularity build can convert Docker images to Apptainer images
  • from a registry via apptainer build img.sif docker://registry/image:tag
  • locally via the Docker daemon: apptainer build img.sif docker-daemon:registry/image:tag
  • locally via Docker archive files
  • https://apptainer.org/docs/user/main/docker_and_oci.html
# Build the appropriate container 
cd celeritas/scripts/Docker
# Build the cuda Docker container, sm_70. Wait ~90 minutes.
./build.sh cuda
# If the image hasn't been pushed to a registry, apptainer requires a local path, so save the image
rm -f image.tar && docker save $(docker images --format="{{.Repository}} {{.ID}}" | grep "celeritas/dev-jammy-cuda11" | sort -rk 2 | awk 'NR==1{print $2}') -o image.tar
# Convert to an apptainer container in the working dir
apptainer build -F celeritas-dev-jammy-cuda11.sif docker-archive:image.tar

Docker to Apptainer / Singularity

  • Docker and Apptainer have different defaults when executing images
    • Default directory bindings
    • Environment variable mapping
    • In-container user
    • Entrypoints
  • Likely need to run with various flags to achieve similar behaviour

Docker to Apptainer / Singularity

# ephemeral, does not bind home dir by default
docker run --rm -ti --gpus all -v .:/src celeritas/dev-jammy-cuda11:2023-06-19
# apptainer, runs as the current user, with the calling user's env vars and default bindings
apptainer run --nv --bind ./:/celeritas-project celeritas-dev-jammy-cuda11-2023-06-19.sif 
# TUoS HPC - this is not perfect
apptainer run --nv --bind ./:/celeritas-project /mnt/parscratch/users/ac1phey/celeritas-dev-jammy-cuda11-sm90.sif

lineinfo

Add -lineinfo to the CUDA compiler flags:

mkdir build-lineinfo && cd build-lineinfo
cmake .. -DCMAKE_CUDA_ARCHITECTURES=70 -DCMAKE_BUILD_TYPE=Release -DCMAKE_CUDA_FLAGS_RELEASE="-O3 -DNDEBUG -lineinfo" -DCELERITAS_DEBUG=OFF 
cmake --build . -j `nproc`