GPU Profiling with Celeritas

Peter Heywood, Research Software Engineer

The University of Sheffield

2023-06-22

Increase Science Throughput

Ever-increasing demand for simulation throughput

Buy more / “better” hardware
Improve Software
- Improve implementations
- Improve algorithms (i.e. work efficiency)

Must understand software performance to improve performance

Profile

Celeritas

Celeritas is a new Monte Carlo transport code designed for high-performance simulation of high-energy physics detectors.

The Celeritas project implements HEP detector physics on GPU accelerator hardware with the ultimate goal of supporting the massive computational requirements of the HL-LHC upgrade.

github.com/celeritas-project/celeritas
NVIDIA GPUs via CUDA
AMD GPUs via HIP
Ben Morgan - “Detector Simulations in Particle Physics”

Profiling Tools

CPU-only profilers
- gprof, perf, KCachegrind, VTune, …

NVIDIA Profiling tools
- Nsight Systems
- NVIDIA Nsight Compute
- nvprof

AMD Profiling tools
- roctracer
- rocsys
- rocprofv2

Hardware

Development machine:
- NVIDIA Titan V (SM 70, 250W)
- NVIDIA Titan RTX (SM 75, 280W)
  - 16x fewer FP64 units
- Intel i7-6850K
HPC:
- NVIDIA H100 PCIe (SM 90, 350W)
- AMD EPYC 7413

Inputs / Configuration

Inputs should ideally be:
- Representative of real-world use
- Large enough to fully utilise hardware
- Small enough to generate usable profile data
Optimised build
- -DCMAKE_BUILD_TYPE=Release, -O3
- -DCMAKE_BUILD_TYPE=RelWithDebInfo, -O2 -g
Celeritas c8db3fce, v0.3.0
MPI disabled via export CELER_DISABLE_PARALLEL=1

Profile Scenario: Simple CMS

celer-sim test case
Simple geometry
Short running
simple-cms.gdml
gamma-3evt-15prim.hepmc3
ctest -R app/celer-sim:device

github.com/celeritas-project/benchmarks/geant4-validation-app/testem3_evd.png

Profile Scenario: TestEm3

celer-g4
testem3-flat.gdml
testem3.1k.hepmc3

/control/verbose 0
/tracking/verbose 0
/run/verbose 0
/event/verbose 0

/celer/outputFile testem3-1k.out.json
/celer/maxNumTracks 524288
/celer/maxNumEvents 2048
/celer/maxInitializers 4194304
/celer/secondaryStackFactor 3

/celerg4/geometryFile /celeritas/test/celeritas/data/testem3-flat.gdml
/celerg4/eventFile /benchmarks/testem3.1k.hepmc3

github.com/celeritas-project/benchmarks/geant4-validation-app/simple_cms_evd.png

Nsight Systems

System-wide performance analysis
CPU + GPU
Visualise a timeline of events
CUDA API information, kernel block sizes, etc.
Pascal GPUs or newer (SM 60+)

nsys profile -o timeline ./celer-g4 input.mac
nsys-ui timeline.nsys-rep

Nsight Systems: Simple CMS

Timeline: Simple CMS (Titan V)

Timeline view for Simple CMS

Nsys Table: Simple CMS (Titan V)

Longest kernel: 88 µs
Launch latency: 5.2 µs
Threads: 16 * 256
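
To put the table figures in context, a rough back-of-the-envelope check can be sketched as below. The Titan V limits used here (80 SMs, up to 2048 resident threads per SM) are not from the slides; they are assumptions based on NVIDIA's published Volta specifications. Reading "16 * 256" as 16 blocks of 256 threads is also an assumption, and all names are hypothetical.

```cpp
#include <cassert>
#include <cstdio>

// Figures reported in the Nsight Systems table above.
constexpr double longest_kernel_us = 88.0;
constexpr double launch_latency_us = 5.2;
// ASSUMPTION: "16 * 256" is read as 16 blocks of 256 threads each.
constexpr int assumed_blocks = 16;
constexpr int assumed_threads_per_block = 256;

// ASSUMPTION (not from the slides): a Titan V has 80 SMs, each able to
// hold up to 2048 resident threads.
constexpr int assumed_sm_count = 80;
constexpr int assumed_max_threads_per_sm = 2048;

// Total threads launched by one kernel.
constexpr int launched_threads() {
    return assumed_blocks * assumed_threads_per_block;
}

// Launch latency as a fraction of the longest kernel's duration.
constexpr double latency_to_kernel_ratio() {
    return launch_latency_us / longest_kernel_us;
}

// Fraction of the GPU's resident-thread capacity used by one launch.
constexpr double resident_thread_fraction() {
    return static_cast<double>(launched_threads())
           / (assumed_sm_count * assumed_max_threads_per_sm);
}

// Print the three derived quantities.
void print_summary() {
    assert(launched_threads() > 0);
    std::printf("threads per launch:        %d\n", launched_threads());
    std::printf("latency / longest kernel:  %.3f\n", latency_to_kernel_ratio());
    std::printf("resident thread fraction:  %.3f\n", resident_thread_fraction());
}
```

Under these assumptions each launch uses 4096 threads, a small fraction of what the device could hold, and the launch latency is a noticeable share of even the longest kernel. That is consistent with treating Simple CMS as a short-running test case rather than a workload that fills the GPU.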

Nsight Systems: TestEm3

Timeline: TestEm3 (Titan V)

Timeline view for TestEm3

Timeline: TestEm3 (Titan V) Profiling Overheads

Timeline view for TestEm3

Timeline: TestEm3 (Titan V) GPU Init

Timeline view for TestEm3

Timeline: TestEm3 (Titan V) GPU region

Timeline view for TestEm3

Timeline: H100, Titan V, Titan RTX

Timeline view for simple cms

Code Annotation

NVIDIA Tools Extension (NVTX)
AMD ROCTX


#include <chrono>
#include <thread>

void some_function() {
    for (int i = 0; i < 6; ++i) {
        // Stand-in for real work
        std::this_thread::sleep_for(std::chrono::milliseconds{100});
    }
}

The same function with NVTX ranges added:

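#include <chrono>
#include <thread>
#include <nvtx3/nvToolsExt.h>

void some_function() {
    // Outer range named after the enclosing function
    nvtxRangePush(__FUNCTION__);
    for (int i = 0; i < 6; ++i) {
        // One nested range per loop iteration
        nvtxRangePush("inner");
        std::this_thread::sleep_for(std::chrono::milliseconds{100});
        nvtxRangePop();
    }
    nvtxRangePop();
}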

Example with NVTX annotation, from FLAME GPU
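
Manual push/pop pairs are easy to unbalance when a function returns early or throws. One common way to avoid that is an RAII guard that pushes in its constructor and pops in its destructor. The sketch below is not taken from the slides or from FLAME GPU: `ScopedRange` and `RecordingBackend` are hypothetical names, and the backend only records events so the example runs without the CUDA toolkit. In a real build its two functions would presumably forward to nvtxRangePush and nvtxRangePop.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Hypothetical stand-in backend: records events instead of calling NVTX.
// ASSUMPTION: a real backend would call nvtxRangePush(name) in push()
// and nvtxRangePop() in pop().
struct RecordingBackend {
    static std::vector<std::string>& log() {
        static std::vector<std::string> events;
        return events;
    }
    static void push(const char* name) {
        log().push_back(std::string("push:") + name);
    }
    static void pop() { log().push_back("pop"); }
};

// Hypothetical RAII guard: opens a range on construction and closes it
// on destruction, so every push has a matching pop.
template <typename Backend>
class ScopedRange {
public:
    explicit ScopedRange(const char* name) { Backend::push(name); }
    ~ScopedRange() { Backend::pop(); }
    ScopedRange(const ScopedRange&) = delete;
    ScopedRange& operator=(const ScopedRange&) = delete;
};

// Same shape as some_function() above, with the sleep omitted.
void annotated_function() {
    ScopedRange<RecordingBackend> outer(__func__);
    for (int i = 0; i < 6; ++i) {
        ScopedRange<RecordingBackend> inner("inner");
        assert(!RecordingBackend::log().empty());  // work would go here
    }
}
```

Calling annotated_function() once records one outer range containing six nested "inner" ranges, mirroring the nesting that the explicit push/pop version produces on the timeline.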

Nsight Compute

Detailed GPU performance metrics
Compile with -lineinfo for line-level profiling
Capture full metrics via --set=full
Replays GPU kernels many times - significant runtime increase
Reduce captured kernels via filtering, -s, -c etc.
Volta GPUs or newer (SM 70+)

# All metrics, skip 64 kernels, capture 128.
ncu --set=full -s 64 -c 128 -o metrics celer-g4 input.mac
ncu-ui metrics.ncu-rep

May require --target-processes all to profile child processes
Since driver 418.43 (2019-02-22), access to NVIDIA GPU performance counters requires root privileges or disabling the security mitigation. See ERR_NVGPUCTRPERM.
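
The effect of the -s and -c options in the command above can be sketched as a simple window over kernel launches. This is an illustrative model only, assuming launches are counted from zero in launch order and that no other filters are active; the helper names are hypothetical and not part of any NVIDIA tool.

```cpp
#include <cassert>

// Hypothetical helper: true if the launch with this zero-based index
// falls inside the window selected by "-s skip -c count".
constexpr bool is_profiled(long launch_index, long skip, long count) {
    return launch_index >= skip && launch_index < skip + count;
}

// Count how many of the first total_launches launches are profiled.
long count_profiled(long total_launches, long skip, long count) {
    assert(total_launches >= 0 && skip >= 0 && count >= 0);
    long profiled = 0;
    for (long i = 0; i < total_launches; ++i) {
        if (is_profiled(i, skip, count)) {
            ++profiled;
        }
    }
    return profiled;
}

// With -s 64 -c 128, launches 64..191 are inside the window.
static_assert(!is_profiled(63, 64, 128), "launch 63 is skipped");
static_assert(is_profiled(64, 64, 128), "launch 64 is the first profiled");
static_assert(is_profiled(191, 64, 128), "launch 191 is the last profiled");
static_assert(!is_profiled(192, 64, 128), "launch 192 is past the window");
```

Because every profiled kernel is replayed many times, keeping the window small keeps the run time and the report size manageable.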

Nsight Compute: TestEm3 (Titan V)

TestEm3 kernel: Summary

Nsight Compute titanv for 2719th kernel invocation

TestEm3 2685th kernel: Speed of Light

Nsight Compute slowkernel for 2719th kernel invocation

TestEm3 2719th kernel: Speed of Light

Nsight Compute speed for 2719th kernel invocation

TestEm3 2719th kernel: Roofline

Nsight Compute roofline for 2719th kernel invocation

TestEm3 2719th kernel: Compute

Nsight Compute compute for 2719th kernel invocation

TestEm3 2719th kernel: Memory

Nsight Compute memory for 2719th kernel invocation

TestEm3 2719th kernel: Warp State

Nsight Compute warpstate for 2719th kernel invocation

TestEm3 2719th kernel: Occupancy

Nsight Compute occupancy for 2719th kernel invocation

TestEm3 2719th kernel: Source Counters

Nsight Compute sourcecounters for 2719th kernel invocation

Must be compiled with -lineinfo

TestEm3 2719th kernel: Source

Nsight Compute source for 2719th kernel invocation

Must be compiled with -lineinfo

Profile your code

Additional Slides

TestEm3 H100

TestEm3 Titan V

TestEm3 Titan RTX

TestEm3 Titan RTX 2719th kernel: speed

Nsight Compute speed for 2719th kernel invocation

Spack Compute Capability

Spack only accepts a single cuda_arch value
Requires a full dependency rebuild (~90 mins)

Arch	Variant
Volta	`variants: +cuda cuda_arch=70 cxxstd=17`
Ampere	`variants: +cuda cuda_arch=80 cxxstd=17`
Hopper	`variants: +cuda cuda_arch=90 cxxstd=17`

Dockerfile `nsys`

nvidia/cuda:11.8.0-devel-ubuntu22.04 does not include nsys

Nsys 2022.4.2 (CUDA 11.8.0):

# Install nsys for profiling. ncu is already included.
RUN if [ "$DOCKERFILE_DISTRO" = "ubuntu" ] ; then \
  apt-get -yqq update \
  && apt-get -yqq install --no-install-recommends nsight-systems-2022.4.2 \
  && rm -rf /var/lib/apt/lists/* ; \
fi

Note: nsys and ncu will be removed from the -ci containers, which are smaller for bandwidth reasons.

Docker to Apptainer / Singularity

apptainer/singularity build can convert Docker images to Apptainer images
- from a registry via apptainer build img.sif docker://registry/image:tag
- locally via the Docker daemon via apptainer build img.sif docker-daemon:registry/image:tag
- locally via Docker archive files
https://apptainer.org/docs/user/main/docker_and_oci.html

# Build the appropriate container 
cd celeritas/scripts/Docker
# Build the cuda Docker container, sm_70. Wait ~90 minutes.
./build.sh cuda
# If the image hasn't been pushed to a registry, apptainer requires a local path, so save the image
rm -f image.tar && docker save $(docker images --format="{{.Repository}} {{.ID}}" | grep "celeritas/dev-jammy-cuda11" | sort -rk 2 | awk 'NR==1{print $2}') -o image.tar
# Convert to an apptainer container in the working dir
apptainer build -F celeritas-dev-jammy-cuda11.sif docker-archive:image.tar

Docker to Apptainer / Singularity

Docker and Apptainer have different defaults when executing images
- Default directory bindings
- environment variable mapping
- in-container user
- entrypoints
Likely need to run with various flags to achieve similar behaviour

Docker to Apptainer / Singularity

# ephemeral, does not bind home dir by default
docker run --rm -ti --gpus all -v .:/src celeritas/dev-jammy-cuda11:2023-06-19
# apptainer, runs as the current user, with the calling user's env vars and default bindings
apptainer run --nv --bind ./:/celeritas-project celeritas-dev-jammy-cuda11-2023-06-19.sif 
# TUoS HPC - this is not perfect
apptainer run --nv --bind ./:/celeritas-project /mnt/parscratch/users/ac1phey/celeritas-dev-jammy-cuda11-sm90.sif

lineinfo

Add -lineinfo to the CUDA compiler flags, e.g. via CMAKE_CUDA_FLAGS_RELEASE:

mkdir build-lineinfo && cd build-lineinfo
cmake .. -DCMAKE_CUDA_ARCHITECTURES=70 -DCMAKE_BUILD_TYPE=Release -DCMAKE_CUDA_FLAGS_RELEASE="-O3 -DNDEBUG -lineinfo" -DCELERITAS_DEBUG=OFF 
cmake --build . -j `nproc`
