Profiling Celeritas

Peter Heywood, Research Software Engineer

The University of Sheffield

2024-03-27

Context

Increase Science Throughput

  • Ever-increasing demand for simulation throughput
  1. Buy more / “better” hardware
  2. Improve Software
    • Improve implementations
    • Improve algorithms (i.e. work efficiency)
  • Must understand software performance to improve performance

Profile

Profiling Tools

  • CPU-only profilers
    • gprof, perf, Kcachegrind, VTune, …
  • AMD Profiling tools
    • roctracer
    • rocsys
    • rocprofv2

Celeritas

The Celeritas project implements HEP detector physics on GPU accelerator hardware with the ultimate goal of supporting the massive computational requirements of the HL-LHC upgrade.

Graphics Processing Unit(s)

  • Highly-parallel many-core co-processors
  • Optimised for throughput
  • (Relatively) small capacity of high-bandwidth memory
  • Power efficient (for suitable workloads)
  • Often connected to the host via (comparatively) low-bandwidth PCIe

Titan Xp & Titan V GPUs

NVIDIA Grace Hopper Superchip

  • GH200 480GB
    • 72-core ARM CPU
    • 480GB LPDDR5X
    • H100 GPU (132 SMs)
    • 96GB HBM3 (4TB/s)
    • NVLink-C2C 900 GB/s bidirectional bandwidth
    • 450-1000W
  • Three are now included in the Bede Tier 2 HPC facility

Host-Device Bandwidth

GPU Host-Device Interconnect Bandwidth

Celeritas test suite on GH200

$ ctest
# ...
99% tests passed, 2 tests failed out of 203
$ ctest --rerun-failed --output-on-failure
# ... 
1/2 Test #158: celeritas/mat/Material ...........***Failed
    Error regular expression found in output. Regex=[tests FAILED]  0.68 sec
# ... 
2/2 Test #160: celeritas/phys/Particle ..........   Passed    0.61 sec

50% tests passed, 1 tests failed out of 2
JSON Comparison mass_radiation_coeff
Expected 0.03605392839455309
Actual 0.0360539283945531

Profiling Celeritas

Inputs / Configuration

  • Inputs should ideally be:
    • Representative of real-world use
    • Large enough to fully utilise hardware
    • Small enough to generate usable profile data
  • Optimised build
    • -DCMAKE_BUILD_TYPE=Release, -O3 -lineinfo
  • Celeritas v0.4.2 with VecGeom v1.2.4

Profiling Scenario

  • 16 Events
  • 1300 primaries per event
  • 1048576 track slots (max threads)
$ time ./bin/celer-sim cms2018+field+msc.json
# ...
real    1m13.997s
user    1m7.096s
sys     0m0.871s
{
    "_exe": "celer-sim",
    "_format": "celer-sim",
    "_geometry": "vecgeom",
    "_instance": 0,
    "_name": [
        "cms2018+field+msc",
        "vecgeom",
        "gpu"
    ],
    "_outdir": "cms2018+field+msc-vecgeom-gpu",
    "_timeout": 600.0,
    "_use_celeritas": true,
    "_version": "0.4.2",
    "action_diagnostic": false,
    "brem_combined": false,
    "cuda_heap_size": null,
    "cuda_stack_size": 8192,
    "default_stream": false,
    "environ": {},
    "event_file": null,
    "field": [
        0.0,
        0.0,
        1.0
    ],
    "field_options": {
        "delta_chord": 0.025,
        "delta_intersection": 1e-05,
        "epsilon_rel_max": 0.001,
        "epsilon_step": 1e-05,
        "errcon": 0.0001,
        "max_nsteps": 100,
        "max_stepping_decrease": 0.1,
        "max_stepping_increase": 5.0,
        "minimum_step": 1.0000000000000002e-06,
        "pgrow": -0.2,
        "pshrink": -0.25,
        "safety": 0.9
    },
    "geometry_file": "/path/to/cms2018.gdml",
    "initializer_capacity": 67108864,
    "max_events": 16,
    "max_steps": 32768,
    "mctruth_file": null,
    "mctruth_filter": null,
    "merge_events": true,
    "num_track_slots": 1048576,
    "physics_file": "",
    "physics_options": {
        "annihilation": true,
        "apply_cuts": false,
        "brems": "all",
        "compton_scattering": true,
        "coulomb_scattering": false,
        "default_cutoff": 0.1,
        "eloss_fluctuation": true,
        "em_bins_per_decade": 56,
        "gamma_conversion": true,
        "gamma_general": false,
        "integral_approach": true,
        "ionization": true,
        "linear_loss_limit": 0.01,
        "lowest_electron_energy": [
            0.001,
            "MeV"
        ],
        "lpm": true,
        "max_energy": [
            100000000.0,
            "MeV"
        ],
        "min_energy": [
            0.0001,
            "MeV"
        ],
        "msc": "urban",
        "msc_lambda_limit": 0.1,
        "msc_range_factor": 0.04,
        "msc_safety_factor": 0.6,
        "photoelectric": true,
        "rayleigh_scattering": false,
        "relaxation": "none",
        "verbose": false
    },
    "primary_options": {
        "direction": {
            "distribution": "isotropic",
            "params": []
        },
        "energy": {
            "distribution": "delta",
            "params": [
                10000.0
            ]
        },
        "num_events": 16,
        "pdg": [
            11
        ],
        "position": {
            "distribution": "delta",
            "params": [
                0.0,
                0.0,
                0.0
            ]
        },
        "primaries_per_event": 1300,
        "seed": 0
    },
    "secondary_stack_factor": 3.0,
    "seed": 20220904,
    "simple_calo": [],
    "step_diagnostic": false,
    "step_diagnostic_bins": null,
    "step_limiter": null,
    "sync": false,
    "track_order": "unsorted",
    "use_device": true,
    "warm_up": true,
    "write_track_counts": true
}
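
As a quick sanity check of the scenario size, the sketch below works out the total number of primaries and compares it with the number of track slots. The variable names are illustrative, and the fill figure assumes every primary is resident at the start of the run (plausible with merge_events enabled, but an assumption here).

```shell
# Values copied from the scenario above.
events=16
primaries_per_event=1300
track_slots=1048576

total_primaries=$((events * primaries_per_event))

# Percentage of track slots used by primaries alone (one decimal place).
initial_fill_pct=$(awk -v p="$total_primaries" -v s="$track_slots" \
    'BEGIN { printf "%.1f", 100 * p / s }')

echo "total primaries: ${total_primaries}"
echo "initial track-slot fill: ${initial_fill_pct}%"
```

Under that assumption the primaries occupy only a small fraction of the slots, so the remaining capacity is there for the secondaries produced during the shower.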

Power Consumption Monitoring

gpuutiliz -frequency 1 &
gupid=$!
./bin/celer-sim input.json
kill ${gupid}
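
The same start, run, stop pattern can be wrapped in a small function so that the monitor is always stopped and the simulation's exit status is preserved. This is a sketch only: run_monitored is a hypothetical helper, and the nvidia-smi query in the usage comment is a stand-in for whichever monitor is available.

```shell
# run_monitored MONITOR_CMD CMD [ARGS...]
# Starts MONITOR_CMD in the background, runs CMD, then stops the monitor.
# Returns the exit status of CMD.
run_monitored() {
    monitor_cmd=$1
    shift
    # exec replaces the sh process, so $! is the PID of the monitor itself.
    sh -c "exec $monitor_cmd" &
    monitor_pid=$!
    status=0
    "$@" || status=$?
    kill "$monitor_pid" 2>/dev/null || true
    wait "$monitor_pid" 2>/dev/null || true
    return "$status"
}

# Usage (not executed here; power.csv is an illustrative file name):
# run_monitored \
#   "nvidia-smi --query-gpu=timestamp,power.draw,utilization.gpu --format=csv -l 1 > power.csv" \
#   ./bin/celer-sim input.json
```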

GPU monitoring using willfurnass/gpuutiliz for celer-sim cms2018+field+msc on GH200, plotted via https://github.com/ptheywood/gpuutiliz-plotting

Nsight Systems

  • System-wide performance analysis
  • CPU + GPU
  • Visualise a timeline of events
  • CUDA API information, kernel block sizes, etc
  • Pascal GPUs or newer (SM 60+)
nsys profile -o timeline ./bin/celer-sim input.json
nsys-ui timeline.nsys-rep
  • Enable NVTX in Celeritas via CELER_ENABLE_PROFILING=1
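
Putting the two points above together, a timeline capture with NVTX ranges enabled might look like the following. This is a sketch: profile_timeline and the DRY_RUN switch are hypothetical conveniences, and --trace=cuda,nvtx simply makes explicit which traces are requested.

```shell
# profile_timeline OUTPUT INPUT_JSON
# Runs celer-sim under nsys with Celeritas NVTX ranges enabled.
# With DRY_RUN set to a non-empty value, the command is printed instead of run.
profile_timeline() {
    ${DRY_RUN:+echo} env CELER_ENABLE_PROFILING=1 \
        nsys profile --trace=cuda,nvtx -o "$1" ./bin/celer-sim "$2"
}

# Usage (not executed here):
# profile_timeline timeline input.json
# nsys-ui timeline.nsys-rep
```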

nsys: Timeline

Nsys Timeline view for celer-sim cms2018+field+msc on GH200

nsys: Host-Device Communication

Nsys Timeline view for celer-sim cms2018+field+msc on GH200, showing the bulk of the host-device communication

242MB, 690μs @ 328GB/s
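
To put the measured rate in context, it can be compared with the interconnect peak. The sketch below assumes a per-direction peak of 450 GB/s, i.e. half of the 900 GB/s bidirectional NVLink-C2C figure quoted earlier; link_utilisation is a hypothetical helper.

```shell
# link_utilisation ACHIEVED_GBPS PEAK_GBPS
# Prints the achieved rate as a whole-number percentage of the peak.
link_utilisation() {
    awk -v a="$1" -v p="$2" 'BEGIN { printf "%.0f%%\n", 100 * a / p }'
}

link_utilisation 328 450   # prints 73%
```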

nsys: Longest Duration Kernel

Nsys Timeline view for celer-sim cms2018+field+msc on GH200, showing a single step including the longer running kernels. The summary view shows kernels sorted by duration

Nsight Compute

  • Detailed GPU performance metrics
  • Compile with -lineinfo for line-level profiling
  • Use --set=full to collect the full metric set when profiling non-interactively
  • Replays each GPU kernel many times, significantly increasing runtime
  • Reduce the number of captured kernels via filtering (-k), -s (--launch-skip), -c (--launch-count), etc.
  • Volta+ (SM >= 70)
# All metrics, skip 64 kernels, capture 128.
ncu --set=full -s 64 -c 128 -o metrics.ncu-rep \
    ./bin/celer-sim input.json
ncu-ui metrics.ncu-rep
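
Kernel-name filtering can narrow the capture further. The sketch below is an assumption-laden example: profile_kernels and the DRY_RUN switch are hypothetical, and the pattern "along" is only a guess at a substring of the along-step kernel name, so check the names shown in the nsys timeline first.

```shell
# profile_kernels PATTERN SKIP COUNT OUTPUT INPUT_JSON
# Profiles only kernels whose name matches the regex PATTERN.
# With DRY_RUN set to a non-empty value, the command is printed instead of run.
profile_kernels() {
    ${DRY_RUN:+echo} ncu --set full -k "regex:$1" -s "$2" -c "$3" -o "$4" \
        ./bin/celer-sim "$5"
}

# Usage (not executed here):
# profile_kernels along 0 16 along-step input.json
# ncu-ui along-step.ncu-rep
```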

ncu: Summary

Nsight Compute UI showing the summary table for 100 kernel launches from celer-sim cms2018+field+msc on GH200

ncu: “Speed of Light”

Nsight Compute UI showing the 'speed of light' for the along-step kernel from celer-sim cms2018+field+msc on GH200

ncu: Scheduler

Nsight Compute UI showing the scheduler statistics for the along-step kernel from celer-sim cms2018+field+msc on GH200

ncu: Warp state

Nsight Compute UI showing the warp state statistics for the along-step kernel from celer-sim cms2018+field+msc on GH200

ncu: Occupancy

Nsight Compute UI showing the occupancy section for the along-step kernel from celer-sim cms2018+field+msc on GH200

ncu: Performance Monitor Sampling

Nsight Compute UI showing the performance monitor (PM) sampling section for the along-step kernel from celer-sim cms2018+field+msc on GH200

  • Nsight Compute >= 2023.3 (distributed with CUDA 12.3)

ncu: Memory Access Pattern

Nsight Compute UI showing the memory diagram for the along-step kernel from celer-sim cms2018+field+msc on GH200

Thank you

UKRI “Shaping the Future of UK large-scale compute” survey
closes 29 March 2024 (this Friday!)

https://engagementhub.ukri.org/ukri-infrastructure/shaping-the-future-of-uk-large-scale-compute/

Additional Slides

Building Celeritas on GH200

  • The build emits some ABI notes, which can be suppressed via -Wno-psabi
  • Seen with GCC >= 10.1 on aarch64
include/VecGeom/base/Transformation3D.h: In member function
  ‘vecgeom::cxx::Vector3D<double> 
  vecgeom::cxx::Transformation3D::Translation() const’:
include/VecGeom/base/Transformation3D.h:213:3: note: parameter 
  passing for argument of type ‘vecgeom::cxx::Vector3D<double>’ 
  when C++17 is enabled changed to match C++14 in GCC 10.1
  213 |   {
      |   ^

Celeritas test suite on GH200

$ ctest
# ...
99% tests passed, 2 tests failed out of 203

Label Time Summary:
app           = 108.80 sec*proc (11 tests)
gpu           = 101.33 sec*proc (43 tests)
nomemcheck    = 107.88 sec*proc (9 tests)
unit          =  33.99 sec*proc (191 tests)

Total Test time (real) = 140.78 sec

The following tests FAILED:
        158 - celeritas/mat/Material (Failed)
        160 - celeritas/phys/Particle (SEGFAULT)

Celeritas test suite on GH200

$ ctest --rerun-failed --output-on-failure
# ... 
1/2 Test #158: celeritas/mat/Material ...........***Failed
    Error regular expression found in output. Regex=[tests FAILED]  0.68 sec
# ... 
2/2 Test #160: celeritas/phys/Particle ..........   Passed    0.61 sec

50% tests passed, 1 tests failed out of 2

The following tests FAILED:
        158 - celeritas/mat/Material (Failed)


JSON Comparison mass_radiation_coeff
Expected 0.03605392839455309
Actual 0.0360539283945531

Running the scenario

$ time ./bin/celer-sim cms2018+field+msc.json
status: Loading input and initializing problem data
status: Initializing Geant4 run manager
status: Initializing Geant4 geometry
info: Loading Geant4 geometry from GDML at /path/to/cms2018.gdml
status: Building Geant4 physics tables
status: Transferring data from Geant4
status: Loading external elemental data
status: Loading VecGeom geometry from GDML at /path/to/cms2018.gdml
status: Initializing tracking information
celeritas/src/celeritas/geo/GeoMaterialParams.cc:205: warning: Some geometry volumes do not have known material IDs: PixelForwardInnerDiskOuterRing_seg_1@0x7f4a9a837fc0, 

# ...

real    1m13.997s
user    1m7.096s
sys     0m0.871s

ncu: ERR_NVGPUCTRPERM

  • Since driver 418.43 (2019-02-22), access to NVIDIA GPU performance counters requires root privileges unless an administrator disables the security mitigation. See ERR_NVGPUCTRPERM.
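
Whether the restriction is active can be checked before launching ncu. The sketch below assumes the Linux driver exposes an RmProfilingAdminOnly entry in /proc/driver/nvidia/params, as described in NVIDIA's ERR_NVGPUCTRPERM guidance; profiling_restricted is a hypothetical helper.

```shell
# profiling_restricted [PARAMS_FILE]
# Exit 0 if counters are restricted to admin users, 1 if they are not,
# 2 if the setting cannot be determined.
profiling_restricted() {
    params_file=${1:-/proc/driver/nvidia/params}
    [ -r "$params_file" ] || return 2
    awk -F': *' '
        $1 == "RmProfilingAdminOnly" { found = 1; restricted = ($2 == 1) }
        END { exit (found ? !restricted : 2) }
    ' "$params_file"
}

# An administrator can lift the restriction with a module option, e.g. in a
# file under /etc/modprobe.d/ (followed by a driver reload or reboot):
#   options nvidia NVreg_RestrictProfilingToAdminUsers=0
```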

ncu: Summary along-step

Nsight Compute UI showing the summary for the slowest along-step kernel from celer-sim cms2018+field+msc on GH200

ncu: Compute

Nsight Compute UI showing the Compute section for the along-step kernel from celer-sim cms2018+field+msc on GH200

ncu: Instructions

Nsight Compute UI showing the instructions section for the along-step kernel from celer-sim cms2018+field+msc on GH200

ncu: Memory Access Pattern

Nsight Compute UI showing the memory table for the along-step kernel from celer-sim cms2018+field+msc on GH200

ncu: Launch Statistics

Nsight Compute UI showing the launch statistics for the along-step kernel from celer-sim cms2018+field+msc on GH200

ncu: Source

Nsight Compute UI showing the source counters for the along-step kernel from celer-sim cms2018+field+msc on GH200