Peter Heywood, Research Software Engineer
The University of Sheffield
2023-06-22
To improve software performance, you must first understand where the time goes
Profile
Celeritas is a new Monte Carlo transport code designed for high-performance simulation of high-energy physics detectors.
The Celeritas project implements HEP detector physics on GPU accelerator hardware with the ultimate goal of supporting the massive computational requirements of the HL-LHC upgrade.
CPU tools: gprof, perf, KCachegrind, VTune, …
AMD GPU tools: roctracer, rocsys, rocprofv2
-DCMAKE_BUILD_TYPE=Release: -O3
-DCMAKE_BUILD_TYPE=RelWithDebInfo: -O2 -g
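For profiling you generally want optimized code plus debug symbols. A minimal out-of-source configure under these assumptions (the build directory name is a placeholder):

```shell
# Optimized build with debug info (-O2 -g) so profilers can resolve symbols
mkdir -p build-profile && cd build-profile
cmake .. -DCMAKE_BUILD_TYPE=RelWithDebInfo
cmake --build . -j "$(nproc)"
```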
Commit c8db3fce (v0.3.0)
export CELER_DISABLE_PARALLEL=1
celer-sim test case: simple-cms.gdml, gamma-3evt-15prim.hepmc3
ctest -R app/celer-sim:device
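Putting the pieces above together, one way to run the on-device celer-sim test serially (assumes a configured build directory):

```shell
# Serialize execution (see CELER_DISABLE_PARALLEL above)
export CELER_DISABLE_PARALLEL=1
# Run the device test for celer-sim from the build directory
ctest -R app/celer-sim:device --output-on-failure
```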
celer-g4 test case: testem3-flat.gdml, testem3.1k.hepmc3
/control/verbose 0
/tracking/verbose 0
/run/verbose 0
/event/verbose 0
/celer/outputFile testem3-1k.out.json
/celer/maxNumTracks 524288
/celer/maxNumEvents 2048
/celer/maxInitializers 4194304
/celer/secondaryStackFactor 3
/celerg4/geometryFile /celeritas/test/celeritas/data/testem3-flat.gdml
/celerg4/eventFile /benchmarks/testem3.1k.hepmc3
[Profiler timeline detail: kernel timings of 88 µs and 5.2 µs; 16 * 256]
Compile with -lineinfo for line-level profiling.
Useful ncu options: --set=full, -s, -c, etc.
# All metrics; skip the first 64 kernels, then capture 128.
ncu --set=full -s 64 -c 128 -o metrics celer-g4 input.mac
ncu-ui metrics.ncu-rep
--target-processes: controls which processes (including child processes) ncu profiles
-lineinfo
cuda_arch values:

Arch | Variant |
---|---|
Volta | +cuda cuda_arch=70 cxxstd=17 |
Ampere | +cuda cuda_arch=80 cxxstd=17 |
Hopper | +cuda cuda_arch=90 cxxstd=17 |
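Assuming the celeritas Spack package, a Volta build matching the table above might be requested as follows (the spec details are illustrative, not a verified recipe):

```shell
# Illustrative Spack spec for a Volta (sm_70) CUDA build of Celeritas
spack install celeritas +cuda cuda_arch=70 cxxstd=17
```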
nsys
nvidia/cuda:11.8.0-devel-ubuntu22.04 does not include nsys
Nsys 2022.4.2 (CUDA 11.8.0):
# Install nsys for profiling. ncu is included
RUN if [ "$DOCKERFILE_DISTRO" = "ubuntu" ] ; then \
apt-get -yqq update \
&& apt-get -yqq install --no-install-recommends nsight-systems-2022.4.2 \
&& rm -rf /var/lib/apt/lists/* ; \
fi
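Once nsys is installed in the image, a timeline can be captured and summarized; the application name and input file below are placeholders:

```shell
# Capture a timeline of CUDA API calls, kernels, and NVTX ranges
nsys profile --trace=cuda,nvtx -o timeline celer-g4 input.mac
# Summarize on the command line (or open timeline.nsys-rep in the GUI)
nsys stats timeline.nsys-rep
```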
-ci containers, which are smaller for bandwidth reasons.
apptainer build img.sif docker://registry/image:tag
apptainer build img.sif docker-daemon:registry/image:tag
# Build the appropriate container
cd celeritas/scripts/Docker
# Build the cuda Docker container, sm_70. Wait ~90 minutes.
./build.sh cuda
# If the image hasn't been pushed to a registry, apptainer requires a local path, so save the image
rm -f docker-temp.tar && docker save $(docker images --format="{{.Repository}} {{.ID}}" | grep "celeritas/dev-jammy-cuda11" | sort -rk 2 | awk 'NR==1{print $2}') -o image.tar
# Convert to an apptainer container in the working dir
apptainer build -F celeritas-dev-jammy-cuda11.sif docker-archive:image.tar
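The `docker save` one-liner above selects an image ID by sorting name/ID pairs; the selection logic is plain sort + awk (note it orders by the ID text, not creation time), as this sample shows:

```shell
# Same selection pattern as the one-liner above, on sample "repository id" rows:
# reverse-sort by the 2nd field, print the 2nd field of the first row.
printf 'celeritas/dev-jammy-cuda11 abc\nceleritas/dev-jammy-cuda11 xyz\n' \
  | sort -rk 2 | awk 'NR==1{print $2}'
# → xyz
```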
# ephemeral, does not bind home dir by default
docker run --rm -ti --gpus all -v .:/src celeritas/dev-jammy-cuda11:2023-06-19
# apptainer: runs as the current user, with the calling user's env vars and default bindings
apptainer run --nv --bind ./:/celeritas-project celeritas-dev-jammy-cuda11-2023-06-19.sif
# TUoS HPC - this is not perfect
apptainer run --nv --bind ./:/celeritas-project /mnt/parscratch/users/ac1phey/celeritas-dev-jammy-cuda11-sm90.sif
Add -lineinfo to the CUDA release flags:
mkdir build-lineinfo && cd build-lineinfo
cmake .. -DCMAKE_CUDA_ARCHITECTURES=70 -DCMAKE_BUILD_TYPE=Release -DCMAKE_CUDA_FLAGS_RELEASE="-O3 -DNDEBUG -lineinfo" -DCELERITAS_DEBUG=OFF
cmake --build . -j `nproc`
GPU Profiling with Celeritas - ExaTEPP workshop