Peter Heywood, Research Software Engineer
The University of Sheffield
2023-06-22
Must understand software performance to improve performance
Profile
Celeritas is a new Monte Carlo transport code designed for high-performance simulation of high-energy physics detectors.
The Celeritas project implements HEP detector physics on GPU accelerator hardware with the ultimate goal of supporting the massive computational requirements of the HL-LHC upgrade.
gprof, perf, Kcachegrind, VTune, …
roctracer
rocsys
rocprofv2
-DCMAKE_BUILD_TYPE=Release, -O3
-DCMAKE_BUILD_TYPE=RelWithDebInfo, -O2 -g
c8db3fce, v0.3.0
export CELER_DISABLE_PARALLEL=1
celer-sim test case
simple-cms.gdml
gamma-3evt-15prim.hepmc3
ctest -R app/celer-sim:device
celer-g4
testem3-flat.gdml
testem3.1k.hepmc3
/control/verbose 0
/tracking/verbose 0
/run/verbose 0
/event/verbose 0
/celer/outputFile testem3-1k.out.json
/celer/maxNumTracks 524288
/celer/maxNumEvents 2048
/celer/maxInitializers 4194304
/celer/secondaryStackFactor 3
/celerg4/geometryFile /celeritas/test/celeritas/data/testem3-flat.gdml
/celerg4/eventFile /benchmarks/testem3.1k.hepmc3
88us
5.2us
16 * 256
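NVTX ranges, which this deck uses to annotate `some_function`, only nest correctly in the timeline if every `nvtxRangePush` is matched by an `nvtxRangePop`, including on early returns. The following is a minimal sketch of an RAII wrapper that keeps the two calls paired. `ScopedRange`, its `depth()` counter, and the `SKETCH_HAVE_NVTX` macro are hypothetical names introduced for illustration; they are not part of NVTX or Celeritas. The sketch assumes a compiler that supports `__has_include`, and it compiles the ranges away when the NVTX header is not installed. The sleep is shortened to 1 ms to keep the example quick.

```cpp
#include <cassert>
#include <chrono>
#include <thread>

// Use the real NVTX header when it is available; otherwise the ranges
// become no-ops so the sketch still builds on machines without CUDA.
#if defined(__has_include)
#    if __has_include(<nvtx3/nvToolsExt.h>)
#        include <nvtx3/nvToolsExt.h>
#        define SKETCH_HAVE_NVTX 1
#    endif
#endif

// Hypothetical helper (not part of NVTX or Celeritas): pushes a range on
// construction and pops it on destruction, so the calls cannot get unbalanced.
class ScopedRange
{
  public:
    explicit ScopedRange(char const* name)
    {
#ifdef SKETCH_HAVE_NVTX
        nvtxRangePush(name);
#else
        (void)name;
#endif
        ++depth();
    }

    ~ScopedRange()
    {
#ifdef SKETCH_HAVE_NVTX
        nvtxRangePop();
#endif
        --depth();
    }

    ScopedRange(ScopedRange const&) = delete;
    ScopedRange& operator=(ScopedRange const&) = delete;

    // Number of ranges currently open on this thread.
    // Bookkeeping for this sketch only; NVTX itself does not expose it.
    static int& depth()
    {
        static thread_local int value = 0;
        return value;
    }
};

// Same shape as the deck's some_function, but returns the deepest nesting
// level it observed so the pairing can be checked.
int some_function()
{
    ScopedRange outer(__FUNCTION__);
    int max_depth = ScopedRange::depth();
    for (int i = 0; i < 6; ++i)
    {
        ScopedRange inner("inner");  // popped at the end of each iteration
        if (ScopedRange::depth() > max_depth)
        {
            max_depth = ScopedRange::depth();
        }
        std::this_thread::sleep_for(std::chrono::milliseconds{1});
    }
    return max_depth;
}
```

Calling `some_function()` opens one outer range and six nested `inner` ranges in turn. Once it returns, every range has been closed again by the destructors, with no explicit pop calls to forget.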
#include <chrono>
#include <thread>

void some_function() {
    for (int i = 0; i < 6; ++i) {
        std::this_thread::sleep_for(std::chrono::milliseconds{100});
    }
}

#include <chrono>
#include <thread>
#include <nvtx3/nvToolsExt.h>

void some_function() {
    // Outer range, named after the enclosing function
    nvtxRangePush(__FUNCTION__);
    for (int i = 0; i < 6; ++i) {
        // Nested range, one per loop iteration
        nvtxRangePush("inner");
        std::this_thread::sleep_for(std::chrono::milliseconds{100});
        nvtxRangePop();
    }
    nvtxRangePop();
}

-lineinfo for line-level profiling
--set=full
-s, -c etc.
# All metrics, skip 64 kernels, capture 128.
ncu --set=full -s 64 -c 128 -o metrics celer-g4 input.mac
ncu-ui metrics.ncu-rep
--target-processes
-lineinfo
-lineinfo
cuda_arch value

| Arch | Variant |
|---|---|
| Volta | variants: +cuda cuda_arch=70 cxxstd=17 |
| Ampere | variants: +cuda cuda_arch=80 cxxstd=17 |
| Hopper | variants: +cuda cuda_arch=90 cxxstd=17 |

nsys
nvidia/cuda:11.8.0-devel-ubuntu22.04 does not include nsys
Nsys 2022.4.2 (CUDA 11.8.0):
# Install nsys for profiling. ncu is included
RUN if [ "$DOCKERFILE_DISTRO" = "ubuntu" ] ; then \
    apt-get -yqq update \
    && apt-get -yqq install --no-install-recommends nsight-systems-2022.4.2 \
    && rm -rf /var/lib/apt/lists/* ; \
  fi
-ci containers, which are smaller for bandwidth reasons.
apptainer build img.sif docker://registry/image:tag
apptainer build img.sif docker-daemon:registry/image:tag
# Build the appropriate container
cd celeritas/scripts/Docker
# Build the cuda Docker container, sm_70. Wait ~90 minutes.
./build.sh cuda
# If the image hasn't been pushed to a registry, apptainer requires a local path, so save the image
rm -f image.tar && docker save $(docker images --format="{{.Repository}} {{.ID}}" | grep "celeritas/dev-jammy-cuda11" | sort -rk 2 | awk 'NR==1{print $2}') -o image.tar
# Convert to an apptainer container in the working dir
apptainer build -F celeritas-dev-jammy-cuda11.sif docker-archive:image.tar
# ephemeral, does not bind home dir by default
docker run --rm -ti --gpus all -v .:/src celeritas/dev-jammy-cuda11:2023-06-19
# apptainer, runs as the current user, with the calling user's env vars and default bindings
apptainer run --nv --bind ./:/celeritas-project celeritas-dev-jammy-cuda11-2023-06-19.sif
# TUoS HPC - this is not perfect
apptainer run --nv --bind ./:/celeritas-project /mnt/parscratch/users/ac1phey/celeritas-dev-jammy-cuda11-sm90.sif
Add -lineinfo to
mkdir build-lineinfo && cd build-lineinfo
cmake .. -DCMAKE_CUDA_ARCHITECTURES=70 -DCMAKE_BUILD_TYPE=Release -DCMAKE_CUDA_FLAGS_RELEASE="-O3 -DNDEBUG -lineinfo" -DCELERITAS_DEBUG=OFF
cmake --build . -j `nproc`
GPU Profiling with Celeritas - ExaTEPP workshop