surface_model profiling
Peter Heywood, Research Software Engineer
The University of Sheffield
2024-11-12

surface_model & testRaytracing
surface_model
VecGeom is a geometry modeller library with hit-detection features, as needed by particle detector simulation at the LHC and beyond

testRaytracing
test/surfaces/testRaytracing.{h/cpp/cu}
-oncpu 0 to disable CPU runs to speed up profiling

[Figure: testRaytracing timeline, CPU & GPU (16384 rays, -use_TB_gun 1, V100)]
[Figure: testRaytracing timeline, -oncpu 0 -only_surf 1 (524228 rays, -use_TB_gun 1, V100)]

| GPU | CC | CPU | Cluster | Driver |
|---|---|---|---|---|
| V100 SXM2 | 70 | Intel Xeon Gold 6138 | TUoS Bessemer | 550.127.05 |
| A100 SXM4 | 80 | AMD EPYC 7413 | TUoS Stanage | 550.127.05 |
| H100 PCIe | 90 | AMD EPYC 7413 | TUoS Stanage | 550.127.05 |
| GH200 | 90 | Nvidia Grace | N8CIR Bede | 560.35.03 |

| Geometry | Touchables |
|---|---|
| trackML.gdml | 18790 |
| TBHGCal181Oct_fixdup.gdml | 61802 |
| cms2026D110DD4hep_fix.gdml | 13133900 |
| LHCb_Upgrade_onlyECALandHCAL.gdml | 18429884 |

testRaytracing benchmarking with 10 Million Rays

PropagateRaysSurf & PropagateRaysSurfBVH
-nrays 524228 -use_TB_gun 1 on V100

| Method | Duration (s) |
|---|---|
| PropagateRaysSurf | 11.454 |
| PropagateRaysSurfBVH | 3.428 |

PropagateRaysSurf & PropagateRaysSurfBVH kernel profiling
ncu --set full -o report.ncu-rep ./testRaytracing ...
A warp is 32 threads which execute in lock-step
At most 64 warps (2048 threads) are resident per SM on V100
At most 255 32-bit registers per thread in recent HPC GPUs
PropagateRaysSurf uses 255 reg/thread: 12.5% occupancy
Attempt to improve by:

-bvh_single_step
PropagateRaysSurfBVHSingle - traverse a single step
filterAliveRays - compact the alive/inside rays for the next iteration
250 registers per thread
12.5% occupancy

| Method | Duration (s) |
|---|---|
| PropagateRaysSurf | 11.454 |
| PropagateRaysSurfBVH | 3.428 |
| bvh_single_step | 2.261 |
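
The compaction step performed by filterAliveRays can be illustrated on the host. The slides do not show the GPU implementation, so the following is only a minimal C++ sketch of the idea, assuming a hypothetical `Ray` record and a hypothetical helper `compactAliveRays`; it uses `std::stable_partition` rather than whatever the kernel actually does.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical ray record: only the fields needed for compaction are modelled.
struct Ray {
  int id;      // index of the ray in the original launch
  bool alive;  // true while the ray is still inside the geometry
};

// Move all alive rays to the front (keeping their relative order) and drop
// the rest. Returns the number of rays left for the next iteration.
std::size_t compactAliveRays(std::vector<Ray>& rays) {
  auto firstDead = std::stable_partition(
      rays.begin(), rays.end(), [](const Ray& r) { return r.alive; });
  rays.erase(firstDead, rays.end());
  // Postcondition: every remaining ray is alive.
  assert(std::all_of(rays.begin(), rays.end(),
                     [](const Ray& r) { return r.alive; }));
  return rays.size();
}
```

After compaction, the next single-step launch only needs as many threads as there are surviving rays, so finished rays no longer occupy threads.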

-bvh_split_step
Split PropagateRaysSurfBVHSingle into
ComputeStepAndNextSurfaces
RelocateToNextVolumes
153 reg / thread for ComputeStepAndNextSurfaces
18.75% occupancy
218 reg / thread for RelocateToNextVolumes
12.5% occupancy

| Method | Duration (s) |
|---|---|
| PropagateRaysSurf | 11.454 |
| PropagateRaysSurfBVH | 3.428 |
| bvh_single_step | 2.261 |
| bvh_split_step | 1.948 |
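
The occupancy figures quoted for these kernels can be reproduced with a simplified register-only model. The sketch below is an assumption-laden illustration, not the profiler's own calculation: it assumes V100 (compute capability 7.0) limits of 65536 registers per SM, registers allocated per warp in units of 256, and resident warps rounded down to a multiple of 4. It ignores shared memory and block-count limits. All names are hypothetical.

```cpp
#include <cassert>

// Assumed V100 (compute capability 7.0) limits; not taken from the slides
// except for the 64 warps per SM and 255 registers per thread.
const int kThreadsPerWarp = 32;
const int kMaxWarpsPerSM = 64;
const int kRegistersPerSM = 65536;
const int kRegAllocUnit = 256;         // registers are allocated per warp in units of 256
const int kWarpAllocGranularity = 4;   // resident warps are counted in groups of 4

int roundUp(int value, int multiple) {
  return ((value + multiple - 1) / multiple) * multiple;
}

int roundDown(int value, int multiple) {
  return (value / multiple) * multiple;
}

// Warps that fit on one SM when registers are the only limiting resource.
int residentWarps(int regsPerThread) {
  assert(regsPerThread >= 1 && regsPerThread <= 255);
  const int regsPerWarp = roundUp(regsPerThread * kThreadsPerWarp, kRegAllocUnit);
  const int warps = roundDown(kRegistersPerSM / regsPerWarp, kWarpAllocGranularity);
  return warps < kMaxWarpsPerSM ? warps : kMaxWarpsPerSM;
}

// Theoretical occupancy as a fraction of the maximum resident warps.
double theoreticalOccupancy(int regsPerThread) {
  return static_cast<double>(residentWarps(regsPerThread)) / kMaxWarpsPerSM;
}
```

Under these assumptions, 255, 250 and 218 registers per thread all give 8 resident warps (12.5%), while 153 registers per thread gives 12 warps (18.75%), matching the values reported above.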

--maxrregcount=N
__maxnreg__ for CUDA >= 12.4
__launch_bounds__ (less intuitive)

-nrays 524228, -use_TB_gun 1, V100
-maxrregcount=128
25% theoretical occupancy
PropagateRaysSurfBVH

| Strategy | Reference (s) | 128 reg/thread (s) |
|---|---|---|
| PropagateRaysSurf | 11.456 | 9.224 |
| PropagateRaysSurfBVH | 3.430 | 3.515 |
| bvh_single_step | 2.263 | 2.323 |
| bvh_split_step | 1.948 | 2.145 |
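
One reason __launch_bounds__ is less intuitive than a direct register cap is that the register budget is only implied by the two launch-bound arguments. The following C++ sketch approximates that relationship. It is an assumption, not the compiler's documented algorithm: it takes 65536 registers per SM and an 8-register per-thread allocation step as V100 values, and `registerBudget` is a hypothetical helper name.

```cpp
#include <algorithm>
#include <cassert>

// Approximate per-thread register budget implied by
// __launch_bounds__(maxThreadsPerBlock, minBlocksPerSM).
// The constants are assumed V100 values; the real compiler heuristics may differ.
int registerBudget(int maxThreadsPerBlock, int minBlocksPerSM) {
  assert(maxThreadsPerBlock > 0 && minBlocksPerSM > 0);
  const int registersPerSM = 65536;   // assumed 32-bit registers per SM
  const int threadsPerWarp = 32;
  const int maxRegsPerThread = 255;   // hardware cap per thread
  const int regGranularity = 8;       // assumed per-thread allocation step

  // Blocks occupy whole warps, so round the block size up to a multiple of 32.
  const int threadsPerBlock =
      ((maxThreadsPerBlock + threadsPerWarp - 1) / threadsPerWarp) * threadsPerWarp;
  const int residentThreads = threadsPerBlock * minBlocksPerSM;

  // Share the register file between all resident threads, rounded down
  // to the allocation step and capped at the hardware maximum.
  const int budget = (registersPerSM / residentThreads / regGranularity) * regGranularity;
  return std::min(budget, maxRegsPerThread);
}
```

Under these assumptions, a 128 register budget with 32 threads per block corresponds to requesting 16 resident blocks per SM, which is much less direct than passing 128 to -maxrregcount.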

[Figure: -nrays 524228 on V100 with maximum register counts of 255 and 128]
bvh_single_step and bvh_split_step run indefinitely

testRaytracing.cu uses a fixed block size of 32 threads
trackML.gdml, 524228 rays, V100

| Strategy | Reference Time (s) | Time (s) | Selected block sizes |
|---|---|---|---|
| PropagateRaysSurf | 2.006 | 2.206 | 256 |
| PropagateRaysSurfBVH | 0.215 | 0.229 | 256 |
| bvh_single_step | 0.127 | 0.131 | 256 & 1024 |
| bvh_split_step | 0.116 | 0.174 | 384, 256 & 1024 |
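
To make the effect of the block size on the launch configuration concrete, the sketch below computes how many blocks are needed for a given ray count. It assumes one thread per ray, which the slides do not state explicitly, and the helper names `blocksForRays` and `idleThreads` are hypothetical rather than taken from testRaytracing.cu.

```cpp
#include <cassert>

// Number of thread blocks needed so that every ray gets one thread
// (ceiling division; one thread per ray is an assumption).
int blocksForRays(int nRays, int threadsPerBlock) {
  assert(nRays >= 0 && threadsPerBlock > 0);
  return (nRays + threadsPerBlock - 1) / threadsPerBlock;
}

// Threads in the final block that have no ray to process.
int idleThreads(int nRays, int threadsPerBlock) {
  return blocksForRays(nRays, threadsPerBlock) * threadsPerBlock - nRays;
}
```

For 524228 rays this gives 16383 blocks of 32 threads, or 2048 blocks of 256 threads.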

32 and 256?

[Figure panels: trackML.gdml, TBHGCal181Oct_fixdup.gdml, LHCb_Upgrade_onlyECALandHCAL.gdml, cms2026D110DD4hep_fix.gdml]
[Figure: 524228 rays on GH200 in FP64]
[Figure: 524228 rays on GH200]

VecGeom surface_model profiling - SWIFT-HEP #8 Joint with ExaTEPP