surface_model profiling
Peter Heywood, Research Software Engineer
The University of Sheffield
2024-11-12
surface_model & testRaytracing
surface_model
VecGeom is a geometry modeller library with hit-detection features as needed by particle detector simulation at the LHC and beyond
testRaytracing
test/surfaces/testRaytracing.{h/cpp/cu}
-oncpu 0 disables CPU runs to speed up profiling
testRaytracing timeline CPU & GPU
testRaytracing timeline -oncpu 0 -only_surf 1
GPU | CC | CPU | Cluster | Driver |
---|---|---|---|---|
V100 SXM2 | 70 | Intel Xeon Gold 6138 | TUoS Bessemer | 550.127.05 |
A100 SXM4 | 80 | AMD EPYC 7413 | TUoS Stanage | 550.127.05 |
H100 PCIe | 90 | AMD EPYC 7413 | TUoS Stanage | 550.127.05 |
GH200 | 90 | Nvidia Grace | N8CIR Bede | 560.35.03 |
Geometry | Touchables |
---|---|
trackML.gdml | 18790 |
TBHGCal181Oct_fixdup.gdml | 61802 |
cms2026D110DD4hep_fix.gdml | 13133900 |
LHCb_Upgrade_onlyECALandHCAL.gdml | 18429884 |
PropagateRaysSurf & PropagateRaysSurfBVH
-nrays 524228 -use_TB_gun 1 on V100
Method | Duration (s) |
---|---|
PropagateRaysSurf | 11.454 |
PropagateRaysSurfBVH | 3.428 |
PropagateRaysSurf & PropagateRaysSurfBVH kernel profiling
ncu --set full -o report.ncu-rep ./testRaytracing ...
A warp is 32 threads which execute in lock-step
Each SM holds at most 64 warps, 2048 threads for V100
At most 255 32-bit registers per thread in recent HPC GPUs
PropagateRaysSurf uses 255 reg/thread: 12.5% occupancy
Attempt to improve by:
-bvh_single_step
PropagateRaysSurfBVHSingle - traverse a single step
filterAliveRays - compact the alive/inside rays for the next iteration
250 registers per thread
12.5% occupancy
Method | Duration (s) |
---|---|
PropagateRaysSurf | 11.454 |
PropagateRaysSurfBVH | 3.428 |
bvh_single_step | 2.261 |
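The single-step pattern (advance every ray by one step, then compact the survivors) can be illustrated on the host. The following is a minimal CPU sketch in C++, not the VecGeom implementation: the `Ray` struct, `compactAliveRays` and `propagateUntilDone` are hypothetical names, and the loop body only stands in for the GPU kernels named above.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical ray record; the real VecGeom ray state is much richer.
struct Ray {
  int id;
  int stepsLeft;  // illustrative stand-in for "work still to do"
  bool alive;     // true while the ray is still inside the geometry
};

// Keep only alive rays, preserving their order, and return how many remain.
inline std::size_t compactAliveRays(std::vector<Ray>& rays) {
  auto firstDead = std::stable_partition(
      rays.begin(), rays.end(), [](const Ray& r) { return r.alive; });
  rays.erase(firstDead, rays.end());
  return rays.size();
}

// Alternate "one step for every ray" with compaction until no rays remain.
// Returns the number of step/compact iterations performed.
// Note: loops forever if `step` never marks a ray as dead.
template <typename StepFn>
int propagateUntilDone(std::vector<Ray>& rays, StepFn step) {
  int iterations = 0;
  while (!rays.empty()) {
    const std::size_t before = rays.size();
    for (Ray& r : rays) step(r);    // stands in for one single-step kernel launch
    compactAliveRays(rays);         // stands in for the compaction pass
    assert(rays.size() <= before);  // compaction never grows the work list
    ++iterations;
  }
  return iterations;
}
```

With a step function that decrements `stepsLeft` and clears `alive` at zero, rays needing 3, 1 and 2 steps finish after three iterations, and each later iteration processes a shorter, densely packed list. On a GPU the same idea keeps warps full of live rays instead of idle finished ones.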
-bvh_split_step
Split PropagateRaysSurfBVHSingle into ComputeStepAndNextSurfaces and RelocateToNextVolumes
153 reg / thread for ComputeStepAndNextSurfaces: 18.75% occupancy
218 reg / thread for RelocateToNextVolumes: 12.5% occupancy
Method | Duration (s) |
---|---|
PropagateRaysSurf | 11.454 |
PropagateRaysSurfBVH | 3.428 |
bvh_single_step | 2.261 |
bvh_split_step | 1.948 |
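The occupancy figures quoted so far (12.5% at 255 and 250 reg/thread, 18.75% at 153, 12.5% at 218) are consistent with a register-limited estimate. The sketch below reproduces that arithmetic in C++. It assumes V100 (CC 70) per-SM limits of 65536 registers and 64 warps, registers allocated per warp in units of 256, and warps allocated in groups of 4. These constants and the names `SmLimits` and `registerLimitedWarps` are illustrative assumptions, not part of VecGeom or the CUDA API, and limits from block size or shared memory are ignored.

```cpp
// Assumed per-SM limits for compute capability 7.0 (V100).
struct SmLimits {
  int registersPerSm = 65536;
  int maxWarpsPerSm = 64;
  int threadsPerWarp = 32;
  int regAllocUnit = 256;        // assumed: registers are allocated per warp in units of 256
  int warpAllocGranularity = 4;  // assumed: resident warps are counted in groups of 4
};

constexpr int roundUp(int value, int multiple) {
  return ((value + multiple - 1) / multiple) * multiple;
}

constexpr int roundDown(int value, int multiple) {
  return (value / multiple) * multiple;
}

// Number of warps that fit on one SM when registers are the only limit.
constexpr int registerLimitedWarps(int regsPerThread, SmLimits sm = {}) {
  const int regsPerWarp =
      roundUp(regsPerThread * sm.threadsPerWarp, sm.regAllocUnit);
  const int warps =
      roundDown(sm.registersPerSm / regsPerWarp, sm.warpAllocGranularity);
  return warps < sm.maxWarpsPerSm ? warps : sm.maxWarpsPerSm;
}

// Register-limited theoretical occupancy as a fraction of the warp limit.
constexpr double registerLimitedOccupancy(int regsPerThread, SmLimits sm = {}) {
  return static_cast<double>(registerLimitedWarps(regsPerThread, sm)) /
         sm.maxWarpsPerSm;
}
```

Under these assumptions, 255 and 250 reg/thread both allow 8 resident warps (8/64 = 12.5%), 153 allows 12 (18.75%), and 218 falls back to 8 because 9 warps would fit by register count but are rounded down to a multiple of 4.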
--maxrregcount=N
__maxnreg__ for CUDA >= 12.4
__launch_bounds__ (less intuitive)
-nrays 524228, -use_TB_gun 1, V100
-maxrregcount=128
25% theoretical occupancy
PropagateRaysSurfBVH
Strategy | Reference (s) | 128 reg/thread (s) |
---|---|---|
PropagateRaysSurf | 11.456 | 9.224 |
PropagateRaysSurfBVH | 3.430 | 3.515 |
bvh_single_step | 2.263 | 2.323 |
bvh_split_step | 1.948 | 2.145 |
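To read the table as relative change, the reference time can be divided by the time with the 128 register limit. A small C++ sketch of that arithmetic follows; the helper name `speedup` is hypothetical, and the inputs are the values from the table above.

```cpp
// Speedup of a variant relative to the reference run.
// Values above 1 mean the variant is faster, below 1 slower.
constexpr double speedup(double referenceSeconds, double variantSeconds) {
  return referenceSeconds / variantSeconds;
}

// Ratios for the table above (reference time / 128 reg/thread time).
constexpr double kPropagateRaysSurf = speedup(11.456, 9.224);
constexpr double kPropagateRaysSurfBVH = speedup(3.430, 3.515);
constexpr double kBvhSingleStep = speedup(2.263, 2.323);
constexpr double kBvhSplitStep = speedup(1.948, 2.145);
```

Only PropagateRaysSurf gains from the limit (roughly 1.24x). The three BVH variants come out slightly below 1, with bvh_split_step the most affected at roughly 0.91x.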
bvh_single_step and bvh_split_step run indefinitely
testRaytracing.cu uses a fixed 32 threads per block
trackML.gdml, 524228 rays, V100
Strategy | Reference Time (s) | Time (s) | Selected Block Sizes |
---|---|---|---|
PropagateRaysSurf | 2.006 | 2.206 | 256 |
PropagateRaysSurfBVH | 0.215 | 0.229 | 256 |
bvh_single_step | 0.127 | 0.131 | 256 & 1024 |
bvh_split_step | 0.116 | 0.174 | 384, 256 & 1024 |
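Changing the block size changes how many blocks are launched for the same number of rays. The C++ sketch below shows the usual ceiling-division launch arithmetic, assuming one thread per ray. That mapping is an assumption, since the slides do not state how rays map to threads, and the helper names `blocksForRays` and `idleThreadsInLastBlock` are hypothetical.

```cpp
#include <cstdint>

// Number of blocks needed so that every ray gets a thread (ceiling division).
constexpr std::int64_t blocksForRays(std::int64_t nRays,
                                     std::int64_t threadsPerBlock) {
  return (nRays + threadsPerBlock - 1) / threadsPerBlock;
}

// Threads in the final block that have no ray to process.
constexpr std::int64_t idleThreadsInLastBlock(std::int64_t nRays,
                                              std::int64_t threadsPerBlock) {
  return blocksForRays(nRays, threadsPerBlock) * threadsPerBlock - nRays;
}
```

For 524228 rays this gives 16383 blocks at 32 threads per block, 2048 blocks at 256, 1366 at 384 and 512 at 1024. The sketch only counts blocks; it says nothing about which size runs faster, which is what the measured times above show.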
32 and 256?
524228 rays on GH200 in FP64
VecGeom surface_model profiling - SWIFT-HEP #8 Joint with ExaTEPP