Peter Heywood, Research Software Engineer
The University of Sheffield
2024-01-23
Celeritas is a new Monte Carlo transport code designed for high-performance simulation of high-energy physics detectors.
The Celeritas project implements HEP detector physics on GPU accelerator hardware with the ultimate goal of supporting the massive computational requirements of the HL-LHC upgrade.
- ncu
- CELER_ENABLE_PROFILING
- nsys, to only profile selected regions
thrust methods per step which allocate and deallocate temporary memory
CCCL (CUDA C++ Core Libraries): github.com/NVIDIA/cccl
- "Thrust allows you to implement high performance parallel applications with minimal programming effort through a high-level interface"
- "CUB provides state-of-the-art, reusable software components for every layer of the CUDA programming model"
- libcu++: "The C++ Standard Library for Your Entire System"
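As a small illustration of Thrust's high-level interface (my example, not from the talk): a device reduction is a single call, with any temporary device storage allocated and freed inside that call.

#include <vector>

#include <thrust/device_vector.h>
#include <thrust/reduce.h>

// Sum values on the GPU in one call; Thrust chooses the kernels and manages
// any temporary device storage internally (allocated and freed per call).
int sum_on_device(std::vector<int> const& host_values)
{
    thrust::device_vector<int> d_values(host_values.begin(), host_values.end());
    return thrust::reduce(d_values.begin(), d_values.end(), 0);
}

By contrast, CUB's device-wide primitives ask the caller to provide the temporary storage explicitly, which is what makes the allocation behaviour controllable.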
remove_if_alive
size_type remove_if_alive(
    StateCollection<TrackSlotId, Ownership::reference, MemSpace::device> const&
        vacancies,
    StreamId stream_id)
{
    ScopedProfiling profile_this{"remove-if-alive"};
    auto start = device_pointer_cast(vacancies.data());
    // Remove slots marked as occupied; the remaining prefix holds the vacancies
    auto end = thrust::remove_if(thrust_execute_on(stream_id),
                                 start,
                                 start + vacancies.size(),
                                 IsEqual{occupied()});
    CELER_DEVICE_CHECK_ERROR();
    // New size of the vacancy vector
    return end - start;
}
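The thrust::remove_if call above allocates and frees its own temporary storage on every invocation. For comparison, CUB's device-level selection primitive, cub::DeviceSelect::If, makes that scratch buffer explicit so it can be sized once and reused across steps. The following is a minimal sketch with hypothetical names (IsVacant, select_vacancies, d_vacancies), not Celeritas code:

#include <cstddef>

#include <cub/device/device_select.cuh>
#include <cuda_runtime.h>

// Hypothetical predicate: keep vacant track slots (the complement of the
// IsEqual{occupied()} predicate that remove_if uses to drop occupied slots).
struct IsVacant
{
    unsigned int occupied;
    __device__ bool operator()(unsigned int slot) const { return slot != occupied; }
};

// Two-phase CUB pattern: a call with a null scratch pointer only reports the
// required size in temp_bytes; the caller allocates that buffer once and
// reuses it every step instead of paying a cudaMalloc/cudaFree per call.
void select_vacancies(unsigned int const* d_vacancies,
                      unsigned int* d_out,
                      int* d_num_selected,  // device pointer receiving the count
                      int num_items,
                      IsVacant pred,
                      void* d_temp,
                      std::size_t& temp_bytes,
                      cudaStream_t stream)
{
    cub::DeviceSelect::If(d_temp, temp_bytes, d_vacancies, d_out,
                          d_num_selected, num_items, pred, stream);
}

Unlike thrust::remove_if, the selected count is written to device memory and would need to be copied back to the host rather than computed from returned iterators.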
exclusive_scan_counts
size_type exclusive_scan_counts(
    StateCollection<size_type, Ownership::reference, MemSpace::device> const&
        counts,
    StreamId stream_id)
{
    ScopedProfiling profile_this{"exclusive-scan-counts"};
    // Exclusive scan of the counts, in place
    auto data = device_pointer_cast(counts.data());
    auto stop = thrust::exclusive_scan(thrust_execute_on(stream_id),
                                       data,
                                       data + counts.size(),
                                       data,
                                       size_type(0));
    CELER_DEVICE_CHECK_ERROR();
    // Copy the last element (accumulated total) back to host
    return *(stop - 1);
}
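Similarly, a CUB version of the scan makes the temporary storage a caller-owned buffer. A minimal sketch using cub::DeviceScan::ExclusiveSum (equivalent to an exclusive scan with an initial value of zero), with illustrative names rather than the talk's actual change:

#include <cstddef>

#include <cub/device/device_scan.cuh>
#include <cuda_runtime.h>

// In-place exclusive prefix sum over the per-track counts. The scratch
// buffer (d_temp/temp_bytes) is allocated once by the caller and reused each
// step; a call with d_temp == nullptr only reports the required size.
void exclusive_scan_counts_cub(unsigned int* d_counts,
                               int num_items,
                               void* d_temp,
                               std::size_t& temp_bytes,
                               cudaStream_t stream)
{
    cub::DeviceScan::ExclusiveSum(d_temp, temp_bytes, d_counts, d_counts,
                                  num_items, stream);
}

DeviceScan supports in-place operation, so the counts buffer serves as both input and output here, mirroring the Thrust version above.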
Possible ways to avoid the per-step temporary allocations:
- Replace thrust::remove_if with cub::copy_if
- Replace thrust::exclusive_scan with cub::DeviceScan::ExclusiveScan
- Keep Thrust but avoid the default thrust::cuda::allocator, e.g. by allocating temporaries with cudaMallocAsync
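A sketch of the third option, under my assumptions about how it could look (this is not Celeritas code): the Thrust calls stay as they are and only the temporary-storage allocation changes, by passing a custom allocator through the execution policy and backing it with the stream-ordered pool allocator cudaMallocAsync/cudaFreeAsync.

#include <cstddef>

#include <cuda_runtime.h>
#include <thrust/system/cuda/execution_policy.h>

// Allocator for Thrust's temporary storage that uses the stream-ordered
// memory pool instead of blocking cudaMalloc/cudaFree calls.
struct AsyncTempAlloc
{
    using value_type = char;

    cudaStream_t stream;

    char* allocate(std::ptrdiff_t num_bytes)
    {
        void* ptr = nullptr;
        cudaMallocAsync(&ptr, num_bytes, stream);
        return static_cast<char*>(ptr);
    }

    void deallocate(char* ptr, std::size_t)
    {
        cudaFreeAsync(ptr, stream);
    }
};

// Usage sketch: thrust::remove_if(thrust::cuda::par(alloc).on(stream), ...)
// with alloc = AsyncTempAlloc{stream}; the algorithm's scratch allocations
// then come from the CUDA memory pool associated with that stream.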
- N allocations with between 1 and N …
- N deallocations and 1 …

The change is ~10s of lines of code for a ~10 µs per step improvement (0.2% improvement).

Kernel launch overhead: 1-10 µs per kernel launch; 6087 kernel launches in 438 ms of TestEM3 with a smaller input file.

CUDA Graphs have been designed to allow work to be defined as graphs rather than single operations, providing a mechanism to launch multiple GPU operations through a single CPU operation, and hence reducing overheads.
Simple kernel, ~6 µs per call:
#define N 200000
__global__ void shortKernel(float* out_d, float* in_d){
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx < N) out_d[idx] = 1.23 * in_d[idx];
}
bool graphCreated = false;
cudaGraph_t graph;
cudaGraphExec_t instance;
for (int istep = 0; istep < NSTEP; istep++){
    if (!graphCreated){
        // First iteration only: capture NKERNEL kernel launches from the
        // stream into a graph and instantiate an executable graph
        cudaStreamBeginCapture(stream, cudaStreamCaptureModeGlobal);
        for (int ikrnl = 0; ikrnl < NKERNEL; ikrnl++){
            shortKernel<<<blocks, threads, 0, stream>>>(out_d, in_d);
        }
        cudaStreamEndCapture(stream, &graph);
        cudaGraphInstantiate(&instance, graph, NULL, NULL, 0);
        graphCreated = true;
    }
    // Launch the whole captured graph with a single API call
    cudaGraphLaunch(instance, stream);
    cudaStreamSynchronize(stream);
}
5-7 µs per kernel without overheads.

Method | Elapsed Time (ms) | Per Kernel w/ Overheads (µs)
--- | --- | ---
Synchronized | 231 | 11.55
Streams | 141 | 7.05
Graph | 124 | 6.20
void celeritas::detail::<unnamed>::launch_action_impl<
celeritas::ConditionalTrackExecutor<celeritas::detail::IsAlongStepActionEqual,
celeritas::detail::PropagationApplier<celeritas::detail::UniformFieldPropagatorFactory, void>>,
celeritas::detail::PropagationApplier<celeritas::detail::UniformFieldPropagatorFactory, void>,
(bool)>(celeritas::Range<celeritas::OpaqueId<celeritas::Thread_, unsigned int>>, T1)
[Profiler timeline screenshots: the region appears with the readable label along-step-uniform-msc::propagate]
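Readable labels like this come from annotating regions rather than relying on the mangled kernel name above; in Celeritas this is what ScopedProfiling (seen in the listings earlier) provides when CELER_ENABLE_PROFILING is set. A bare-bones NVTX sketch of the same idea, my example rather than Celeritas code:

#include <nvtx3/nvToolsExt.h>

// Push/pop an NVTX range around a region of work; profilers such as Nsight
// Systems display the range with this label on the timeline.
void profile_region_example()
{
    nvtxRangePushA("along-step-uniform-msc::propagate");
    // ... launch the kernels belonging to this region ...
    nvtxRangePop();
}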
Sorting consistently improves performance for the cms2018 problem but decreases it for the tilecal problem.
… the radix sort kernel is a significant overhead in that case …
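To make the trade-off concrete: the idea behind track sorting is to reorder track slots so that tracks limited by the same action are adjacent, at the cost of running a (radix) sort every step. A schematic Thrust sketch, illustrative only and not Celeritas's implementation:

#include <thrust/device_vector.h>
#include <thrust/sequence.h>
#include <thrust/sort.h>

// Sort track slot indices by the id of the action that limited each track,
// so that threads in a warp mostly follow the same action code path. The
// sort itself is the overhead that can outweigh the benefit on some problems.
void sort_tracks_by_action(thrust::device_vector<int>& action_ids,
                           thrust::device_vector<int>& track_slots)
{
    // track_slots = 0, 1, 2, ..., then permuted together with the keys
    thrust::sequence(track_slots.begin(), track_slots.end());
    thrust::sort_by_key(action_ids.begin(), action_ids.end(),
                        track_slots.begin());
}

The permuted track_slots can then be used to index track state in the action kernels, which is where the improved warp coherence pays off.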
Progress report: Profiling Celeritas on Nvidia GPUs - ExaTEPP workshop