Launch overhead measurement

These tests measure the overhead of launching a kernel from Julia, and compare it to that of plain CUDA.

Use nvvp (the NVIDIA Visual Profiler) to visualize the overhead, disabling the option "Start execution with profiling enabled". With nvprof on the command line, the corresponding flag is --profile-from-start off.

For example:

$ nvprof --profile-from-start off ./cuda
==9929== NVPROF is profiling process 9929, command: ./cuda
CPU time: 36.00us
GPU time: 30.82us
==9929== Profiling application: ./cuda
==9929== Profiling result:
Time(%)      Time     Calls       Avg       Min       Max  Name
100.00%  125.70us         5  25.139us  25.088us  25.281us  kernel_dummy

This shows that launching a kernel takes 36 us from the host's point of view (the CPU time) and about 31 us when measured with event counters. Even the latter contains some overhead, because according to nvprof the kernel itself only took about 25 us.
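To make that breakdown concrete, here is a small Python sketch that redoes the arithmetic on the numbers printed above. The variable names are illustrative and not part of the tests, and the kernel time used is the average nvprof reports over the 5 calls.

```python
# Numbers copied from the nvprof run of ./cuda above (all in microseconds).
cpu_time = 36.00      # "CPU time" line printed by the test program
gpu_time = 30.82      # "GPU time" line, measured with event counters
kernel_avg = 25.139   # nvprof "Avg" column for kernel_dummy

# Overhead as seen from the host, relative to the kernel itself.
host_overhead = cpu_time - kernel_avg
# Overhead still included in the event-based measurement.
event_overhead = gpu_time - kernel_avg

print(f"host-side overhead:   {host_overhead:.2f} us")
print(f"event-based overhead: {event_overhead:.2f} us")
```

In other words, roughly 11 us of the host-side time and roughly 6 us of the event-based time are not spent in the kernel.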

That run used plain CUDA. Luckily, CUDAdrv.jl doesn't perform much worse:

$ nvprof --profile-from-start off ./cuda.jl
==19694== NVPROF is profiling process 19694, command: julia ./cuda.jl
CPU time: 36.23us
GPU time: 31.62us
==19694== Profiling application: julia ./cuda.jl
==19694== Profiling result:
Time(%)      Time     Calls       Avg       Min       Max  Name
100.00%  125.70us         5  25.139us  25.088us  25.312us  kernel_dummy

But more importantly, CUDAnative.jl performs equally well:

$ nvprof --profile-from-start off ./cudanative.jl
==21135== NVPROF is profiling process 21135, command: julia ./cudanative.jl
CPU time: 36.42us
GPU time: 31.81us
==21135== Profiling application: julia ./cudanative.jl
==21135== Profiling result:
Time(%)      Time     Calls       Avg       Min       Max  Name
100.00%  123.78us         5  24.755us  24.704us  24.928us  julia_kernel_dummy_60488
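The three runs can be put side by side with a short script. The following Python sketch is only an illustration: the helper avg_kernel_us is hypothetical, and it assumes nvprof prints the Avg column in microseconds, as it does in the output above (nvprof switches units for longer kernels, which this sketch does not handle).

```python
# Result rows copied verbatim from the three nvprof runs above.
ROWS = {
    "CUDA": "100.00%  125.70us         5  25.139us  25.088us  25.281us  kernel_dummy",
    "CUDAdrv.jl": "100.00%  125.70us         5  25.139us  25.088us  25.312us  kernel_dummy",
    "CUDAnative.jl": "100.00%  123.78us         5  24.755us  24.704us  24.928us  julia_kernel_dummy_60488",
}
# "CPU time" reported by each test program, in microseconds.
CPU_US = {"CUDA": 36.00, "CUDAdrv.jl": 36.23, "CUDAnative.jl": 36.42}


def avg_kernel_us(row):
    """Return the Avg column (4th field) of an nvprof result row, in us."""
    avg = row.split()[3]
    if not avg.endswith("us"):
        # Assumption: only microsecond values are supported in this sketch.
        raise ValueError(f"expected a value in microseconds, got {avg!r}")
    return float(avg[:-2])


# Host-side launch overhead: CPU time minus average kernel time.
overheads = {name: CPU_US[name] - avg_kernel_us(row) for name, row in ROWS.items()}

for name, value in overheads.items():
    print(f"{name:14s} {value:6.2f} us")
```

On these numbers the host-side overhead of all three variants lies within about 1 us of each other.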

Note that these are simple kernels. With more complex kernels, Julia's heuristics start fighting us (e.g., when dealing with long argument lists, inference performs worse and sometimes refuses to expand our generated functions).

Also, passing more arguments adds overhead because CUDA has to copy them over; this cannot be avoided. For use of hardware counters, see the CUPTI library.