Fix dynamic dispatch issues #2235

Merged 3 commits into JuliaGPU:master from patch-2 on Jan 15, 2024
Conversation

MilesCranmer (Contributor) commented Jan 15, 2024

MilesCranmer changed the title from "Fix dynamic dispatch issue" to "Fix dynamic dispatch issues" on Jan 15, 2024
maleadt merged commit a152355 into JuliaGPU:master on Jan 15, 2024 (1 check passed)
maleadt (Member) commented Jan 15, 2024

Thanks!

MilesCranmer deleted the patch-2 branch on January 16, 2024
maleadt (Member) commented Jan 17, 2024

I'm not sure these changes, which make the code less readable, were worth it: on our benchmarks there's no difference (https://speed.juliagpu.org/timeline/#/?exe=11,9&env=1&base=none&ben=kernel/launch&revs=50&equid=off&quarts=on&extr=on).

EDIT: actually, in isolation, I do see a difference:

```julia
julia> kernel(args...) = nothing

julia> k = @cuda launch=false kernel(1:10...)

julia> @benchmark cudacall(k.fun, Tuple{CUDA.KernelState, Vararg{Int64, 10}}, CUDA.KernelState(C_NULL,0), 1:10...)
BenchmarkTools.Trial: 10000 samples with 9 evaluations.
 Range (min … max):  2.084 μs …  5.157 μs  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     2.336 μs              ┊ GC (median):    0.00%
 Time  (mean ± σ):   2.341 μs ± 59.996 ns  ┊ GC (mean ± σ):  0.00% ± 0.00%

                                     ▄█▆█▇▅▁▂▁
  ▂▁▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▃▃▃▃▃▄▄▆▆███████████▆▆▇▇▅▅▄▃▂▂▂▂▂ ▄
  2.08 μs        Histogram: frequency by time        2.46 μs <

 Memory estimate: 224 bytes, allocs estimate: 3.

julia> @benchmark cudacall(k.fun, Tuple{CUDA.KernelState, Vararg{Int64, 10}}, CUDA.KernelState(C_NULL,0), 1:10...)
BenchmarkTools.Trial: 10000 samples with 7 evaluations.
 Range (min … max):  4.471 μs … 220.074 μs  ┊ GC (min … max): 0.00% … 96.21%
 Time  (median):     4.777 μs               ┊ GC (median):    0.00%
 Time  (mean ± σ):   4.875 μs ±   2.168 μs  ┊ GC (mean ± σ):  0.43% ±  0.96%

               ▅▇█▆▄▁
  ▂▁▂▂▂▂▂▂▂▃▃▅███████▇▅▄▃▃▃▂▃▃▃▃▃▃▃▃▃▂▂▂▃▃▃▃▃▃▃▃▃▂▂▂▂▂▂▂▂▂▂▂▂ ▃
  4.47 μs         Histogram: frequency by time        5.57 μs <

 Memory estimate: 880 bytes, allocs estimate: 14.
```

I'll push a commit that makes the new version slightly more readable :-)

MilesCranmer (Contributor, Author) commented

Cool! Yeah, I also saw a similar (or even larger) speedup from these changes. It really helps for tiny repeated kernel calls (which is my use case), where launch overhead dominates the runtime. :)
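
To make that use case concrete, here is a minimal sketch (assuming CUDA.jl and a working GPU; `tiny!` is just an illustrative kernel name, not anything from this PR) where the device does almost no work, so any per-launch host-side cost, including dynamic dispatch in argument handling, shows up directly in the wall time:

```julia
using CUDA

# A trivial kernel: almost no device work, so each call is dominated
# by host-side launch overhead (argument conversion, dispatch, cudacall).
function tiny!(x)
    @inbounds x[1] += 1f0
    return nothing
end

x = CUDA.zeros(Float32, 1)
k = @cuda launch=false tiny!(x)  # compile once, reuse the kernel object

# Many tiny launches: per-launch overhead dominates this loop's runtime.
for _ in 1:10_000
    k(x; threads=1, blocks=1)
end
synchronize()
```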

Also, with a very large number of arguments it can help even more, since that is where type inference on the cudaconvert calls struggles the most.
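
That inference effect can be sketched without a GPU at all. In the following, `convert_one` is a hypothetical stand-in for `cudaconvert`: mapping over a `Tuple` stays fully inferred, while routing the same arguments through a `Vector{Any}` forces a dynamic dispatch per element:

```julia
# Hypothetical stand-in for CUDA.cudaconvert: one method per argument type.
convert_one(x::Int) = Float64(x)
convert_one(x) = x

# Type-stable: map over a Tuple is unrolled, so the compiler infers the
# converted type of every argument at compile time.
stable(args...) = map(convert_one, args)

# Type-unstable: Vector{Any} erases the element types, so each call to
# convert_one is resolved by dynamic dispatch at runtime.
unstable(args...) = [convert_one(x) for x in Any[args...]]

using InteractiveUtils
@code_warntype stable(1, 2.0, "x")    # fully concrete Tuple return type
@code_warntype unstable(1, 2.0, "x")  # ::Any propagating through the body
```

In these terms, keeping the converted arguments in a concretely-typed tuple is what lets a launch path avoid runtime dispatch.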
