Fix dynamic dispatch issues #2235

Merged 3 commits into JuliaGPU:master from patch-2 on Jan 15, 2024
Conversation

MilesCranmer (Contributor) commented Jan 15, 2024

MilesCranmer changed the title from "Fix dynamic dispatch issue" to "Fix dynamic dispatch issues" on Jan 15, 2024
maleadt merged commit a152355 into JuliaGPU:master on Jan 15, 2024 (1 check passed)
maleadt (Member) commented Jan 15, 2024

Thanks!

MilesCranmer deleted the patch-2 branch on January 16, 2024
maleadt (Member) commented Jan 17, 2024

I'm not sure these changes, which make the code less readable, were worth it: on our benchmarks there's no difference (https://speed.juliagpu.org/timeline/#/?exe=11,9&env=1&base=none&ben=kernel/launch&revs=50&equid=off&quarts=on&extr=on).

EDIT: actually, in isolation, I do see a difference:

```julia
julia> kernel(args...) = nothing

julia> k = @cuda launch=false kernel(1:10...)

julia> @benchmark cudacall(k.fun, Tuple{CUDA.KernelState, Vararg{Int64, 10}}, CUDA.KernelState(C_NULL,0), 1:10...)
BenchmarkTools.Trial: 10000 samples with 9 evaluations.
 Range (min … max):  2.084 μs …  5.157 μs  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     2.336 μs              ┊ GC (median):    0.00%
 Time  (mean ± σ):   2.341 μs ± 59.996 ns  ┊ GC (mean ± σ):  0.00% ± 0.00%

                                     ▄█▆█▇▅▁▂▁
  ▂▁▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▃▃▃▃▃▄▄▆▆███████████▆▆▇▇▅▅▄▃▂▂▂▂▂ ▄
  2.08 μs        Histogram: frequency by time        2.46 μs <

 Memory estimate: 224 bytes, allocs estimate: 3.

julia> @benchmark cudacall(k.fun, Tuple{CUDA.KernelState, Vararg{Int64, 10}}, CUDA.KernelState(C_NULL,0), 1:10...)
BenchmarkTools.Trial: 10000 samples with 7 evaluations.
 Range (min … max):  4.471 μs … 220.074 μs  ┊ GC (min … max): 0.00% … 96.21%
 Time  (median):     4.777 μs               ┊ GC (median):    0.00%
 Time  (mean ± σ):   4.875 μs ±   2.168 μs  ┊ GC (mean ± σ):  0.43% ±  0.96%

               ▅▇█▆▄▁
  ▂▁▂▂▂▂▂▂▂▃▃▅███████▇▅▄▃▃▃▂▃▃▃▃▃▃▃▃▃▂▂▂▃▃▃▃▃▃▃▃▃▂▂▂▂▂▂▂▂▂▂▂▂ ▃
  4.47 μs         Histogram: frequency by time        5.57 μs <

 Memory estimate: 880 bytes, allocs estimate: 14.
```

I'll push a commit that makes the new version slightly more readable :-)

MilesCranmer (Contributor, Author) commented

Cool! Yeah, I also saw a similar (or even larger) speedup from these changes. It really helps for tiny repeated kernel calls (which is my use case), where launch overhead dominates the runtime. :)
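
To make that use case concrete, here is a minimal sketch (assuming CUDA.jl and a working GPU; `tiny!` is just an illustrative kernel name, not anything from this PR) where the device does almost no work, so any per-launch host-side cost, including dynamic dispatch in argument handling, shows up directly in the wall time:

```julia
using CUDA

# A trivial kernel: almost no device work, so each call is dominated
# by host-side launch overhead (argument conversion, dispatch, cudacall).
function tiny!(x)
    @inbounds x[1] += 1f0
    return nothing
end

x = CUDA.zeros(Float32, 1)
k = @cuda launch=false tiny!(x)  # compile once, reuse the kernel object

# Many tiny launches: per-launch overhead dominates this loop's runtime.
for _ in 1:10_000
    k(x; threads=1, blocks=1)
end
synchronize()
```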

Also, with a very large number of arguments it can help even more, since that is where type inference on the cudaconvert calls struggles the most.
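
That inference effect can be sketched without a GPU at all. In the following, `convert_one` is a hypothetical stand-in for `cudaconvert`: mapping over a `Tuple` stays fully inferred, while routing the same arguments through a `Vector{Any}` forces a dynamic dispatch per element:

```julia
# Hypothetical stand-in for CUDA.cudaconvert: one method per argument type.
convert_one(x::Int) = Float64(x)
convert_one(x) = x

# Type-stable: map over a Tuple is unrolled, so the compiler infers the
# converted type of every argument at compile time.
stable(args...) = map(convert_one, args)

# Type-unstable: Vector{Any} erases the element types, so each call to
# convert_one is resolved by dynamic dispatch at runtime.
unstable(args...) = [convert_one(x) for x in Any[args...]]

using InteractiveUtils
@code_warntype stable(1, 2.0, "x")    # fully concrete Tuple return type
@code_warntype unstable(1, 2.0, "x")  # ::Any propagating through the body
```

In these terms, keeping the converted arguments in a concretely-typed tuple is what lets a launch path avoid runtime dispatch.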
