view(data, idx) boundschecking is disproportionately expensive #1678
Labels: performance (How fast can we go?)

Comments
maleadt added the performance (How fast can we go?) label and removed the bug (Something isn't working) label on Nov 23, 2022
MWE from Discourse:

```julia
using BenchmarkTools
using CUDA
CUDA.allowscalar(true)

# size of array
dsize = 10_000_000

# index
idx_h = collect(1:Int64(dsize/2))
idx_d = CuArray(idx_h)

# datasets
dt1_h = ones(dsize)
dt2_d = CuArray(dt1_h)
dt3_d = CuArray(dt1_h)
dt4_d = CuArray(dt1_h)

println("cpu version:")
@views function t1!(idx_h, dt1_h)
    dt1_h[idx_h] .+= 1.0
    return nothing
end
display(@benchmark t1!($idx_h, $dt1_h))

println("gpu version, data and index on gpu:")
@views function t2!(idx_d, dt2_d)
    dt2_d[idx_d] .+= 1.0
    return nothing
end
display(@benchmark CUDA.@sync t2!($idx_d, $dt2_d))

println("gpu version, data on gpu, index on cpu:")
@views function t3!(idx_h, dt3_d)
    dt3_d[idx_h] .+= 1.0
    return nothing
end
display(@benchmark CUDA.@sync t3!($idx_h, $dt3_d))

println("kernel version, data and index on gpu:")
function kernel!(idx_d, dt4_d, sizeidx)
    ix = (blockIdx().x - 1) * blockDim().x + threadIdx().x
    if ix ≤ sizeidx
        dt4_d[idx_d[ix]] += 1.0
    end
    return nothing
end
function t4!(idx_d, dt4_d)
    tds = 768
    bls = cld(length(idx_d), tds)
    CUDA.@sync begin
        @cuda threads=tds blocks=bls kernel!(idx_d, dt4_d, length(idx_d))
    end
    return nothing
end
display(@benchmark t4!($idx_d, $dt4_d))
```
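As the `t3!` case shows, when the index array lives on the CPU, each call pays a host-to-device transfer for the indices, which dominates the runtime. A hedged sketch (building on the variables from the MWE above; `t3_hoisted!` and `idx_d_once` are illustrative names, not part of the original) of hoisting that upload out of the benchmarked function:

```julia
using CUDA

# Sketch: upload the indices to the GPU once, then reuse the device copy.
# Indexing a CuArray with another CuArray avoids the per-call
# host-to-device transfer that `t3!(idx_h, dt3_d)` incurs.
@views function t3_hoisted!(idx_d, dt3_d)
    dt3_d[idx_d] .+= 1.0
    return nothing
end

idx_d_once = CuArray(idx_h)  # one-time transfer, paid outside the hot loop
# display(@benchmark CUDA.@sync t3_hoisted!($idx_d_once, $dt3_d))
```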
Comparison with JuliaGPU/GPUArrays.jl#499:
Basically, the bounds check itself is now significantly faster. The case where the indices are on the CPU is still slow, but disabling the bounds check wouldn't help there, as the cost comes from uploading the indices to the GPU. Compared to the custom kernel there's still a cost, of course, but that can be avoided by adding …
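For reference, the standard Julia mechanism for eliding bounds checks is the `@inbounds` macro; the sketch below applies it to the broadcast variant from the MWE. This is only an assumption about what the truncated comment refers to, and it is only safe when every index is guaranteed in-bounds:

```julia
using CUDA

# Sketch: skip bounds checking on the indexed broadcast with @inbounds.
# UNSAFE if any element of idx_d falls outside axes(dt2_d, 1).
@views function t2_inbounds!(idx_d, dt2_d)
    @inbounds dt2_d[idx_d] .+= 1.0
    return nothing
end
```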
It's expected that bounds checking costs something, but not the 3 orders of magnitude reported here for `view(data, idx)`: https://discourse.julialang.org/t/correct-implementation-of-cuarrays-slicing-operations/90600. Can't this just throw a device-side exception instead?
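One way a device-side check could look, sketched as a variant of the custom kernel from the MWE (the kernel name and validation logic are illustrative assumptions, not CUDA.jl's actual implementation): each thread validates its own index on the GPU, so no precomputed check on the host is needed.

```julia
using CUDA

# Sketch: per-thread bounds validation inside the kernel. CUDA.jl supports
# throwing from device code; the exception aborts the kernel and is
# surfaced on the next synchronization.
function checked_kernel!(idx, data, n)
    ix = (blockIdx().x - 1) * blockDim().x + threadIdx().x
    if ix ≤ n
        j = idx[ix]
        # device-side bounds check, replacing a host-side one
        1 ≤ j ≤ length(data) || throw(BoundsError())
        @inbounds data[j] += 1.0
    end
    return nothing
end
```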