view(data, idx) boundschecking is disproportionately expensive #1678

Closed
maleadt opened this issue Nov 23, 2022 · 2 comments · Fixed by JuliaGPU/GPUArrays.jl#499
Labels: performance (How fast can we go?)

Comments


maleadt commented Nov 23, 2022

It's expected that bounds checking costs something, but not three orders of magnitude, as reported with view(data, idx) here: https://discourse.julialang.org/t/correct-implementation-of-cuarrays-slicing-operations/90600. Can't this just throw a device-side exception instead?
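
For reference, the pattern in question is fancy indexing of a CuArray through a view; a minimal sketch (array sizes and names are illustrative, not taken from the Discourse thread):

using CUDA

data = CuArray(ones(10_000_000))      # device data
idx  = CuArray(collect(1:5_000_000))  # device indices

# Constructing and materializing this view bounds-checks every entry of idx,
# which is where the reported slowdown comes from.
view(data, idx) .+= 1.0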

maleadt added the bug (Something isn't working) and performance (How fast can we go?) labels, then removed the bug label, on Nov 23, 2022

maleadt commented Oct 31, 2023

MWE from Discourse:

using BenchmarkTools
using CUDA
CUDA.allowscalar(true)

# size of array
dsize = 10_000_000

# index
idx_h = collect(1:Int64(dsize/2))
idx_d = CuArray(idx_h)

# datasets
dt1_h = ones(dsize)
dt2_d = CuArray(dt1_h)
dt3_d = CuArray(dt1_h)
dt4_d = CuArray(dt1_h)

println("cpu version:")
@views function t1!(idx_h, dt1_h)
    dt1_h[idx_h] .+= 1.0
    return nothing
end
display(@benchmark t1!($idx_h, $dt1_h))

println("gpu version, data and index on gpu:")
@views function t2!(idx_d, dt2_d)
    dt2_d[idx_d] .+= 1.0
    return nothing
end
display(@benchmark CUDA.@sync t2!($idx_d, $dt2_d))

println("gpu version, data on gpu, index on cpu:")
@views function t3!(idx_h, dt3_d)
    dt3_d[idx_h] .+= 1.0
    return nothing
end
display(@benchmark CUDA.@sync t3!($idx_h, $dt3_d))

println("kernel version, data and index on gpu:")
function kernel!(idx_d, dt4_d, sizeidx)
    ix = (blockIdx().x-1)*blockDim().x+threadIdx().x
    if ix <= sizeidx
        dt4_d[idx_d[ix]] += 1.0
    end
    return nothing
end
function t4!(idx_d, dt4_d)
    tds = 768
    bls = cld(length(idx_d), tds)
    CUDA.@sync begin
        @cuda threads=tds blocks=bls kernel!(idx_d, dt4_d, length(idx_d))
    end
    return nothing
end
display(@benchmark t4!($idx_d, $dt4_d))


maleadt commented Oct 31, 2023

Comparison with JuliaGPU/GPUArrays.jl#499:

  • CPU version: 5 ms
  • GPU version, data and index on GPU: was 13 ms, now 125 μs
  • GPU version, data on GPU, index on CPU: 7 ms
  • Kernel version, data and index on GPU: 23 μs

Basically, the bounds check itself is now significantly faster. The case where the indices are on the CPU is still slow, but disabling the bounds check wouldn't help there: the cost comes from uploading the indices to the GPU.
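
If the same indices are reused across calls, that transfer can be hoisted out by converting them to a CuArray once; a sketch reusing the MWE's variables (not part of the original report):

idx_d = CuArray(idx_h)              # one-time host-to-device copy of the indices
for _ in 1:100
    @views dt3_d[idx_d] .+= 1.0     # no per-iteration upload of the index array
end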

Compared to the custom kernel there's still a cost, of course, but that can be avoided by adding @inbounds:

julia> @views function t2!(idx_d, dt2_d)
           dt2_d[idx_d] .+= 1.0
           return nothing
       end
t2! (generic function with 1 method)

julia> display(@benchmark CUDA.@sync t2!($idx_d, $dt2_d))
BenchmarkTools.Trial: 10000 samples with 1 evaluation.
 Range (min … max):  126.949 μs …  58.683 ms  ┊ GC (min … max): 0.00% … 27.96%
 Time  (median):     136.689 μs               ┊ GC (median):    0.00%
 Time  (mean ± σ):   148.715 μs ± 823.513 μs  ┊ GC (mean ± σ):  2.14% ±  0.39%

                    ▂▂▃▃▅▄▄▆▇▇█▇▇▇▆▆▄▄▃▃▂▂▁
  ▂▁▁▂▂▂▂▃▃▃▄▄▅▅▆▇██████████████████████████▆▆▆▅▅▄▄▃▃▃▃▃▃▃▂▃▂▂▂ ▅
  127 μs           Histogram: frequency by time          147 μs <

 Memory estimate: 13.77 KiB, allocs estimate: 284.

julia> @views function t2!(idx_d, dt2_d)
           @inbounds dt2_d[idx_d] .+= 1.0
           return nothing
       end
t2! (generic function with 1 method)

julia> display(@benchmark CUDA.@sync t2!($idx_d, $dt2_d))
BenchmarkTools.Trial: 10000 samples with 1 evaluation.
 Range (min … max):  25.319 μs … 164.049 μs  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     26.300 μs               ┊ GC (median):    0.00%
 Time  (mean ± σ):   26.330 μs ±   1.479 μs  ┊ GC (mean ± σ):  0.00% ± 0.00%

                             ▁▃▄▄█▅▄▄▃▂
  ▁▁▁▁▁▁▁▁▂▂▂▂▃▃▃▄▃▄▄▄▄▄▅▅▆▇█████████████▇▆▆▅▄▄▄▄▃▃▃▃▂▂▂▂▂▂▁▂▁ ▃
  25.3 μs         Histogram: frequency by time         27.1 μs <

 Memory estimate: 1.64 KiB, allocs estimate: 22.
