view(data, idx) boundschecking is disproportionately expensive #1678

Closed
maleadt opened this issue Nov 23, 2022 · 2 comments · Fixed by JuliaGPU/GPUArrays.jl#499
Labels: performance (How fast can we go?)

Comments


maleadt commented Nov 23, 2022

It's expected that bounds checking costs something, but not three orders of magnitude, as reported with view(data, idx) here: https://discourse.julialang.org/t/correct-implementation-of-cuarrays-slicing-operations/90600. Can't this just throw a device-side exception instead?
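
For reference, the pattern in question is fancy indexing of a CuArray through a view; a minimal sketch (array sizes and names are illustrative, not taken from the Discourse thread):

using CUDA

data = CuArray(ones(10_000_000))      # device data
idx  = CuArray(collect(1:5_000_000))  # device indices

# Constructing and materializing this view bounds-checks every entry of idx,
# which is where the reported slowdown comes from.
view(data, idx) .+= 1.0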

maleadt added the bug (Something isn't working) and performance (How fast can we go?) labels, then removed the bug label, on Nov 23, 2022

maleadt commented Oct 31, 2023

MWE from Discourse:

using BenchmarkTools
using CUDA
CUDA.allowscalar(true)

# size of array
dsize = 10_000_000

# index
idx_h = collect(1:Int64(dsize/2))
idx_d = CuArray(idx_h)

# datasets
dt1_h = ones(dsize)
dt2_d = CuArray(dt1_h)
dt3_d = CuArray(dt1_h)
dt4_d = CuArray(dt1_h)

println("cpu version:")
@views function t1!(idx_h, dt1_h)
    dt1_h[idx_h] .+= 1.0
    return nothing
end
display(@benchmark t1!($idx_h, $dt1_h))

println("gpu version, data and index on gpu:")
@views function t2!(idx_d, dt2_d)
    dt2_d[idx_d] .+= 1.0
    return nothing
end
display(@benchmark CUDA.@sync t2!($idx_d, $dt2_d))

println("gpu version, data on gpu, index on cpu:")
@views function t3!(idx_h, dt3_d)
    dt3_d[idx_h] .+= 1.0
    return nothing
end
display(@benchmark CUDA.@sync t3!($idx_h, $dt3_d))

println("kernel version, data and index on gpu:")
function kernel!(idx_d, dt4_d, sizeidx)
    ix = (blockIdx().x-1)*blockDim().x+threadIdx().x
    if ix <= sizeidx
        dt4_d[idx_d[ix]] += 1.0
    end
    return nothing
end
function t4!(idx_d, dt4_d)
    tds = 768
    bls = cld(length(idx_d), tds)
    CUDA.@sync begin
        @cuda threads=tds blocks=bls kernel!(idx_d, dt4_d, length(idx_d))
    end
    return nothing
end
display(@benchmark t4!($idx_d, $dt4_d))


maleadt commented Oct 31, 2023

Comparison with JuliaGPU/GPUArrays.jl#499:

  • CPU version: 5 ms
  • GPU version, data and index on GPU: was 13 ms, now 125 μs
  • GPU version, data on GPU, index on CPU: 7 ms
  • Kernel version, data and index on GPU: 23 μs

Basically, the bounds check itself is now significantly faster. The case where the indices are on the CPU is still slow, but disabling the bounds check wouldn't help there: the cost comes from uploading the indices to the GPU.
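
If the same indices are reused across calls, that transfer can be hoisted out by converting them to a CuArray once; a sketch reusing the MWE's variables (not part of the original report):

idx_d = CuArray(idx_h)              # one-time host-to-device copy of the indices
for _ in 1:100
    @views dt3_d[idx_d] .+= 1.0     # no per-iteration upload of the index array
end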

Compared to the custom kernel there's still a cost, of course, but that can be avoided by adding @inbounds:

julia> @views function t2!(idx_d, dt2_d)
           dt2_d[idx_d] .+= 1.0
           return nothing
       end
t2! (generic function with 1 method)

julia> display(@benchmark CUDA.@sync t2!($idx_d, $dt2_d))
BenchmarkTools.Trial: 10000 samples with 1 evaluation.
 Range (min … max):  126.949 μs …  58.683 ms  ┊ GC (min … max): 0.00% … 27.96%
 Time  (median):     136.689 μs               ┊ GC (median):    0.00%
 Time  (mean ± σ):   148.715 μs ± 823.513 μs  ┊ GC (mean ± σ):  2.14% ±  0.39%

                    ▂▂▃▃▅▄▄▆▇▇█▇▇▇▆▆▄▄▃▃▂▂▁
  ▂▁▁▂▂▂▂▃▃▃▄▄▅▅▆▇██████████████████████████▆▆▆▅▅▄▄▃▃▃▃▃▃▃▂▃▂▂▂ ▅
  127 μs           Histogram: frequency by time          147 μs <

 Memory estimate: 13.77 KiB, allocs estimate: 284.

julia> @views function t2!(idx_d, dt2_d)
           @inbounds dt2_d[idx_d] .+= 1.0
           return nothing
       end
t2! (generic function with 1 method)

julia> display(@benchmark CUDA.@sync t2!($idx_d, $dt2_d))
BenchmarkTools.Trial: 10000 samples with 1 evaluation.
 Range (min … max):  25.319 μs … 164.049 μs  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     26.300 μs               ┊ GC (median):    0.00%
 Time  (mean ± σ):   26.330 μs ±   1.479 μs  ┊ GC (mean ± σ):  0.00% ± 0.00%

                             ▁▃▄▄█▅▄▄▃▂
  ▁▁▁▁▁▁▁▁▂▂▂▂▃▃▃▄▃▄▄▄▄▄▅▅▆▇█████████████▇▆▆▅▄▄▄▄▃▃▃▃▂▂▂▂▂▂▁▂▁ ▃
  25.3 μs         Histogram: frequency by time         27.1 μs <

 Memory estimate: 1.64 KiB, allocs estimate: 22.
