at-benchmark captures GPU arrays #156

Closed
mohamed82008 opened this issue Nov 19, 2018 · 13 comments
Labels
bug (Something isn't working), cuda array (Stuff about CuArray), upstream (Somebody else's problem)

Comments

@mohamed82008

The following code triggers this error:

ERROR: CUDA error: out of memory (code #2, ERROR_OUT_OF_MEMORY)
Stacktrace:
 [1] macro expansion at C:\Users\user\.julia\packages\CUDAdrv\LC5XS\src\base.jl:147 [inlined]
 [2] #alloc#3(::CUDAdrv.Mem.CUmem_attach, ::Function, ::Int64, ::Bool) at C:\Users\user\.julia\packages\CUDAdrv\LC5XS\src\memory.jl:161
 [3] alloc at C:\Users\user\.julia\packages\CUDAdrv\LC5XS\src\memory.jl:157 [inlined] (repeats 2 times)
 [4] CuArray{Float32,2}(::Tuple{Int64,Int64}) at C:\Users\user\.julia\packages\CUDAdrv\LC5XS\src\array.jl:33
 [5] similar at C:\Users\user\.julia\packages\CUDAdrv\LC5XS\src\array.jl:83 [inlined]
 [6] similar at .\abstractarray.jl:571 [inlined]
 [7] h_bench(::Int64, ::Int64) at .\REPL[4]:7
 [8] macro expansion at .\show.jl:555 [inlined]
 [9] top-level scope at .\REPL[6]:2 [inlined]
 [10] top-level scope at .\none:0
using CUDAdrv, CUDAnative, BenchmarkTools

function kernel_vadd(a, b, c)
    i = (blockIdx().x-1) * blockDim().x + threadIdx().x
    c[i] = a[i] + b[i]
    return nothing
end

function h(m, n)
	# CUDAdrv functionality: generate and upload data
	a = round.(rand(Float32, (m, n)) * 100)
	b = round.(rand(Float32, (m, n)) * 100)
	d_a = CuArray(a)
	d_b = CuArray(b)
	d_c = similar(d_a)  # output array

	@cuda threads=12 kernel_vadd(d_a, d_b, d_c)
end

function h_bench(m, n)
	# CUDAdrv functionality: generate and upload data
	a = round.(rand(Float32, (m, n)) * 100)
	b = round.(rand(Float32, (m, n)) * 100)
	d_a = CuArray(a)
	d_b = CuArray(b)
	d_c = similar(d_a)  # output array

	@benchmark @cuda threads=12 kernel_vadd($d_a, $d_b, $d_c)
end

# Works
for i in 1:5
	@show i
	h(10_000, 10_000)
end

# Errors after i = 3
for i in 1:5
	@show i
	h_bench(10_000, 10_000)
end
@maleadt
Member

maleadt commented Nov 22, 2018

Problem is that CUDAdrv.CuArray memory management is pretty simple: unsafe_free! is registered as a finalizer, but Julia offers no guarantee that finalizers run any time soon (the Julia GC does not know about GPU memory pressure). As a result, some of your intermediate arrays aren't freed. You can call finalize(d_...) manually, although that's a relatively costly call.

Alternatively, with CuArray from CuArrays.jl, the memory management is much more sophisticated and forces Julia GC collection when the GPU goes out of memory. Switching to that type (which you ought to anyway, since I've just deprecated CUDAdrv.CuArray) solves your problem and makes your example work 🙂
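The first suggestion can be sketched as follows. This is a hypothetical variant of the reporter's h_bench (the name h_bench_freed is invented here), which explicitly finalizes the device arrays once the benchmark result has been collected; it needs a CUDA-capable GPU to actually run:

```julia
using CUDAdrv, CUDAnative, BenchmarkTools

# Hypothetical sketch: eagerly release device memory after benchmarking
# instead of waiting for the GC to run the unsafe_free! finalizers.
function h_bench_freed(m, n)
	a = round.(rand(Float32, (m, n)) * 100)
	b = round.(rand(Float32, (m, n)) * 100)
	d_a = CuArray(a)
	d_b = CuArray(b)
	d_c = similar(d_a)  # output array

	result = @benchmark @cuda threads=12 kernel_vadd($d_a, $d_b, $d_c)

	# Run the finalizers explicitly so the GPU allocations are returned
	# immediately; relatively costly, but deterministic.
	finalize(d_a); finalize(d_b); finalize(d_c)
	return result
end
```

Note that, as the rest of the thread shows, @benchmark itself may still root the interpolated arrays, so explicit finalization alone is not guaranteed to release everything.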

@maleadt maleadt closed this as completed Nov 22, 2018
@mohamed82008
Author

I switched to CuArrays and am still running into this problem.

@maleadt
Member

maleadt commented Nov 22, 2018

Does it still happen after 5+3 iterations? How much memory does your GPU have?
Are you using the master branch of CuArrays?

@maleadt maleadt reopened this Nov 22, 2018
@mohamed82008
Author

CUDAnative, CuArrays, CUDAdrv, GPUArrays, CUDAapi, NNlib master branches and Julia:

julia> versioninfo()
Julia Version 1.0.1
Commit 0d713926f8 (2018-09-29 19:05 UTC)
Platform Info:
  OS: Linux (x86_64-pc-linux-gnu)
  CPU: Intel(R) Core(TM) i7-6700 CPU @ 3.40GHz
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-6.0.0 (ORCJIT, skylake)

@mohamed82008
Author

I used 5000 this time, since I think this machine has less memory, and it still fails after the 3rd iteration.

using CuArrays, CUDAnative, BenchmarkTools

function kernel_vadd(a, b, c)
    i = (blockIdx().x-1) * blockDim().x + threadIdx().x
    c[i] = a[i] + b[i]
    return nothing
end

function h(m, n)
	# CUDAdrv functionality: generate and upload data
	a = round.(rand(Float32, (m, n)) * 100)
	b = round.(rand(Float32, (m, n)) * 100)
	d_a = CuArray(a)
	d_b = CuArray(b)
	d_c = similar(d_a)  # output array

	@cuda threads=12 kernel_vadd(d_a, d_b, d_c)
end

function h_bench(m, n)
	# CUDAdrv functionality: generate and upload data
	a = round.(rand(Float32, (m, n)) * 100)
	b = round.(rand(Float32, (m, n)) * 100)
	d_a = CuArray(a)
	d_b = CuArray(b)
	d_c = similar(d_a)  # output array

	@benchmark @cuda threads=12 kernel_vadd($d_a, $d_b, $d_c)
end

# Works
for i in 1:5
	@show i
	h(5_000, 5_000)
end

# Errors after i = 3
for i in 1:5
	@show i
	h_bench(5_000, 5_000)
end

@maleadt maleadt transferred this issue from JuliaGPU/CUDAnative.jl Nov 22, 2018
@maleadt
Member

maleadt commented Nov 22, 2018

Confirmed. Another bug, see JuliaGPU/CuArrays.jl#169

@maleadt
Member

maleadt commented Nov 22, 2018

Ah, so apart from bugs, this is @benchmark capturing the arrays:

julia> mutable struct Foo
       bar::String
       end

julia> function main()
       x = Foo("whatever")
       finalizer(x) do x
         Core.println(x.bar)
       end
       nothing
     end
main (generic function with 1 method)

julia> main()

julia> GC.gc()
whatever

julia> function main()
       x = Foo("whatever")
       finalizer(x) do x
         Core.println(x.bar)
       end
       @benchmark $x
       nothing
     end
main (generic function with 1 method)

julia> main()

julia> GC.gc()

julia> GC.gc()

julia> GC.gc()

julia> GC.gc()

@jrevels, this seems unwanted, but somewhat expected? Are there workarounds?
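One possible workaround sketch, using BenchmarkTools' setup/teardown keywords: allocate the device arrays inside setup and free them in teardown, so @benchmark never closes over long-lived arrays. The name h_bench_setup is hypothetical, and this needs a CUDA GPU to run:

```julia
using CuArrays, CUDAnative, BenchmarkTools

# Hypothetical workaround: arrays are created per sample in `setup` and
# finalized per sample in `teardown`, so nothing outlives the benchmark.
function h_bench_setup(m, n)
	@benchmark(
		@cuda(threads=12, kernel_vadd(d_a, d_b, d_c)),
		setup = begin
			d_a = CuArray(round.(rand(Float32, ($m, $n)) * 100))
			d_b = CuArray(round.(rand(Float32, ($m, $n)) * 100))
			d_c = similar(d_a)
		end,
		teardown = (finalize(d_a); finalize(d_b); finalize(d_c))
	)
end
```

The trade-off: setup and teardown run once per sample, so allocating 5_000×5_000 arrays there makes each sample considerably more expensive than the kernel launch being measured.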

@maleadt maleadt changed the title Out of memory error when benchmarking at-benchmark captures GPU arrays Nov 22, 2018
@maleadt maleadt closed this as completed Nov 29, 2018
@maleadt
Member

maleadt commented Nov 29, 2018

Actually, didn't mean to close this. Fixed some bugs in JuliaGPU/CuArrays.jl#212 but the @benchmark issue is still there.

@maleadt maleadt reopened this Nov 29, 2018
@jrevels
Contributor

jrevels commented Nov 29, 2018

this seems unwanted, but somewhat expected?

Oof, yup. Right now, interpolated variables get closed over in the benchmarking harness here.

We should probably change this so that BenchmarkTools._run accepts a (splatted?) tuple of arguments that gets forwarded into the harness/kernel and bound to the interpolated variables. This way arguments can be passed in from top-level, which would likely yield better results than the current implementation (less brittle w.r.t. compiler optimizations) + make the implementation nicer (we could have a single harness definition instead of generating benchmark-specific ones, and probably factor out some other stuff as well).

For @btime/@benchmark, this would not be a breaking change. It would probably need to be breaking for @benchmarkable, though - that would have to now e.g. return the Benchmark + the tuple/NamedTuple for its _run arguments.

Anyway, good food for thought!
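For illustration only, the proposed interface might look roughly like this. This is a design sketch, not the actual BenchmarkTools API:

```julia
# Hypothetical shape of the proposal: @benchmarkable returns the Benchmark
# plus a NamedTuple of the interpolated values, instead of closing over them.
b, args = @benchmarkable kernel_vadd($d_a, $d_b, $d_c)
# args would be (d_a = ..., d_b = ..., d_c = ...)

# _run receives the values explicitly; once it returns, no closure keeps
# the arrays alive, so their finalizers can run.
results = BenchmarkTools._run(b, b.params, args...)
```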

@maleadt maleadt transferred this issue from JuliaGPU/CuArrays.jl May 27, 2020
@maleadt maleadt added bug Something isn't working cuda array Stuff about CuArray. upstream Somebody else's problem. labels May 27, 2020
@LaurentPlagne

Is there a link to the related BenchmarkTools issue?

@maleadt
Member

maleadt commented May 27, 2020

JuliaCI/BenchmarkTools.jl#127 maybe

@LaurentPlagne

Thanks!

@maleadt
Member

maleadt commented Apr 27, 2024

Closing this as it's an issue with BenchmarkTools, really.

@maleadt maleadt closed this as completed Apr 27, 2024