at-benchmark captures GPU arrays #156

Closed
mohamed82008 opened this issue Nov 19, 2018 · 13 comments
Labels
bug (Something isn't working), cuda array (Stuff about CuArray), upstream (Somebody else's problem)

Comments

@mohamed82008

The following code triggers this error:

ERROR: CUDA error: out of memory (code #2, ERROR_OUT_OF_MEMORY)
Stacktrace:
 [1] macro expansion at C:\Users\user\.julia\packages\CUDAdrv\LC5XS\src\base.jl:147 [inlined]
 [2] #alloc#3(::CUDAdrv.Mem.CUmem_attach, ::Function, ::Int64, ::Bool) at C:\Users\user\.julia\packages\CUDAdrv\LC5XS\src\memory.jl:161
 [3] alloc at C:\Users\user\.julia\packages\CUDAdrv\LC5XS\src\memory.jl:157 [inlined] (repeats 2 times)
 [4] CuArray{Float32,2}(::Tuple{Int64,Int64}) at C:\Users\user\.julia\packages\CUDAdrv\LC5XS\src\array.jl:33
 [5] similar at C:\Users\user\.julia\packages\CUDAdrv\LC5XS\src\array.jl:83 [inlined]
 [6] similar at .\abstractarray.jl:571 [inlined]
 [7] h_bench(::Int64, ::Int64) at .\REPL[4]:7
 [8] macro expansion at .\show.jl:555 [inlined]
 [9] top-level scope at .\REPL[6]:2 [inlined]
 [10] top-level scope at .\none:0
using CUDAdrv, CUDAnative, BenchmarkTools

function kernel_vadd(a, b, c)
    i = (blockIdx().x-1) * blockDim().x + threadIdx().x
    c[i] = a[i] + b[i]
    return nothing
end

function h(m, n)
	# CUDAdrv functionality: generate and upload data
	a = round.(rand(Float32, (m, n)) * 100)
	b = round.(rand(Float32, (m, n)) * 100)
	d_a = CuArray(a)
	d_b = CuArray(b)
	d_c = similar(d_a)  # output array

	@cuda threads=12 kernel_vadd(d_a, d_b, d_c)
end

function h_bench(m, n)
	# CUDAdrv functionality: generate and upload data
	a = round.(rand(Float32, (m, n)) * 100)
	b = round.(rand(Float32, (m, n)) * 100)
	d_a = CuArray(a)
	d_b = CuArray(b)
	d_c = similar(d_a)  # output array

	@benchmark @cuda threads=12 kernel_vadd($d_a, $d_b, $d_c)
end

# Works
for i in 1:5
	@show i
	h(10_000, 10_000)
end

# Errors after i = 3
for i in 1:5
	@show i
	h_bench(10_000, 10_000)
end
@maleadt
Member

maleadt commented Nov 22, 2018

Problem is that CUDAdrv.CuArray memory management is pretty simple: unsafe_free! is registered as a finalizer, but Julia offers no guarantee that finalizers run any time soon (the Julia GC does not know about GPU memory pressure). As a result, some of your intermediate arrays aren't freed. You can call finalize(d_...) manually, although that's a relatively costly call.

Alternatively, with CuArray from CuArrays.jl, the memory management is much more sophisticated and forces Julia GC collection when the GPU goes out of memory. Switching to that type (which you ought to anyway, since I've just deprecated CUDAdrv.CuArray) solves your problem and makes your example work 🙂
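The first suggestion can be sketched as follows. This is a hypothetical variant of the reporter's h_bench (the name h_bench_freed is invented here), which explicitly finalizes the device arrays once the benchmark result has been collected; it needs a CUDA-capable GPU to actually run:

```julia
using CUDAdrv, CUDAnative, BenchmarkTools

# Hypothetical sketch: eagerly release device memory after benchmarking
# instead of waiting for the GC to run the unsafe_free! finalizers.
function h_bench_freed(m, n)
	a = round.(rand(Float32, (m, n)) * 100)
	b = round.(rand(Float32, (m, n)) * 100)
	d_a = CuArray(a)
	d_b = CuArray(b)
	d_c = similar(d_a)  # output array

	result = @benchmark @cuda threads=12 kernel_vadd($d_a, $d_b, $d_c)

	# Run the finalizers explicitly so the GPU allocations are returned
	# immediately; relatively costly, but deterministic.
	finalize(d_a); finalize(d_b); finalize(d_c)
	return result
end
```

Note that, as the rest of the thread shows, @benchmark itself may still root the interpolated arrays, so explicit finalization alone is not guaranteed to release everything.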

@maleadt maleadt closed this as completed Nov 22, 2018
@mohamed82008
Author

I switched to CuArrays and am still running into this problem.

@maleadt
Member

maleadt commented Nov 22, 2018

Does it still happen after 5+3 iterations? How much memory does your GPU have?
Are you using the master branch of CuArrays?

@maleadt maleadt reopened this Nov 22, 2018
@mohamed82008
Author

CUDAnative, CuArrays, CUDAdrv, GPUArrays, CUDAapi, NNlib master branches and Julia:

julia> versioninfo()
Julia Version 1.0.1
Commit 0d713926f8 (2018-09-29 19:05 UTC)
Platform Info:
  OS: Linux (x86_64-pc-linux-gnu)
  CPU: Intel(R) Core(TM) i7-6700 CPU @ 3.40GHz
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-6.0.0 (ORCJIT, skylake)

@mohamed82008
Author

I used 5000 this time, since I think this machine has less memory, and it still fails after the 3rd iteration.

using CuArrays, CUDAnative, BenchmarkTools

function kernel_vadd(a, b, c)
    i = (blockIdx().x-1) * blockDim().x + threadIdx().x
    c[i] = a[i] + b[i]
    return nothing
end

function h(m, n)
	# CUDAdrv functionality: generate and upload data
	a = round.(rand(Float32, (m, n)) * 100)
	b = round.(rand(Float32, (m, n)) * 100)
	d_a = CuArray(a)
	d_b = CuArray(b)
	d_c = similar(d_a)  # output array

	@cuda threads=12 kernel_vadd(d_a, d_b, d_c)
end

function h_bench(m, n)
	# CUDAdrv functionality: generate and upload data
	a = round.(rand(Float32, (m, n)) * 100)
	b = round.(rand(Float32, (m, n)) * 100)
	d_a = CuArray(a)
	d_b = CuArray(b)
	d_c = similar(d_a)  # output array

	@benchmark @cuda threads=12 kernel_vadd($d_a, $d_b, $d_c)
end

# Works
for i in 1:5
	@show i
	h(5_000, 5_000)
end

# Errors after i = 3
for i in 1:5
	@show i
	h_bench(5_000, 5_000)
end

@maleadt maleadt transferred this issue from JuliaGPU/CUDAnative.jl Nov 22, 2018
@maleadt
Member

maleadt commented Nov 22, 2018

Confirmed. Another bug, see JuliaGPU/CuArrays.jl#169

@maleadt
Member

maleadt commented Nov 22, 2018

Ah, so apart from bugs, this is @benchmark capturing the arrays:

julia> mutable struct Foo
       bar::String
       end

julia> function main()
       x = Foo("whatever")
       finalizer(x) do x
         Core.println(x.bar)
       end
       nothing
     end
main (generic function with 1 method)

julia> main()

julia> GC.gc()
whatever

julia> function main()
       x = Foo("whatever")
       finalizer(x) do x
         Core.println(x.bar)
       end
       @benchmark $x
       nothing
     end
main (generic function with 1 method)

julia> main()

julia> GC.gc()

julia> GC.gc()

julia> GC.gc()

julia> GC.gc()

@jrevels, this seems unwanted, but somewhat expected? Are there workarounds?
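One possible workaround sketch, using BenchmarkTools' setup/teardown keywords: allocate the device arrays inside setup and free them in teardown, so @benchmark never closes over long-lived arrays. The name h_bench_setup is hypothetical, and this needs a CUDA GPU to run:

```julia
using CuArrays, CUDAnative, BenchmarkTools

# Hypothetical workaround: arrays are created per sample in `setup` and
# finalized per sample in `teardown`, so nothing outlives the benchmark.
function h_bench_setup(m, n)
	@benchmark(
		@cuda(threads=12, kernel_vadd(d_a, d_b, d_c)),
		setup = begin
			d_a = CuArray(round.(rand(Float32, ($m, $n)) * 100))
			d_b = CuArray(round.(rand(Float32, ($m, $n)) * 100))
			d_c = similar(d_a)
		end,
		teardown = (finalize(d_a); finalize(d_b); finalize(d_c))
	)
end
```

The trade-off: setup and teardown run once per sample, so allocating 5_000×5_000 arrays there makes each sample considerably more expensive than the kernel launch being measured.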

@maleadt maleadt changed the title Out of memory error when benchmarking at-benchmark captures GPU arrays Nov 22, 2018
@maleadt maleadt closed this as completed Nov 29, 2018
@maleadt
Member

maleadt commented Nov 29, 2018

Actually, didn't mean to close this. Fixed some bugs in JuliaGPU/CuArrays.jl#212 but the @benchmark issue is still there.

@maleadt maleadt reopened this Nov 29, 2018
@jrevels
Contributor

jrevels commented Nov 29, 2018

this seems unwanted, but somewhat expected?

Oof, yup. Right now, interpolated variables get closed over in the benchmarking harness here.

We should probably change this so that BenchmarkTools._run accepts a (splatted?) tuple of arguments that gets forwarded into the harness/kernel and bound to the interpolated variables. This way arguments can be passed in from top-level, which would likely yield better results than the current implementation (less brittle w.r.t. compiler optimizations) + make the implementation nicer (we could have a single harness definition instead of generating benchmark-specific ones, and probably factor out some other stuff as well).

For @btime/@benchmark, this would not be a breaking change. It would probably need to be breaking for @benchmarkable, though - that would have to now e.g. return the Benchmark + the tuple/NamedTuple for its _run arguments.

Anyway, good food for thought!
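For illustration only, the proposed interface might look roughly like this. This is a design sketch, not the actual BenchmarkTools API:

```julia
# Hypothetical shape of the proposal: @benchmarkable returns the Benchmark
# plus a NamedTuple of the interpolated values, instead of closing over them.
b, args = @benchmarkable kernel_vadd($d_a, $d_b, $d_c)
# args would be (d_a = ..., d_b = ..., d_c = ...)

# _run receives the values explicitly; once it returns, no closure keeps
# the arrays alive, so their finalizers can run.
results = BenchmarkTools._run(b, b.params, args...)
```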

@maleadt maleadt transferred this issue from JuliaGPU/CuArrays.jl May 27, 2020
@maleadt maleadt added bug Something isn't working cuda array Stuff about CuArray. upstream Somebody else's problem. labels May 27, 2020
@LaurentPlagne

Is there a link to the related BenchmarkTools issue?

@maleadt
Member

maleadt commented May 27, 2020

JuliaCI/BenchmarkTools.jl#127 maybe

@LaurentPlagne

Thanks!

@maleadt
Member

maleadt commented Apr 27, 2024

Closing this as it's an issue with BenchmarkTools, really.

@maleadt maleadt closed this as completed Apr 27, 2024