
Excessive allocations when running on multiple threads #1429

Closed
mfalt opened this issue Mar 3, 2022 · 6 comments
Labels
bug Something isn't working

Comments


mfalt commented Mar 3, 2022

I'm trying to run a neural network (using Flux) on a production server. I have no problems when running everything on one thread, but GPU memory usage starts increasing once I put it behind a Mux.jl server, and eventually I get a CUDA out-of-memory error (or another CUDA-related error). This happens regardless of whether I run the requests serially or in parallel, even with a lock around the GPU computations (a sketch of that variant is included at the end of this comment).

The example below runs without problems when Threads.@spawn is removed, never allocating more than 1367 MiB on the GPU. However, when the computations are run on a separate thread as shown here, memory keeps getting allocated on every call until the program eventually crashes.

MWE:

using CUDA

import CUDA.CUDNN: cudnnConvolutionForward

const W1 = cu(randn(5,5,3,6))   # convolution weights, kept on the GPU

function inference(imgs)
    # run the convolution on the GPU, then reduce the result on the CPU
    out = cudnnConvolutionForward(W1, imgs)
    return maximum(Array(out))
end

imgs = cu(randn(28,28,3,1000))

N = 100
# Run this a few times
for j in 1:10
    res = Any[nothing for i in 1:N]
    for i in 1:N
        # spawn the GPU work on a worker thread and block until it finishes
        res[i] = fetch(Threads.@spawn inference(imgs))
    end
    s = sum(res)
    println(s)
end
julia> ERROR: LoadError: TaskFailedException
Stacktrace:
 [1] wait
   @ ./task.jl:334 [inlined]
 [2] fetch(t::Task)
   @ Base ./task.jl:349
 [3] top-level scope
   @ ~/cudatest/cuda2.jl:18

    nested task error: CUDNNError: CUDNN_STATUS_INTERNAL_ERROR (code 4)
    Stacktrace:
      [1] throw_api_error(res::CUDA.CUDNN.cudnnStatus_t)
        @ CUDA.CUDNN ~/.julia/packages/CUDA/Axzxe/lib/cudnn/error.jl:22
      [2] macro expansion
        @ ~/.julia/packages/CUDA/Axzxe/lib/cudnn/error.jl:35 [inlined]
      [3] cudnnCreate()
        @ CUDA.CUDNN ~/.julia/packages/CUDA/Axzxe/lib/cudnn/base.jl:3
      [4] #1218
        @ ~/.julia/packages/CUDA/Axzxe/lib/cudnn/CUDNN.jl:72 [inlined]
      [5] (::CUDA.APIUtils.var"#8#11"{CUDA.CUDNN.var"#1218#1225", CUDA.APIUtils.HandleCache{CuContext, Ptr{Nothing}}, CuContext})()
        @ CUDA.APIUtils ~/.julia/packages/CUDA/Axzxe/lib/utils/cache.jl:24
      [6] lock(f::CUDA.APIUtils.var"#8#11"{CUDA.CUDNN.var"#1218#1225", CUDA.APIUtils.HandleCache{CuContext, Ptr{Nothing}}, CuContext}, l::ReentrantLock)
        @ Base ./lock.jl:190
      [7] (::CUDA.APIUtils.var"#check_cache#9"{CUDA.APIUtils.HandleCache{CuContext, Ptr{Nothing}}, CuContext})(f::CUDA.CUDNN.var"#1218#1225")
        @ CUDA.APIUtils ~/.julia/packages/CUDA/Axzxe/lib/utils/cache.jl:22
      [8] pop!
        @ ~/.julia/packages/CUDA/Axzxe/lib/utils/cache.jl:46 [inlined]
      [9] (::CUDA.CUDNN.var"#new_state#1224")(cuda::NamedTuple{(:device, :context, :stream, :math_mode, :math_precision), Tuple{CuDevice, CuContext, CuStream, CUDA.MathMode, Symbol}})
        @ CUDA.CUDNN ~/.julia/packages/CUDA/Axzxe/lib/cudnn/CUDNN.jl:71
     [10] #1222
        @ ~/.julia/packages/CUDA/Axzxe/lib/cudnn/CUDNN.jl:88 [inlined]
     [11] get!(default::CUDA.CUDNN.var"#1222#1229"{CUDA.CUDNN.var"#new_state#1224", NamedTuple{(:device, :context, :stream, :math_mode, :math_precision), Tuple{CuDevice, CuContext, CuStream, CUDA.MathMode, Symbol}}}, h::Dict{CuContext, NamedTuple{(:handle, :stream), Tuple{Ptr{Nothing}, CuStream}}}, key::CuContext)
        @ Base ./dict.jl:464
     [12] handle()
        @ CUDA.CUDNN ~/.julia/packages/CUDA/Axzxe/lib/cudnn/CUDNN.jl:87
     [13] (::CUDA.CUDNN.var"#1145#1147"{CuArray{Float32, 4, CUDA.Mem.DeviceBuffer}, CUDA.CUDNN.cudnnActivationMode_t, CUDA.CUDNN.cudnnConvolutionDescriptor, CUDA.CUDNN.cudnnFilterDescriptor, CUDA.CUDNN.cudnnTensorDescriptor, CUDA.CUDNN.cudnnTensorDescriptor, Base.RefValue{Float32}, Base.RefValue{Float32}, CuArray{Float32, 4, CUDA.Mem.DeviceBuffer}, CuArray{Float32, 4, CUDA.Mem.DeviceBuffer}, CUDA.CUDNN.cudnnConvolutionFwdAlgoPerfStruct})(workspace::CuArray{UInt8, 1, CUDA.Mem.DeviceBuffer})
        @ CUDA.CUDNN ~/.julia/packages/CUDA/Axzxe/lib/cudnn/convolution.jl:105
     [14] with_workspace(f::CUDA.CUDNN.var"#1145#1147"{CuArray{Float32, 4, CUDA.Mem.DeviceBuffer}, CUDA.CUDNN.cudnnActivationMode_t, CUDA.CUDNN.cudnnConvolutionDescriptor, CUDA.CUDNN.cudnnFilterDescriptor, CUDA.CUDNN.cudnnTensorDescriptor, CUDA.CUDNN.cudnnTensorDescriptor, Base.RefValue{Float32}, Base.RefValue{Float32}, CuArray{Float32, 4, CUDA.Mem.DeviceBuffer}, CuArray{Float32, 4, CUDA.Mem.DeviceBuffer}, CUDA.CUDNN.cudnnConvolutionFwdAlgoPerfStruct}, eltyp::Type{UInt8}, size::CUDA.APIUtils.var"#2#3"{UInt64}, fallback::Nothing; keep::Bool)
        @ CUDA.APIUtils ~/.julia/packages/CUDA/Axzxe/lib/utils/call.jl:77
     [15] with_workspace
        @ ~/.julia/packages/CUDA/Axzxe/lib/utils/call.jl:58 [inlined]
     [16] #with_workspace#1
        @ ~/.julia/packages/CUDA/Axzxe/lib/utils/call.jl:53 [inlined]
     [17] with_workspace (repeats 2 times)
        @ ~/.julia/packages/CUDA/Axzxe/lib/utils/call.jl:53 [inlined]
     [18] cudnnConvolutionForwardAD(w::CuArray{Float32, 4, CUDA.Mem.DeviceBuffer}, x::CuArray{Float32, 4, CUDA.Mem.DeviceBuffer}, bias::Nothing, z::Nothing; y::CuArray{Float32, 4, CUDA.Mem.DeviceBuffer}, activation::CUDA.CUDNN.cudnnActivationMode_t, convDesc::CUDA.CUDNN.cudnnConvolutionDescriptor, wDesc::CUDA.CUDNN.cudnnFilterDescriptor, xDesc::CUDA.CUDNN.cudnnTensorDescriptor, yDesc::CUDA.CUDNN.cudnnTensorDescriptor, zDesc::Nothing, biasDesc::Nothing, alpha::Base.RefValue{Float32}, beta::Base.RefValue{Float32}, dw::Base.RefValue{Any}, dx::Base.RefValue{Any}, dz::Base.RefValue{Any}, dbias::Base.RefValue{Any}, dready::Base.RefValue{Bool})
        @ CUDA.CUDNN ~/.julia/packages/CUDA/Axzxe/lib/cudnn/convolution.jl:103
     [19] cudnnConvolutionForwardWithDefaults(w::CuArray{Float32, 4, CUDA.Mem.DeviceBuffer}, x::CuArray{Float32, 4, CUDA.Mem.DeviceBuffer}; padding::Int64, stride::Int64, dilation::Int64, mode::CUDA.CUDNN.cudnnConvolutionMode_t, mathType::CUDA.CUDNN.cudnnMathType_t, reorderType::CUDA.CUDNN.cudnnReorderType_t, group::Int64, format::CUDA.CUDNN.cudnnTensorFormat_t, convDesc::CUDA.CUDNN.cudnnConvolutionDescriptor, xDesc::CUDA.CUDNN.cudnnTensorDescriptor, wDesc::CUDA.CUDNN.cudnnFilterDescriptor, y::CuArray{Float32, 4, CUDA.Mem.DeviceBuffer}, yDesc::CUDA.CUDNN.cudnnTensorDescriptor, alpha::Int64, beta::Int64, bias::Nothing, z::Nothing, biasDesc::Nothing, zDesc::Nothing, activation::CUDA.CUDNN.cudnnActivationMode_t, dw::Base.RefValue{Any}, dx::Base.RefValue{Any}, dz::Base.RefValue{Any}, dbias::Base.RefValue{Any})
        @ CUDA.CUDNN ~/.julia/packages/CUDA/Axzxe/lib/cudnn/convolution.jl:96
     [20] cudnnConvolutionForwardWithDefaults(w::CuArray{Float32, 4, CUDA.Mem.DeviceBuffer}, x::CuArray{Float32, 4, CUDA.Mem.DeviceBuffer})
        @ CUDA.CUDNN ~/.julia/packages/CUDA/Axzxe/lib/cudnn/convolution.jl:92
     [21] #cudnnConvolutionForward#1139
        @ ~/.julia/packages/CUDA/Axzxe/lib/cudnn/convolution.jl:50 [inlined]
     [22] cudnnConvolutionForward
        @ ~/.julia/packages/CUDA/Axzxe/lib/cudnn/convolution.jl:50 [inlined]
     [23] inference(imgs::CuArray{Float32, 4, CUDA.Mem.DeviceBuffer})
        @ Main ~/cudatest/cuda2.jl:8
     [24] (::var"#11#12")()
        @ Main ./threadingconstructs.jl:178
in expression starting at /home/mattias/cudatest/cuda2.jl:15

Running on a new project with only CUDA v3.8.3 as a dependency.
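
A fresh environment like that can be set up roughly as follows (a sketch; the project path is illustrative):

using Pkg
Pkg.activate("cudatest")                 # new, empty project; directory name is illustrative
Pkg.add(name="CUDA", version="3.8.3")    # the only direct dependency used by the MWE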

julia> CUDA.versioninfo()
CUDA toolkit 11.6, artifact installation
NVIDIA driver 470.103.1, for CUDA 11.4
CUDA driver 11.4

Libraries: 
- CUBLAS: 11.8.1
- CURAND: 10.2.9
- CUFFT: 10.7.0
- CUSOLVER: 11.3.2
- CUSPARSE: 11.7.1
- CUPTI: 16.0.0
- NVML: 11.0.0+470.103.1
- CUDNN: 8.30.2 (for CUDA 11.5.0)
- CUTENSOR: 1.4.0 (for CUDA 11.5.0)

Toolchain:
- Julia: 1.7.1
- LLVM: 12.0.1
- PTX ISA support: 3.2, 4.0, 4.1, 4.2, 4.3, 5.0, 6.0, 6.1, 6.3, 6.4, 6.5, 7.0
- Device capability support: sm_35, sm_37, sm_50, sm_52, sm_53, sm_60, sm_61, sm_62, sm_70, sm_72, sm_75, sm_80

1 device:
  0: NVIDIA GeForce GTX 1080 Ti (sm_61, 8.204 GiB / 10.913 GiB available)

Before running the loop for the first time:

julia> CUDA.memory_status()
Effective GPU memory usage: 2.03% (226.625 MiB/10.913 GiB)
Memory pool usage: 8.974 MiB (32.000 MiB reserved)

After 1:

Effective GPU memory usage: 37.47% (4.090 GiB/10.913 GiB)
Memory pool usage: 22.157 MiB (928.000 MiB reserved)

After 2:

Effective GPU memory usage: 62.13% (6.780 GiB/10.913 GiB)
Memory pool usage: 22.157 MiB (928.000 MiB reserved)

After 3:

Effective GPU memory usage: 86.83% (9.476 GiB/10.913 GiB)
Memory pool usage: 22.157 MiB (928.000 MiB reserved)

After 4 (and crash):

Effective GPU memory usage: 24.95% (2.723 GiB/10.913 GiB)
Memory pool usage: 22.157 MiB (64.000 MiB reserved)
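
The lock-guarded variant mentioned at the top of this report is not shown above; a minimal sketch of what it might look like follows (the names GPU_LOCK and locked_inference are illustrative, not from the issue), and even this pattern did not prevent the memory growth:

# Illustrative only: serialize all GPU work behind a single lock.
const GPU_LOCK = ReentrantLock()

function locked_inference(imgs)
    lock(GPU_LOCK) do
        inference(imgs)   # `inference` as defined in the MWE above
    end
end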
mfalt added the bug label Mar 3, 2022

mfalt commented Mar 3, 2022

The problem definitely seems to be related to how memory is allocated on different threads. By running on the same set of threads every time, the problem seems to disappear. I am not sure how useful this workaround is for my actual use case, but it might give some insight into the problem.

using CUDA

import CUDA.CUDNN: cudnnConvolutionForward

const W1 = cu(randn(5,5,3,6))

function inference(imgs)
    out = cudnnConvolutionForward(W1, imgs)
    return maximum(Array(out))
end

# Channel to put work on
const RecieverChannel = Channel{Tuple{CuArray{Float32, 4, CUDA.Mem.DeviceBuffer},Channel{Float32}}}()
function listen_channel()
    while true
        imgs, res_channel = take!(RecieverChannel)
        res = inference(imgs)
        put!(res_channel, res)
    end
end
# Create N tasks to do the GPU computations
tsks = [Threads.@spawn listen_channel() for i in 1:2]

# Ask one of the Thread-workers to do the job
function inference_caller(imgs)
    res_channel = Channel{Float32}()
    put!(RecieverChannel, (imgs,res_channel))
    take!(res_channel)
end

imgs = cu(randn(28,28,3,1000))

N = 100
for k in 1:20
    for j in 1:10
        res = Any[nothing for i in 1:N]
        for i in 1:N # We spawn a lot of work
            res[i] = Threads.@spawn inference_caller(imgs)
        end
        s = sum(fetch.(res))
        println(s)
    end
end


maleadt commented Mar 4, 2022

Hmm, I cannot reproduce. On Linux, CUDA.jl#master, Julia 1.7.2 with 32 threads, trying your original example:

julia> main()
4641.488
4641.488
4641.488
4641.488
4641.488
4641.488
4641.488
4641.488
4641.488
4641.488

julia> CUDA.memory_status()
Effective GPU memory usage: 12.74% (1.880 GiB/14.751 GiB)
Memory pool usage: 22.157 MiB (928.000 MiB reserved)
julia> main()
4641.488
4641.488
4641.488
4641.488
4641.488
4641.488
4641.488
4641.488
4641.488
4641.488

julia> CUDA.memory_status()
Effective GPU memory usage: 13.13% (1.937 GiB/14.751 GiB)
Memory pool usage: 22.157 MiB (928.000 MiB reserved)
julia> main()
4641.488
4641.488
4641.488
4641.488
4641.488
4641.488
4641.488
4641.488
4641.488
4641.488

julia> CUDA.memory_status()
Effective GPU memory usage: 13.40% (1.976 GiB/14.751 GiB)
Memory pool usage: 22.157 MiB (928.000 MiB reserved)
julia> main()
4641.488
4641.488
4641.488
4641.488
4641.488
4641.488
4641.488
4641.488
4641.488
4641.488

julia> CUDA.memory_status()
Effective GPU memory usage: 13.52% (1.995 GiB/14.751 GiB)
Memory pool usage: 22.157 MiB (928.000 MiB reserved)
julia> main()
4641.488
4641.488
4641.488
4641.488
4641.488
4641.488
4641.488
4641.488
4641.488
4641.488

Running it in the REPL, i.e. not in a main() function, doesn't seem to matter.
Can you try CUDA.jl#master?
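
For reference, one way to point a throwaway environment at the development version (a sketch using the standard Pkg API):

using Pkg
Pkg.add(name="CUDA", rev="master")   # track the master branch instead of a registered release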


mfalt commented Mar 8, 2022

The problem was not solved by changing CUDA.jl versions or system drivers. However, it does not seem to be reproducible on other machines; I tried it myself on AWS without any problems.


maleadt commented Mar 8, 2022

Very strange. Are you using the same Julia binaries everywhere? And the same number of threads when launching Julia?
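
(For reference, the thread count can be set at launch and checked inside the session with standard Julia tooling, nothing CUDA-specific:)

# Launch with an explicit thread count, e.g.:
#   julia --threads=8 script.jl
# Check inside the running session:
Threads.nthreads()   # number of threads this session was started with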


mfalt commented Mar 8, 2022

Yes, I have tried several different configurations, but I am only able to reproduce it on that machine. I am not sure whether it is specific to the GPU model or even related to that specific piece of hardware.


maleadt commented Apr 26, 2024

It's been a while since there's been activity here, so I'm going to close this. If this still happens on latest master, don't hesitate to open a new issue with an updated MWE.

maleadt closed this as completed Apr 26, 2024