
Excessive allocations when running on multiple threads #1429

Closed
mfalt opened this issue Mar 3, 2022 · 6 comments
Labels
bug Something isn't working

Comments


mfalt commented Mar 3, 2022

I'm trying to run a neural network (using Flux) on a production server. I have no problems when running everything on one thread, but GPU memory usage starts increasing once I put it behind a Mux.jl server, and eventually I get a CUDA out-of-memory error (or another CUDA-related error). This happens regardless of whether I run the requests serially or in parallel, even with a lock around the GPU computations (a sketch of that variant is included at the end of this comment).

The example below runs without problems when Threads.@spawn is removed, never allocating more than 1367 MiB on the GPU. However, when the computations are run on a separate thread as shown here, memory keeps getting allocated on every call until the program eventually crashes.

MWE:

using CUDA

import CUDA.CUDNN: cudnnConvolutionForward

const W1 = cu(randn(5,5,3,6))   # convolution weights, kept on the GPU

function inference(imgs)
    # run the convolution on the GPU, then reduce the result on the CPU
    out = cudnnConvolutionForward(W1, imgs)
    return maximum(Array(out))
end

imgs = cu(randn(28,28,3,1000))

N = 100
# Run this a few times
for j in 1:10
    res = Any[nothing for i in 1:N]
    for i in 1:N
        # spawn the GPU work on a worker thread and block until it finishes
        res[i] = fetch(Threads.@spawn inference(imgs))
    end
    s = sum(res)
    println(s)
end
julia> ERROR: LoadError: TaskFailedException
Stacktrace:
 [1] wait
   @ ./task.jl:334 [inlined]
 [2] fetch(t::Task)
   @ Base ./task.jl:349
 [3] top-level scope
   @ ~/cudatest/cuda2.jl:18

    nested task error: CUDNNError: CUDNN_STATUS_INTERNAL_ERROR (code 4)
    Stacktrace:
      [1] throw_api_error(res::CUDA.CUDNN.cudnnStatus_t)
        @ CUDA.CUDNN ~/.julia/packages/CUDA/Axzxe/lib/cudnn/error.jl:22
      [2] macro expansion
        @ ~/.julia/packages/CUDA/Axzxe/lib/cudnn/error.jl:35 [inlined]
      [3] cudnnCreate()
        @ CUDA.CUDNN ~/.julia/packages/CUDA/Axzxe/lib/cudnn/base.jl:3
      [4] #1218
        @ ~/.julia/packages/CUDA/Axzxe/lib/cudnn/CUDNN.jl:72 [inlined]
      [5] (::CUDA.APIUtils.var"#8#11"{CUDA.CUDNN.var"#1218#1225", CUDA.APIUtils.HandleCache{CuContext, Ptr{Nothing}}, CuContext})()
        @ CUDA.APIUtils ~/.julia/packages/CUDA/Axzxe/lib/utils/cache.jl:24
      [6] lock(f::CUDA.APIUtils.var"#8#11"{CUDA.CUDNN.var"#1218#1225", CUDA.APIUtils.HandleCache{CuContext, Ptr{Nothing}}, CuContext}, l::ReentrantLock)
        @ Base ./lock.jl:190
      [7] (::CUDA.APIUtils.var"#check_cache#9"{CUDA.APIUtils.HandleCache{CuContext, Ptr{Nothing}}, CuContext})(f::CUDA.CUDNN.var"#1218#1225")
        @ CUDA.APIUtils ~/.julia/packages/CUDA/Axzxe/lib/utils/cache.jl:22
      [8] pop!
        @ ~/.julia/packages/CUDA/Axzxe/lib/utils/cache.jl:46 [inlined]
      [9] (::CUDA.CUDNN.var"#new_state#1224")(cuda::NamedTuple{(:device, :context, :stream, :math_mode, :math_precision), Tuple{CuDevice, CuContext, CuStream, CUDA.MathMode, Symbol}})
        @ CUDA.CUDNN ~/.julia/packages/CUDA/Axzxe/lib/cudnn/CUDNN.jl:71
     [10] #1222
        @ ~/.julia/packages/CUDA/Axzxe/lib/cudnn/CUDNN.jl:88 [inlined]
     [11] get!(default::CUDA.CUDNN.var"#1222#1229"{CUDA.CUDNN.var"#new_state#1224", NamedTuple{(:device, :context, :stream, :math_mode, :math_precision), Tuple{CuDevice, CuContext, CuStream, CUDA.MathMode, Symbol}}}, h::Dict{CuContext, NamedTuple{(:handle, :stream), Tuple{Ptr{Nothing}, CuStream}}}, key::CuContext)
        @ Base ./dict.jl:464
     [12] handle()
        @ CUDA.CUDNN ~/.julia/packages/CUDA/Axzxe/lib/cudnn/CUDNN.jl:87
     [13] (::CUDA.CUDNN.var"#1145#1147"{CuArray{Float32, 4, CUDA.Mem.DeviceBuffer}, CUDA.CUDNN.cudnnActivationMode_t, CUDA.CUDNN.cudnnConvolutionDescriptor, CUDA.CUDNN.cudnnFilterDescriptor, CUDA.CUDNN.cudnnTensorDescriptor, CUDA.CUDNN.cudnnTensorDescriptor, Base.RefValue{Float32}, Base.RefValue{Float32}, CuArray{Float32, 4, CUDA.Mem.DeviceBuffer}, CuArray{Float32, 4, CUDA.Mem.DeviceBuffer}, CUDA.CUDNN.cudnnConvolutionFwdAlgoPerfStruct})(workspace::CuArray{UInt8, 1, CUDA.Mem.DeviceBuffer})
        @ CUDA.CUDNN ~/.julia/packages/CUDA/Axzxe/lib/cudnn/convolution.jl:105
     [14] with_workspace(f::CUDA.CUDNN.var"#1145#1147"{CuArray{Float32, 4, CUDA.Mem.DeviceBuffer}, CUDA.CUDNN.cudnnActivationMode_t, CUDA.CUDNN.cudnnConvolutionDescriptor, CUDA.CUDNN.cudnnFilterDescriptor, CUDA.CUDNN.cudnnTensorDescriptor, CUDA.CUDNN.cudnnTensorDescriptor, Base.RefValue{Float32}, Base.RefValue{Float32}, CuArray{Float32, 4, CUDA.Mem.DeviceBuffer}, CuArray{Float32, 4, CUDA.Mem.DeviceBuffer}, CUDA.CUDNN.cudnnConvolutionFwdAlgoPerfStruct}, eltyp::Type{UInt8}, size::CUDA.APIUtils.var"#2#3"{UInt64}, fallback::Nothing; keep::Bool)
        @ CUDA.APIUtils ~/.julia/packages/CUDA/Axzxe/lib/utils/call.jl:77
     [15] with_workspace
        @ ~/.julia/packages/CUDA/Axzxe/lib/utils/call.jl:58 [inlined]
     [16] #with_workspace#1
        @ ~/.julia/packages/CUDA/Axzxe/lib/utils/call.jl:53 [inlined]
     [17] with_workspace (repeats 2 times)
        @ ~/.julia/packages/CUDA/Axzxe/lib/utils/call.jl:53 [inlined]
     [18] cudnnConvolutionForwardAD(w::CuArray{Float32, 4, CUDA.Mem.DeviceBuffer}, x::CuArray{Float32, 4, CUDA.Mem.DeviceBuffer}, bias::Nothing, z::Nothing; y::CuArray{Float32, 4, CUDA.Mem.DeviceBuffer}, activation::CUDA.CUDNN.cudnnActivationMode_t, convDesc::CUDA.CUDNN.cudnnConvolutionDescriptor, wDesc::CUDA.CUDNN.cudnnFilterDescriptor, xDesc::CUDA.CUDNN.cudnnTensorDescriptor, yDesc::CUDA.CUDNN.cudnnTensorDescriptor, zDesc::Nothing, biasDesc::Nothing, alpha::Base.RefValue{Float32}, beta::Base.RefValue{Float32}, dw::Base.RefValue{Any}, dx::Base.RefValue{Any}, dz::Base.RefValue{Any}, dbias::Base.RefValue{Any}, dready::Base.RefValue{Bool})
        @ CUDA.CUDNN ~/.julia/packages/CUDA/Axzxe/lib/cudnn/convolution.jl:103
     [19] cudnnConvolutionForwardWithDefaults(w::CuArray{Float32, 4, CUDA.Mem.DeviceBuffer}, x::CuArray{Float32, 4, CUDA.Mem.DeviceBuffer}; padding::Int64, stride::Int64, dilation::Int64, mode::CUDA.CUDNN.cudnnConvolutionMode_t, mathType::CUDA.CUDNN.cudnnMathType_t, reorderType::CUDA.CUDNN.cudnnReorderType_t, group::Int64, format::CUDA.CUDNN.cudnnTensorFormat_t, convDesc::CUDA.CUDNN.cudnnConvolutionDescriptor, xDesc::CUDA.CUDNN.cudnnTensorDescriptor, wDesc::CUDA.CUDNN.cudnnFilterDescriptor, y::CuArray{Float32, 4, CUDA.Mem.DeviceBuffer}, yDesc::CUDA.CUDNN.cudnnTensorDescriptor, alpha::Int64, beta::Int64, bias::Nothing, z::Nothing, biasDesc::Nothing, zDesc::Nothing, activation::CUDA.CUDNN.cudnnActivationMode_t, dw::Base.RefValue{Any}, dx::Base.RefValue{Any}, dz::Base.RefValue{Any}, dbias::Base.RefValue{Any})
        @ CUDA.CUDNN ~/.julia/packages/CUDA/Axzxe/lib/cudnn/convolution.jl:96
     [20] cudnnConvolutionForwardWithDefaults(w::CuArray{Float32, 4, CUDA.Mem.DeviceBuffer}, x::CuArray{Float32, 4, CUDA.Mem.DeviceBuffer})
        @ CUDA.CUDNN ~/.julia/packages/CUDA/Axzxe/lib/cudnn/convolution.jl:92
     [21] #cudnnConvolutionForward#1139
        @ ~/.julia/packages/CUDA/Axzxe/lib/cudnn/convolution.jl:50 [inlined]
     [22] cudnnConvolutionForward
        @ ~/.julia/packages/CUDA/Axzxe/lib/cudnn/convolution.jl:50 [inlined]
     [23] inference(imgs::CuArray{Float32, 4, CUDA.Mem.DeviceBuffer})
        @ Main ~/cudatest/cuda2.jl:8
     [24] (::var"#11#12")()
        @ Main ./threadingconstructs.jl:178
in expression starting at /home/mattias/cudatest/cuda2.jl:15

Running on a new project with only CUDA v3.8.3 as a dependency.
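
A fresh environment like that can be set up roughly as follows (a sketch; the project path is illustrative):

using Pkg
Pkg.activate("cudatest")                 # new, empty project; directory name is illustrative
Pkg.add(name="CUDA", version="3.8.3")    # the only direct dependency used by the MWE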

julia> CUDA.versioninfo()
CUDA toolkit 11.6, artifact installation
NVIDIA driver 470.103.1, for CUDA 11.4
CUDA driver 11.4

Libraries: 
- CUBLAS: 11.8.1
- CURAND: 10.2.9
- CUFFT: 10.7.0
- CUSOLVER: 11.3.2
- CUSPARSE: 11.7.1
- CUPTI: 16.0.0
- NVML: 11.0.0+470.103.1
- CUDNN: 8.30.2 (for CUDA 11.5.0)
- CUTENSOR: 1.4.0 (for CUDA 11.5.0)

Toolchain:
- Julia: 1.7.1
- LLVM: 12.0.1
- PTX ISA support: 3.2, 4.0, 4.1, 4.2, 4.3, 5.0, 6.0, 6.1, 6.3, 6.4, 6.5, 7.0
- Device capability support: sm_35, sm_37, sm_50, sm_52, sm_53, sm_60, sm_61, sm_62, sm_70, sm_72, sm_75, sm_80

1 device:
  0: NVIDIA GeForce GTX 1080 Ti (sm_61, 8.204 GiB / 10.913 GiB available)

Before running the loop for the first time:

julia> CUDA.memory_status()
Effective GPU memory usage: 2.03% (226.625 MiB/10.913 GiB)
Memory pool usage: 8.974 MiB (32.000 MiB reserved)

After 1:

Effective GPU memory usage: 37.47% (4.090 GiB/10.913 GiB)
Memory pool usage: 22.157 MiB (928.000 MiB reserved)

After 2:

Effective GPU memory usage: 62.13% (6.780 GiB/10.913 GiB)
Memory pool usage: 22.157 MiB (928.000 MiB reserved)

After 3:

Effective GPU memory usage: 86.83% (9.476 GiB/10.913 GiB)
Memory pool usage: 22.157 MiB (928.000 MiB reserved)

After 4 (and crash):

Effective GPU memory usage: 24.95% (2.723 GiB/10.913 GiB)
Memory pool usage: 22.157 MiB (64.000 MiB reserved)
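
The lock-guarded variant mentioned at the top of this report is not shown above; a minimal sketch of what it might look like follows (the names GPU_LOCK and locked_inference are illustrative, not from the issue), and even this pattern did not prevent the memory growth:

# Illustrative only: serialize all GPU work behind a single lock.
const GPU_LOCK = ReentrantLock()

function locked_inference(imgs)
    lock(GPU_LOCK) do
        inference(imgs)   # `inference` as defined in the MWE above
    end
end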
mfalt added the bug label Mar 3, 2022

mfalt commented Mar 3, 2022

The problem definitely seems to be related to how memory is allocated on different threads. By running on the same set of threads every time, the problem seems to disappear. I am not sure how useful this workaround is for my actual use case, but it might give some insight into the problem.

using CUDA

import CUDA.CUDNN: cudnnConvolutionForward

const W1 = cu(randn(5,5,3,6))

function inference(imgs)
    out = cudnnConvolutionForward(W1, imgs)
    return maximum(Array(out))
end

# Channel to put work on
const RecieverChannel = Channel{Tuple{CuArray{Float32, 4, CUDA.Mem.DeviceBuffer},Channel{Float32}}}()
function listen_channel()
    while true
        imgs, res_channel = take!(RecieverChannel)
        res = inference(imgs)
        put!(res_channel, res)
    end
end
# Create N tasks to do the GPU computations
tsks = [Threads.@spawn listen_channel() for i in 1:2]

# Ask one of the Thread-workers to do the job
function inference_caller(imgs)
    res_channel = Channel{Float32}()
    put!(RecieverChannel, (imgs,res_channel))
    take!(res_channel)
end

imgs = cu(randn(28,28,3,1000))

N = 100
for k in 1:20
    for j in 1:10
        res = Any[nothing for i in 1:N]
        for i in 1:N # We spawn a lot of work
            res[i] = Threads.@spawn inference_caller(imgs)
        end
        s = sum(fetch.(res))
        println(s)
    end
end


maleadt commented Mar 4, 2022

Hmm, I cannot reproduce. On Linux, CUDA.jl#master, Julia 1.7.2 with 32 threads, trying your original example:

julia> main()
4641.488
4641.488
4641.488
4641.488
4641.488
4641.488
4641.488
4641.488
4641.488
4641.488

julia> CUDA.memory_status()
Effective GPU memory usage: 12.74% (1.880 GiB/14.751 GiB)
Memory pool usage: 22.157 MiB (928.000 MiB reserved)
julia> main()
4641.488
4641.488
4641.488
4641.488
4641.488
4641.488
4641.488
4641.488
4641.488
4641.488

julia> CUDA.memory_status()
Effective GPU memory usage: 13.13% (1.937 GiB/14.751 GiB)
Memory pool usage: 22.157 MiB (928.000 MiB reserved)
julia> main()
4641.488
4641.488
4641.488
4641.488
4641.488
4641.488
4641.488
4641.488
4641.488
4641.488

julia> CUDA.memory_status()
Effective GPU memory usage: 13.40% (1.976 GiB/14.751 GiB)
Memory pool usage: 22.157 MiB (928.000 MiB reserved)
julia> main()
4641.488
4641.488
4641.488
4641.488
4641.488
4641.488
4641.488
4641.488
4641.488
4641.488

julia> CUDA.memory_status()
Effective GPU memory usage: 13.52% (1.995 GiB/14.751 GiB)
Memory pool usage: 22.157 MiB (928.000 MiB reserved)
julia> main()
4641.488
4641.488
4641.488
4641.488
4641.488
4641.488
4641.488
4641.488
4641.488
4641.488

Running it in the REPL, i.e. not in a main() function, doesn't seem to matter.
Can you try CUDA.jl#master?
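
For reference, one way to point a throwaway environment at the development version (a sketch using the standard Pkg API):

using Pkg
Pkg.add(name="CUDA", rev="master")   # track the master branch instead of a registered release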


mfalt commented Mar 8, 2022

The problem was not solved by changing CUDA.jl versions or system drivers. However, it does not seem to be reproducible on other machines; I tried it myself on AWS without any problems.


maleadt commented Mar 8, 2022

Very strange. Are you using the same Julia binaries everywhere? And the same number of threads when launching Julia?
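
(For reference, the thread count can be set at launch and checked inside the session with standard Julia tooling, nothing CUDA-specific:)

# Launch with an explicit thread count, e.g.:
#   julia --threads=8 script.jl
# Check inside the running session:
Threads.nthreads()   # number of threads this session was started with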


mfalt commented Mar 8, 2022

Yes, I have tried several different configurations, but I am only able to reproduce it on that machine. I am not sure whether it is specific to the GPU model or even related to that specific piece of hardware.


maleadt commented Apr 26, 2024

It's been a while since there's been activity here, so I'm going to close this. If this still happens on latest master, don't hesitate to open a new issue with an updated MWE.

maleadt closed this as completed Apr 26, 2024