
Cannot reclaim GPU Memory; CUDA.reclaim() #1562

Closed
jackn11 opened this issue Jul 12, 2022 · 2 comments
Labels
bug Something isn't working

Comments

jackn11 commented Jul 12, 2022

When I set all GPU variables to nothing and call CUDA.reclaim(), my GPU memory remains full; it does not return to its initial usage.

Currently the model being loaded onto the GPU is a BERT model from Transformers.jl; it is only moved to the GPU when training or testing, and is offloaded back to the CPU when not in use.
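For reference, the load/offload pattern looks roughly like this (a minimal sketch using Flux's gpu/cpu converters; bert_model and train_func stand in for my actual model and training loop):

using Flux, CUDA

# Upload the parameters to the GPU only for the duration of training,
# then move them back to host memory when the model is idle.
bert_model = gpu(bert_model)
train_func(training_dict, bert_model)
bert_model = cpu(bert_model)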

All the code that creates the BERT model lives in a module called BERTModule, which has no global variables. I create a few BERT models in the main module's global scope by calling functions from BERTModule, and I store them in global variables there. I then train and predict with each of the models, which quickly drives up my GPU memory usage. When I afterwards set all of those global variables to nothing and call CUDA.reclaim(), my GPU memory usage drops by a few tens or hundreds of MB at most, or not at all; nowhere near its initial value.
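Concretely, the teardown step looks like this (a sketch; the variable names are illustrative):

# Drop every reference to the models, then ask Julia's GC and
# CUDA.jl's memory pool to give the device memory back.
bert_model_1 = nothing
bert_model_2 = nothing
GC.gc()               # collect the now-unreferenced CuArrays
CUDA.reclaim()        # return freed pool memory to the driver
CUDA.memory_status()  # shows how much device memory is still held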

Furthermore, when training the many BERT models sequentially, I ran out of memory while training one of them. I then called CUDA.reclaim(), which reclaimed only a small amount of GPU memory, yet when I tried training the model again it worked. The REPL output for this case is below. As you can see, I train the model (the training data and its size stay constant), the GPU runs out of memory, but after I call CUDA.reclaim() the same training call succeeds.

These appear to be bugs: CUDA.reclaim() should not need to be called explicitly, and after setting all variables to nothing and calling reclaim(), the expected behaviour is for GPU memory usage to drop back to its resting level.

In case it is relevant, I am currently using an NVIDIA GeForce GTX 1050 Ti.


julia> train_func(training_dict, bert_model)

[ Info: start training
[ Info: epoch: 1
┌ Info: training
│   loss = 0.79379857f0
└   accuracy = 0.3870967741935484
[ Info: epoch: 2
[ Info: epoch: 3
ERROR: Out of GPU memory trying to allocate 89.420 MiB
Effective GPU memory usage: 100.00% (4.000 GiB/4.000 GiB)
Memory pool usage: 1.631 GiB (3.312 GiB reserved)
Stacktrace:
  [1] macro expansion
    @ C:\Users\jackn\.julia\packages\CUDA\tTK8Y\src\pool.jl:320 [inlined]
  [2] macro expansion
    @ .\timing.jl:299 [inlined]
  [3] #_alloc#170
    @ C:\Users\jackn\.julia\packages\CUDA\tTK8Y\src\pool.jl:313 [inlined]
  [4] #alloc#169
    @ C:\Users\jackn\.julia\packages\CUDA\tTK8Y\src\pool.jl:299 [inlined]
  [5] alloc
    @ C:\Users\jackn\.julia\packages\CUDA\tTK8Y\src\pool.jl:295 [inlined]
  [6] CuArray{Float32, 2, CUDA.Mem.DeviceBuffer}(#unused#::UndefInitializer, dims::Tuple{Int64, Int64})
    @ CUDA C:\Users\jackn\.julia\packages\CUDA\tTK8Y\src\array.jl:42
  [7] similar
    @ C:\Users\jackn\.julia\packages\CUDA\tTK8Y\src\array.jl:166 [inlined]
  [8] similar
    @ .\abstractarray.jl:782 [inlined]
  [9] restructure(x::CuArray{Float32, 2, CUDA.Mem.DeviceBuffer}, y::CuArray{Float32, 2, CUDA.Mem.DeviceBuffer})
    @ ArrayInterfaceCore C:\Users\jackn\.julia\packages\ArrayInterfaceCore\nBDUl\src\ArrayInterfaceCore.jl:446
 [10] update!(opt::Flux.Optimise.Adam, x::CuArray{Float32, 2, CUDA.Mem.DeviceBuffer}, x̄::CuArray{Float32, 2, CUDA.Mem.DeviceBuffer})
    @ Flux.Optimise C:\Users\jackn\.julia\packages\Flux\KkC79\src\optimise\train.jl:16
 [11] update!(opt::Flux.Optimise.Adam, xs::Zygote.Params{Zygote.Buffer{Any, Vector{Any}}}, gs::Zygote.Grads)
    @ Flux.Optimise C:\Users\jackn\.julia\packages\Flux\KkC79\src\optimise\train.jl:24
 [12] train(bert_model::Transformers.Basic.TransformerModel{Transformers.Basic.CompositeEmbedding{Float32, NamedTuple{(:tok, :segment, :pe), Tuple{Transformers.Basic.Embed{Float32, CuArray{Float32, 2, CUDA.Mem.DeviceBuffer}}, Transformers.Basic.Embed{Float32, CuArray{Float32, 2, CUDA.Mem.DeviceBuffer}}, Transformers.Basic.PositionEmbedding{Float32, CuArray{Float32, 2, CUDA.Mem.DeviceBuffer}}}}, NamedTuple{(:tok, :segment, :pe), Tuple{typeof(+), typeof(+), typeof(+)}}, Transformers.Basic.Positionwise{Tuple{Flux.LayerNorm{typeof(identity), Flux.Scale{typeof(identity), CuArray{Float32, 1, CUDA.Mem.DeviceBuffer}, CuArray{Float32, 1, CUDA.Mem.DeviceBuffer}}, Float32, 1}, Flux.Dropout{Float64, Colon, CUDA.RNG}}}}, Transformers.BidirectionalEncoder.Bert{Transformers.Stacks.Stack{Symbol("((x, m) => x':(x, m)) => 12"), NTuple{12, Transformers.Basic.Transformer{Transformers.Basic.MultiheadAttention{Flux.Dense{typeof(identity), CuArray{Float32, 2, CUDA.Mem.DeviceBuffer}, CuArray{Float32, 1, CUDA.Mem.DeviceBuffer}}, Flux.Dense{typeof(identity), CuArray{Float32, 2, CUDA.Mem.DeviceBuffer}, CuArray{Float32, 1, CUDA.Mem.DeviceBuffer}}, Flux.Dense{typeof(identity), CuArray{Float32, 2, CUDA.Mem.DeviceBuffer}, CuArray{Float32, 1, CUDA.Mem.DeviceBuffer}}, Flux.Dense{typeof(identity), CuArray{Float32, 2, CUDA.Mem.DeviceBuffer}, CuArray{Float32, 1, CUDA.Mem.DeviceBuffer}}, Flux.Dropout{Float64, Colon, CUDA.RNG}}, Flux.LayerNorm{typeof(identity), Flux.Scale{typeof(identity), CuArray{Float32, 1, CUDA.Mem.DeviceBuffer}, CuArray{Float32, 1, CUDA.Mem.DeviceBuffer}}, Float32, 1}, Transformers.Basic.PwFFN{Flux.Dense{typeof(NNlib.gelu), CuArray{Float32, 2, CUDA.Mem.DeviceBuffer}, CuArray{Float32, 1, CUDA.Mem.DeviceBuffer}}, Flux.Dense{typeof(identity), CuArray{Float32, 2, CUDA.Mem.DeviceBuffer}, CuArray{Float32, 1, CUDA.Mem.DeviceBuffer}}}, Flux.LayerNorm{typeof(identity), Flux.Scale{typeof(identity), CuArray{Float32, 1, CUDA.Mem.DeviceBuffer}, CuArray{Float32, 1, CUDA.Mem.DeviceBuffer}}, Float32, 1}, Flux.Dropout{Float64, Colon, CUDA.RNG}}}}, 
Flux.Dropout{Float64, Colon, CUDA.RNG}}, NamedTuple{(:pooler, :clf), Tuple{Flux.Dense{typeof(tanh), 
CuArray{Float32, 2, CUDA.Mem.DeviceBuffer}, CuArray{Float32, 1, CUDA.Mem.DeviceBuffer}}, Flux.Chain{Tuple{Flux.Dropout{Float64, Colon, CUDA.RNG}, Flux.Dense{typeof(identity), CuArray{Float32, 2, CUDA.Mem.DeviceBuffer}, CuArray{Float32, 1, CUDA.Mem.DeviceBuffer}}, typeof(NNlib.logsoftmax)}}}}}, bertenc::Transformers.BidirectionalEncoder.BertTextEncoder{Transformers.Basic.TextTokenizer{Transformers.BidirectionalEncoder.WordPieceTokenization{Transformers.BidirectionalEncoder.BertUnCasedPreTokenization}}, TextEncodeBase.Vocab{String, StaticArraysCore.SizedVector{30522, String, Vector{String}}}, FuncPipelines.Pipelines{Tuple{FuncPipelines.Pipeline{:tok, FuncPipelines.ApplyN{1, Base.Fix1{typeof(TextEncodeBase.nestedcall), typeof(Transformers.Basic.string_getvalue)}}}, FuncPipelines.Pipeline{:tok, FuncPipelines.ApplyN{2, FuncPipelines.ApplySyms{:tok, FuncPipelines.FixRest{typeof(Transformers.BidirectionalEncoder.with_firsthead_tail), Tuple{String, String}}}}}, FuncPipelines.Pipeline{(:tok, :segment), FuncPipelines.ApplyN{2, FuncPipelines.ApplySyms{:tok, typeof(Transformers.BidirectionalEncoder.segment_and_concat)}}}, FuncPipelines.Pipeline{:trunc_tok, FuncPipelines.ApplyN{2, FuncPipelines.ApplySyms{:tok, FuncPipelines.FixRest{typeof(TextEncodeBase.trunc_and_pad), Tuple{Nothing, String}}}}}, FuncPipelines.Pipeline{:trunc_len, FuncPipelines.ApplyN{2, FuncPipelines.ApplySyms{:trunc_tok, 
typeof(TextEncodeBase.nestedmaxlength)}}}, FuncPipelines.Pipeline{:mask, FuncPipelines.ApplyN{2, FuncPipelines.ApplySyms{(:tok, :trunc_len), typeof(Transformers.Basic.getmask)}}}, FuncPipelines.Pipeline{:tok, FuncPipelines.ApplyN{2, FuncPipelines.ApplySyms{:trunc_tok, typeof(TextEncodeBase.nested2batch)}}}, FuncPipelines.Pipeline{:segment, FuncPipelines.ApplyN{2, FuncPipelines.ApplySyms{:segment, 
ComposedFunction{typeof(TextEncodeBase.nested2batch), FuncPipelines.FixRest{typeof(TextEncodeBase.trunc_and_pad), Tuple{Nothing, Int64}}}}}}, FuncPipelines.Pipeline{:input, FuncPipelines.ApplyN{2, FuncPipelines.ApplySyms{(:tok, :segment), ComposedFunction{Type{NamedTuple{(:tok, :segment)}}, typeof(tuple)}}}}, FuncPipelines.PipeGet{(:input, :mask)}}}}, labels::TextEncodeBase.Vocab{String, StaticArraysCore.SizedVector{2, String, Vector{String}}}, training_dict::Dict{Int64, Vector{String}})        
    @ Main.Berts c:\Users\jackn\Documents\GitHub\GitHub2\Chat\NewBerts6.jl:100
 [13] (::Main.Berts.var"#generate_trainer#8"{TextEncodeBase.Vocab{String, StaticArraysCore.SizedVector{2, String, Vector{String}}}, Transformers.BidirectionalEncoder.BertTextEncoder{Transformers.Basic.TextTokenizer{Transformers.BidirectionalEncoder.WordPieceTokenization{Transformers.BidirectionalEncoder.BertUnCasedPreTokenization}}, TextEncodeBase.Vocab{String, StaticArraysCore.SizedVector{30522, 
String, Vector{String}}}, FuncPipelines.Pipelines{Tuple{FuncPipelines.Pipeline{:tok, FuncPipelines.ApplyN{1, Base.Fix1{typeof(TextEncodeBase.nestedcall), typeof(Transformers.Basic.string_getvalue)}}}, FuncPipelines.Pipeline{:tok, FuncPipelines.ApplyN{2, FuncPipelines.ApplySyms{:tok, FuncPipelines.FixRest{typeof(Transformers.BidirectionalEncoder.with_firsthead_tail), Tuple{String, String}}}}}, FuncPipelines.Pipeline{(:tok, :segment), FuncPipelines.ApplyN{2, FuncPipelines.ApplySyms{:tok, typeof(Transformers.BidirectionalEncoder.segment_and_concat)}}}, FuncPipelines.Pipeline{:trunc_tok, FuncPipelines.ApplyN{2, FuncPipelines.ApplySyms{:tok, FuncPipelines.FixRest{typeof(TextEncodeBase.trunc_and_pad), Tuple{Nothing, String}}}}}, FuncPipelines.Pipeline{:trunc_len, FuncPipelines.ApplyN{2, FuncPipelines.ApplySyms{:trunc_tok, typeof(TextEncodeBase.nestedmaxlength)}}}, FuncPipelines.Pipeline{:mask, FuncPipelines.ApplyN{2, FuncPipelines.ApplySyms{(:tok, :trunc_len), typeof(Transformers.Basic.getmask)}}}, FuncPipelines.Pipeline{:tok, FuncPipelines.ApplyN{2, FuncPipelines.ApplySyms{:trunc_tok, typeof(TextEncodeBase.nested2batch)}}}, FuncPipelines.Pipeline{:segment, FuncPipelines.ApplyN{2, FuncPipelines.ApplySyms{:segment, ComposedFunction{typeof(TextEncodeBase.nested2batch), FuncPipelines.FixRest{typeof(TextEncodeBase.trunc_and_pad), Tuple{Nothing, Int64}}}}}}, FuncPipelines.Pipeline{:input, FuncPipelines.ApplyN{2, FuncPipelines.ApplySyms{(:tok, :segment), ComposedFunction{Type{NamedTuple{(:tok, :segment)}}, typeof(tuple)}}}}, FuncPipelines.PipeGet{(:input, :mask)}}}}})(train_dict::Dict{Int64, Vector{String}}, bert_model_container::Main.Berts.BertModelContainer)
    @ Main.Berts c:\Users\jackn\Documents\GitHub\GitHub2\Chat\NewBerts6.jl:291
 [14] top-level scope
    @ c:\Users\jackn\Documents\GitHub\GitHub2\Chat\NewBerts6.jl:327

julia> CUDA.reclaim()

julia> train_func(training_dict, bert_model)

[ Info: start training
[ Info: epoch: 1
┌ Info: training
│   loss = 0.78859687f0
└   accuracy = 0.41935483870967744
[ Info: epoch: 2
[ Info: epoch: 3
[ Info: epoch: 4
[ Info: epoch: 5
[ Info: epoch: 6
[ Info: epoch: 7
[ Info: epoch: 8
[ Info: epoch: 9
[ Info: epoch: 10
[ Info: epoch: 11
[ Info: epoch: 12
[ Info: epoch: 13
[ Info: epoch: 14
[ Info: epoch: 15
[ Info: epoch: 16
[ Info: epoch: 17
┌ Info: training
│   loss = 0.6262033f0
└   accuracy = 0.6451612903225806
[ Info: epoch: 18
[ Info: epoch: 19
[ Info: epoch: 20
[ Info: epoch: 21
[ Info: epoch: 22
[ Info: epoch: 23
[ Info: epoch: 24
[ Info: epoch: 25
[ Info: epoch: 26
[ Info: epoch: 27
[ Info: epoch: 28
[ Info: epoch: 29
[ Info: epoch: 30
[ Info: epoch: 31
[ Info: epoch: 32
[ Info: epoch: 33
┌ Info: training
│   loss = 0.63867134f0
└   accuracy = 0.6129032258064516
[ Info: epoch: 34
[ Info: epoch: 35
[ Info: epoch: 36
[ Info: epoch: 37
[ Info: epoch: 38
[ Info: epoch: 39
[ Info: epoch: 40
[ Info: epoch: 41
[ Info: epoch: 42
[ Info: epoch: 43
[ Info: epoch: 44
[ Info: epoch: 45
[ Info: epoch: 46
[ Info: epoch: 47
[ Info: epoch: 48
[ Info: epoch: 49
┌ Info: training
│   loss = 0.47331774f0
└   accuracy = 0.9354838709677419
[ Info: epoch: 50
[ Info: epoch: 51
[ Info: epoch: 52
[ Info: epoch: 53
[ Info: epoch: 54
[ Info: epoch: 55
[ Info: epoch: 56
[ Info: epoch: 57
[ Info: epoch: 58
[ Info: epoch: 59
[ Info: epoch: 60
[ Info: epoch: 61
[ Info: epoch: 62
[ Info: epoch: 63
[ Info: epoch: 64
[ Info: epoch: 65
┌ Info: training
│   loss = 0.5309651f0
└   accuracy = 0.6774193548387096
[ Info: epoch: 66
[ Info: epoch: 67
[ Info: epoch: 68
[ Info: epoch: 69
[ Info: epoch: 70
[ Info: epoch: 71
[ Info: epoch: 72
[ Info: epoch: 73
[ Info: epoch: 74
[ Info: epoch: 75
[ Info: epoch: 76
[ Info: epoch: 77
[ Info: epoch: 78
[ Info: epoch: 79
[ Info: epoch: 80
[ Info: epoch: 81
┌ Info: training
│   loss = 0.4070098f0
└   accuracy = 0.8387096774193549
[ Info: epoch: 82
[ Info: epoch: 83
[ Info: epoch: 84
[ Info: epoch: 85
[ Info: epoch: 86
[ Info: epoch: 87
[ Info: epoch: 88
[ Info: epoch: 89
[ Info: epoch: 90
[ Info: epoch: 91
[ Info: epoch: 92
[ Info: epoch: 93
[ Info: epoch: 94
[ Info: epoch: 95
[ Info: epoch: 96
[ Info: epoch: 97
┌ Info: training
│   loss = 0.37719026f0
└   accuracy = 0.9032258064516129
[ Info: epoch: 98
[ Info: epoch: 99
[ Info: epoch: 100
[ Info: testing
┌ Info: testing
└   accuracy = 0.967741935483871
jackn11 added the bug label Jul 12, 2022
jackn11 changed the title from "CUDA.reclaim() Not Working" to "Cannot reclaim GPU Memory; CUDA.reclaim()" Jul 12, 2022
jackn11 (Author) commented Jul 12, 2022

Is it possible that the compiled functions from BERTModule are taking up storage on the GPU? If so, is there a way to clear some of those functions from memory?
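Would something as heavy-handed as a full device reset be the only option? A sketch of what I mean (note this invalidates every live GPU object, so all models would have to be re-uploaded afterwards):

# Tears down the CUDA context: pooled memory and compiled kernels
# are released, but every existing CuArray becomes unusable.
CUDA.device_reset!()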

maleadt (Member) commented Apr 27, 2024

Memory handling and GC integration have changed significantly since this was reported, so I don't think this issue is still relevant. If the problem persists on CUDA.jl#master, feel free to open a new issue!
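For anyone re-testing, a minimal check on a current version could look like this (a sketch; the allocation size is arbitrary):

using CUDA

CUDA.memory_status()                    # baseline device memory usage
x = CUDA.rand(Float32, 10_000, 10_000)  # ~400 MB allocation
x = nothing
GC.gc()
CUDA.reclaim()
CUDA.memory_status()                    # should be back near the baseline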

maleadt closed this as completed Apr 27, 2024