
Consider running GC when allocating and synchronizing #2304

Merged
maleadt merged 7 commits into master from tb/maybe_collect on Apr 22, 2024
Conversation

maleadt (Member) commented Mar 26, 2024

Implements #2303; cc @gbaraldi

maleadt added the enhancement (New feature or request) and cuda array (Stuff about CuArray.) labels on Mar 26, 2024
Review thread on src/pool.jl (outdated, resolved):
codecov bot commented Mar 26, 2024

Codecov Report

Attention: Patch coverage is 92.10526%, with 6 lines in your changes missing coverage. Please review.

Project coverage is 60.45%. Comparing base (c5fcd73) to head (577beb9).

Files                             Patch %   Lines
src/pool.jl                       93.75%    4 Missing ⚠️
lib/cudadrv/synchronization.jl    88.88%    1 Missing ⚠️
lib/cudnn/src/convolution.jl      0.00%     1 Missing ⚠️
Additional details and impacted files
@@             Coverage Diff             @@
##           master    #2304       +/-   ##
===========================================
- Coverage   71.86%   60.45%   -11.42%     
===========================================
  Files         155      155               
  Lines       15020    14959       -61     
===========================================
- Hits        10794     9043     -1751     
- Misses       4226     5916     +1690     


mutable struct AllocStats
    alloc_count::Threads.Atomic{Int}
    alloc_bytes::Threads.Atomic{Int}

    free_count::Threads.Atomic{Int}
    free_bytes::Threads.Atomic{Int}

-   total_time::MaybeAtomicFloat64
+   total_time::Threads.Atomic{Float64}
Member:
Can we use @atomic total_time::Float64?

maleadt (Member, Author):
I guess we could update almost all of these then?
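
For reference, a minimal sketch of the two approaches under discussion, assuming Julia 1.7 or newer; the struct names are hypothetical and not the actual CUDA.jl definitions:

# Option A: wrap the field in Threads.Atomic, as in the diff above.
mutable struct AllocStatsA
    total_time::Threads.Atomic{Float64}
end
a = AllocStatsA(Threads.Atomic{Float64}(0.0))
Threads.atomic_add!(a.total_time, 1.5)   # atomic read-modify-write
a.total_time[]                           # read the current value

# Option B: declare the field itself atomic, as suggested in the review.
mutable struct AllocStatsB
    @atomic total_time::Float64
end
b = AllocStatsB(0.0)
@atomic b.total_time += 1.5              # atomic read-modify-write
@atomic b.total_time                     # atomic read

The @atomic field form stores the value inline in the struct instead of in a separately allocated Threads.Atomic box, which is presumably the motivation for the suggestion.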

BioTurboNick commented Mar 30, 2024

Trying this PR on my example from FluxML/Flux.jl#2414 on my NVIDIA RTX 4080 on Windows 11

... and it hung Julia and made the process unkillable in Task Manager.

EDIT: apparently an unkillable process is often caused by a driver fault (https://superuser.com/questions/136272/how-can-i-kill-an-unkillable-process). Eventually I got a BSOD for KMODE_EXCEPTION_NOT_HANDLED in the NVIDIA driver.

EDIT 2: Though note this comparison is against 5.2, not the current master... and the base of the branch hung my whole system 🙃

CarloLucibello (Contributor) commented:

The example in https://discourse.julialang.org/t/gpu-memory-usage-increasing-on-each-epoch-flux/112942/2
is worth benchmarking as well. Let me know if you need help with testing!

maleadt commented Apr 18, 2024

Thanks, the example from https://discourse.julialang.org/t/gpu-memory-usage-increasing-on-each-epoch-flux/112942 was useful. I tuned the heuristic: in low-memory situations it significantly improves performance, while with more reasonable amounts of memory available it smooths out the cost of garbage collection, resulting in slightly shorter pauses and more consistent execution times. The heuristic can be disabled by setting JULIA_CUDA_GC_EARLY=false, making it easy to compare.

Taking the Flux example from Discourse:

using Flux
using MLUtils: DataLoader
using CUDA
using NVTX

function increasing_gpu_memory_usage()
    n_obs = 300_000
    n_feature = 1000
    X = rand(n_feature, n_obs)
    y = rand(1, n_obs)
    train_data = DataLoader((X, y) |> gpu; batchsize = 2048, shuffle = false)

    model = Dense(n_feature, 1) |> gpu
    loss(m, _x, _y) = Flux.Losses.mse(m(_x), _y)
    opt_state = Flux.setup(Flux.Adam(), model)
    Flux.train!(loss, model, train_data, opt_state)
    total_time = @elapsed begin
        CUDA.@profile external=true for epoch in 1:100
            NVTX.@range "Epoch $epoch" begin
                train_time = @elapsed Flux.train!(loss, model, train_data, opt_state)
                @info "Epoch $(epoch) train time $(round(train_time, digits=3))"
            end
        end
    end
    @info "Total time $(round(total_time, digits=3))"
    return
end

isinteractive() || increasing_gpu_memory_usage()

Old behavior, with a 4GiB memory limit:

❯ JULIA_CUDA_GC_EARLY=false JULIA_CUDA_HARD_MEMORY_LIMIT=4GiB julia --project wip.jl
...
[ Info: Epoch 90 train time 0.031
retry_reclaim: freed 2.865 GiB
[ Info: Epoch 91 train time 0.031
[ Info: Epoch 92 train time 0.027
retry_reclaim: freed 2.865 GiB
[ Info: Epoch 93 train time 0.03
retry_reclaim: freed 2.873 GiB
[ Info: Epoch 94 train time 0.031
retry_reclaim: freed 2.873 GiB
[ Info: Epoch 95 train time 0.03
retry_reclaim: freed 2.873 GiB
[ Info: Epoch 96 train time 0.031
[ Info: Epoch 97 train time 0.027
retry_reclaim: freed 2.873 GiB
[ Info: Epoch 98 train time 0.031
retry_reclaim: freed 2.865 GiB
[ Info: Epoch 99 train time 0.031
retry_reclaim: freed 2.865 GiB
[ Info: Epoch 100 train time 0.031
[ Info: Total time 4.307

Note that it takes a while to reach steady-state, so I'm only showing the final epochs.

Enabling the new heuristic:

❯ JULIA_CUDA_GC_EARLY=true JULIA_CUDA_HARD_MEMORY_LIMIT=4GiB julia --project wip.jl
...
[ Info: Epoch 90 train time 0.031
maybe_collect: collected 1.877 GiB (9.0% < 10.0%) while using 3.003 GiB/4.000 GiB = 75.09% (> 75.0%) memory
maybe_collect: collected 1.869 GiB (9.0% < 10.0%) while using 3.004 GiB/4.000 GiB = 75.09% (> 75.0%) memory
[ Info: Epoch 91 train time 0.033
maybe_collect: collected 1.877 GiB (9.0% < 10.0%) while using 3.003 GiB/4.000 GiB = 75.09% (> 75.0%) memory
[ Info: Epoch 92 train time 0.031
maybe_collect: collected 1.877 GiB (9.0% < 10.0%) while using 3.003 GiB/4.000 GiB = 75.09% (> 75.0%) memory
[ Info: Epoch 93 train time 0.031
maybe_collect: collected 1.877 GiB (9.0% < 10.0%) while using 3.003 GiB/4.000 GiB = 75.09% (> 75.0%) memory
[ Info: Epoch 94 train time 0.03
maybe_collect: collected 1.877 GiB (9.0% < 10.0%) while using 3.003 GiB/4.000 GiB = 75.09% (> 75.0%) memory
[ Info: Epoch 95 train time 0.03
maybe_collect: collected 1.877 GiB (9.0% < 10.0%) while using 3.003 GiB/4.000 GiB = 75.09% (> 75.0%) memory
maybe_collect: collected 1.869 GiB (9.0% < 10.0%) while using 3.004 GiB/4.000 GiB = 75.09% (> 75.0%) memory
[ Info: Epoch 96 train time 0.033
maybe_collect: collected 1.877 GiB (9.0% < 10.0%) while using 3.003 GiB/4.000 GiB = 75.09% (> 75.0%) memory
[ Info: Epoch 97 train time 0.03
maybe_collect: collected 1.877 GiB (9.0% < 10.0%) while using 3.003 GiB/4.000 GiB = 75.09% (> 75.0%) memory
[ Info: Epoch 98 train time 0.03
maybe_collect: collected 1.877 GiB (9.0% < 10.0%) while using 3.003 GiB/4.000 GiB = 75.09% (> 75.0%) memory
[ Info: Epoch 99 train time 0.03
maybe_collect: collected 1.877 GiB (9.0% < 10.0%) while using 3.003 GiB/4.000 GiB = 75.09% (> 75.0%) memory
[ Info: Epoch 100 train time 0.03
[ Info: Total time 3.76

Part of the advantage seems to come from the fact that collecting earlier makes it possible for memory to become available to the memory pool without having to explicitly synchronize. Before, we called the GC when we were already at 100% memory usage, and because memory gets freed asynchronously (i.e. it only becomes available when the free actually executes), that often meant we also had to wait for the GPU to finish its current work. Now, because we collect earlier, the frees get a chance to materialize, obviating the explicit synchronization.
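
As a rough illustration of the idea (not the actual pool.jl code), a heuristic along these lines checks memory pressure before allocating and triggers a collection once usage crosses a threshold; used_bytes and limit_bytes below are hypothetical stand-ins for the allocator's bookkeeping, and the 75% threshold mirrors the one visible in the logs above:

# Sketch of an "early GC" heuristic; not the CUDA.jl implementation.
const GC_EARLY = get(ENV, "JULIA_CUDA_GC_EARLY", "true") == "true"

used_bytes()  = 3 << 30   # placeholder: pretend 3 GiB of the pool is in use
limit_bytes() = 4 << 30   # placeholder: pretend a 4 GiB hard memory limit

function maybe_collect(; threshold = 0.75)
    GC_EARLY || return
    pressure = used_bytes() / limit_bytes()
    if pressure > threshold
        # Collect before the pool is exhausted, so that frees queued by array
        # finalizers can become available without an explicit device sync.
        GC.gc(false)      # incremental collection
    end
    return
end

maybe_collect()   # would be called from the allocation (and synchronization) path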

Everybody, please test this out on your code, or share (easily reproducible) MWEs that illustrate problems.

maleadt marked this pull request as ready for review on April 18, 2024 19:38

maleadt commented Apr 22, 2024

I don't consider this ready, but am going to go ahead and merge this to avoid excessive conflicts with the memory refactor I'm doing in #2335.

maleadt merged commit 7e07ecc into master on Apr 22, 2024
1 check passed
maleadt deleted the tb/maybe_collect branch on April 22, 2024 07:38