
v5.3.0: regression in Zygote performance #2333

Closed
AlexLewandowski opened this issue Apr 17, 2024 · 9 comments
Labels
needs information (Further information is requested), performance (How fast can we go?)

Comments

@AlexLewandowski

Describe the bug

Performance degradation on CUDA#v5.3.0 when taking gradients using Flux/Zygote.

To reproduce

The Minimal Working Example (MWE) for this bug:

import BenchmarkTools: @btime
using Flux
using CUDA
import Flux.Zygote

m = Chain(Dense(10, 512), Dense(512, 512), Dense(512, 10)) |> Flux.gpu
xs = randn(Float32, (10, 256)) |> Flux.gpu

function get_grads(m, xs)
    gs = Zygote.gradient(m) do m_
        sum(m_(xs))
    end
end

@btime get_grads($m, $xs)

# On CUDA 5.2:
# julia> @btime get_grads($m, $xs)
#   216.330 μs (585 allocations: 26.28 KiB)

# On CUDA 5.3:
# julia> @btime get_grads($m, $xs)
#  1.270 ms (1022 allocations: 34.69 KiB)

Manifest file for CUDAv5.3.0: https://gist.github.com/AlexLewandowski/e1b62445fb814d2adf1a7b87ff7d6a3b

Manifest file for CUDAv5.2.0: https://gist.github.com/AlexLewandowski/91fe5e60893039c1c45e2a317d1d7714

Expected behavior

Performance should be unaffected by the CUDA.jl version upgrade.

Version info

Details on Julia:

Julia Version 1.10.0
Commit 3120989f39b (2023-12-25 18:01 UTC)
Build Info:
  Official https://julialang.org/ release
Platform Info:
  OS: Linux (x86_64-linux-gnu)
  CPU: 32 × AMD Ryzen 9 5950X 16-Core Processor
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-15.0.7 (ORCJIT, znver3)
  Threads: 33 on 32 virtual cores

Details on CUDA#v5.3.0:

CUDA runtime 12.4, artifact installation
CUDA driver 12.2
NVIDIA driver 535.171.4

CUDA libraries: 
- CUBLAS: 12.4.5
- CURAND: 10.3.5
- CUFFT: 11.2.1
- CUSOLVER: 11.6.1
- CUSPARSE: 12.3.1
- CUPTI: 22.0.0
- NVML: 12.0.0+535.171.4

Julia packages: 
- CUDA: 5.3.0
- CUDA_Driver_jll: 0.8.1+0
- CUDA_Runtime_jll: 0.12.1+0

Toolchain:
- Julia: 1.10.0
- LLVM: 15.0.7

1 device:
  0: NVIDIA GeForce GTX 1080 Ti (sm_61, 1.004 GiB / 11.000 GiB available)

Details on CUDA#v5.2.0:

CUDA runtime 12.3, artifact installation
CUDA driver 12.2
NVIDIA driver 535.171.4

CUDA libraries: 
- CUBLAS: 12.3.4
- CURAND: 10.3.4
- CUFFT: 11.0.12
- CUSOLVER: 11.5.4
- CUSPARSE: 12.2.0
- CUPTI: 21.0.0
- NVML: 12.0.0+535.171.4

Julia packages: 
- CUDA: 5.2.0
- CUDA_Driver_jll: 0.7.0+1
- CUDA_Runtime_jll: 0.11.1+0

Toolchain:
- Julia: 1.10.0
- LLVM: 15.0.7

1 device:
  0: NVIDIA GeForce GTX 1080 Ti (sm_61, 3.067 GiB / 11.000 GiB available)

Additional context

I upgraded to v5.3.0 because I needed to take a gradient of a sorted CuArray with dims as a keyword. I'm not sure if it's the version upgrade itself or some combination of bad drivers, but I thought it was worth raising as an issue.
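
For context, a minimal sketch of the kind of call I mean (a hedged illustration: the array shape and the sum reduction are made up, and it assumes a gradient rule for sort with dims is available, which is what I'm relying on):

using CUDA
import Zygote

x = CUDA.randn(Float32, 16, 8)

# Differentiate through a column-wise sort; the dims keyword on a sorted
# CuArray is what required the v5.3.0 upgrade in my case.
g, = Zygote.gradient(x) do x_
    sum(sort(x_; dims=1))
end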

@AlexLewandowski added the bug (Something isn't working) label on Apr 17, 2024
@maleadt added the needs information (Further information is requested) and performance (How fast can we go?) labels and removed the bug (Something isn't working) label on Apr 18, 2024
@maleadt (Member) commented Apr 18, 2024

Thanks for the report. I can't reproduce this locally, or at least not to the extent you're seeing (only a 280 μs → 310 μs regression). That makes it much harder to pinpoint what exactly has slowed down. Since you see a much more pronounced slowdown, can you isolate this problem to either the CUDA.jl operation that has regressed, or the commit that did so?
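
For example (just a sketch, not a required procedure), the integrated profiler can show which kernels or API calls got slower, and Pkg can pin CUDA.jl to an arbitrary revision for bisection; rev="master" below is only a placeholder for whatever branch or commit you want to test:

using CUDA, Pkg

# Profile a single call of the MWE to see where the extra time goes.
CUDA.@profile get_grads(m, xs)

# Re-run the MWE against a specific CUDA.jl revision.
Pkg.add(Pkg.PackageSpec(name="CUDA", rev="master"))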

@maleadt changed the title from "Using CUDAv5.3.0 results in slower gradients with Zygote" to "v5.3.0: regression in Zygote performance" on Apr 18, 2024
@christiangnrd (Contributor) commented Apr 18, 2024

I took the time to bisect this because it's causing my model training to completely stall. The performance regression seems to have been introduced by #2290, but it also looks like #2327 (merged but not yet released) fixes it.

@pawbz commented Apr 18, 2024

I have the same issue after the upgrade. Please let me know if you need any other information; I have attached a Pluto file.
[screenshot]

https://gist.github.com/pawbz/36a915406266df540187049c1e0720b4

@maleadt (Member) commented Apr 18, 2024

@AlexLewandowski @pawbz Can you try the CUDA.jl master branch?

@pawbz commented Apr 18, 2024

I have tried; unfortunately, no change.
Thanks for the quick reply.
[screenshot]

@christiangnrd (Contributor)

Hey @pawbz, looking at your screenshot, I suspect your CUDA version did not update. Can you show the output of Pkg.status() in your notebook? Also restart the Pluto instance so that the correct version of CUDA actually gets loaded.

You might also want to do this in a temporary environment by adding Pkg.activate(temp=true) right after you import Pkg to avoid cluttering up your default environment.
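
Something along these lines (a sketch; the exact package list depends on what your notebook needs):

import Pkg
Pkg.activate(temp=true)                              # throwaway environment
Pkg.add(Pkg.PackageSpec(name="CUDA", rev="master"))  # CUDA.jl master branch
Pkg.add(["Flux", "BenchmarkTools"])
Pkg.status()                                         # confirm the CUDA.jl revision in use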

@jeremiedb

I just compared the original benchmark between v5.2.0 and current master:

@btime get_grads($m, $xs);
# v5.2.0:  230.077 μs (585 allocations: 26.28 KiB)
# master: 254.714 μs (889 allocations: 33.66 KiB)

The bulk of the regression is now gone. There remains a ~10% slowdown, consistent with @maleadt's result, along with increased allocations. Is that an expected impact of v5.3.0, or is it worth keeping the issue open?

@pawbz commented Apr 19, 2024

Pkg.activate(temp=true)

Here are updated screenshots, taken after restarting Pluto each time.
So basically, we get around 530 μs for both master and v5.2.0, and 1.2 ms for v5.3.0.
Thanks for the input earlier.

[screenshots: @btime output for master, v5.2.0, and v5.3.0]

@maleadt (Member) commented Apr 19, 2024

Thanks for confirming. So this was fixed by #2327.

There remains a ~10% slowdown, consistent with @maleadt's result, along with increased allocations. Is that an expected impact of v5.3.0, or is it worth keeping the issue open?

Unexpected, but probably not worth keeping the issue open over. If you can isolate this to the operation that has regressed, please open a new issue.

@maleadt closed this as completed on Apr 19, 2024