
v5.3.0: regression in Zygote performance #2333

Closed
AlexLewandowski opened this issue Apr 17, 2024 · 9 comments
Labels
needs information (Further information is requested), performance (How fast can we go?)

Comments

@AlexLewandowski

Describe the bug

Performance degradation on CUDA#v5.3.0 when taking gradients using Flux/Zygote.

To reproduce

The Minimal Working Example (MWE) for this bug:

import BenchmarkTools: @btime
using Flux
using CUDA
import Flux.Zygote

m = Chain(Dense(10, 512), Dense(512, 512), Dense(512, 10)) |> Flux.gpu
xs = randn(Float32, (10, 256)) |> Flux.gpu

function get_grads(m, xs)
    gs = Zygote.gradient(m) do m_
        sum(m_(xs))
    end
end

@btime get_grads($m, $xs)

# On CUDA 5.2:
# julia> @btime get_grads($m, $xs)
#   216.330 μs (585 allocations: 26.28 KiB)

# On CUDA 5.3:
# julia> @btime get_grads($m, $xs)
#  1.270 ms (1022 allocations: 34.69 KiB)

Manifest file for CUDAv5.3.0: https://gist.github.com/AlexLewandowski/e1b62445fb814d2adf1a7b87ff7d6a3b

Manifest file for CUDAv5.2.0: https://gist.github.com/AlexLewandowski/91fe5e60893039c1c45e2a317d1d7714

Expected behavior

Performance should be unaffected by the CUDA.jl version upgrade.

Version info

Details on Julia:

Julia Version 1.10.0
Commit 3120989f39b (2023-12-25 18:01 UTC)
Build Info:
  Official https://julialang.org/ release
Platform Info:
  OS: Linux (x86_64-linux-gnu)
  CPU: 32 × AMD Ryzen 9 5950X 16-Core Processor
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-15.0.7 (ORCJIT, znver3)
  Threads: 33 on 32 virtual cores

Details on CUDA#v5.3.0:

CUDA runtime 12.4, artifact installation
CUDA driver 12.2
NVIDIA driver 535.171.4

CUDA libraries: 
- CUBLAS: 12.4.5
- CURAND: 10.3.5
- CUFFT: 11.2.1
- CUSOLVER: 11.6.1
- CUSPARSE: 12.3.1
- CUPTI: 22.0.0
- NVML: 12.0.0+535.171.4

Julia packages: 
- CUDA: 5.3.0
- CUDA_Driver_jll: 0.8.1+0
- CUDA_Runtime_jll: 0.12.1+0

Toolchain:
- Julia: 1.10.0
- LLVM: 15.0.7

1 device:
  0: NVIDIA GeForce GTX 1080 Ti (sm_61, 1.004 GiB / 11.000 GiB available)

Details on CUDA#v5.2.0:

CUDA runtime 12.3, artifact installation
CUDA driver 12.2
NVIDIA driver 535.171.4

CUDA libraries: 
- CUBLAS: 12.3.4
- CURAND: 10.3.4
- CUFFT: 11.0.12
- CUSOLVER: 11.5.4
- CUSPARSE: 12.2.0
- CUPTI: 21.0.0
- NVML: 12.0.0+535.171.4

Julia packages: 
- CUDA: 5.2.0
- CUDA_Driver_jll: 0.7.0+1
- CUDA_Runtime_jll: 0.11.1+0

Toolchain:
- Julia: 1.10.0
- LLVM: 15.0.7

1 device:
  0: NVIDIA GeForce GTX 1080 Ti (sm_61, 3.067 GiB / 11.000 GiB available)

Additional context

I upgraded to v5.3.0 because I needed to take a gradient of a sorted CuArray with dims as a keyword. I'm not sure if it's the version upgrade itself or some combination of bad drivers, but I thought it was worth raising as an issue.
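
For context, a minimal sketch of the kind of call I mean (a hedged illustration: the array shape and the sum reduction are made up, and it assumes a gradient rule for sort with dims is available, which is what I'm relying on):

using CUDA
import Zygote

x = CUDA.randn(Float32, 16, 8)

# Differentiate through a column-wise sort; the dims keyword on a sorted
# CuArray is what required the v5.3.0 upgrade in my case.
g, = Zygote.gradient(x) do x_
    sum(sort(x_; dims=1))
end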

@AlexLewandowski added the bug (Something isn't working) label on Apr 17, 2024
@maleadt added the needs information (Further information is requested) and performance (How fast can we go?) labels and removed the bug (Something isn't working) label on Apr 18, 2024
@maleadt (Member) commented Apr 18, 2024

Thanks for the report. I can't reproduce this locally, or at least not to the extent you're seeing (only a 280 μs → 310 μs regression). That makes it much harder to pinpoint what exactly has slowed down. Since you see a much more pronounced slowdown, can you isolate this problem to either the CUDA.jl operation that has regressed, or the commit that did so?
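
For example (just a sketch, not a required procedure), the integrated profiler can show which kernels or API calls got slower, and Pkg can pin CUDA.jl to an arbitrary revision for bisection; rev="master" below is only a placeholder for whatever branch or commit you want to test:

using CUDA, Pkg

# Profile a single call of the MWE to see where the extra time goes.
CUDA.@profile get_grads(m, xs)

# Re-run the MWE against a specific CUDA.jl revision.
Pkg.add(Pkg.PackageSpec(name="CUDA", rev="master"))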

@maleadt changed the title from "Using CUDAv5.3.0 results in slower gradients with Zygote" to "v5.3.0: regression in Zygote performance" on Apr 18, 2024
@christiangnrd (Contributor) commented Apr 18, 2024

I took the time to bisect this because it's causing my model training to completely stall. The performance regression seems to have been introduced by #2290, but it also looks like #2327 (merged but not yet released) fixes it.

@pawbz commented Apr 18, 2024

I have the same issue after the upgrade. Please let me know if you need any other information; I have attached a Pluto file.
[screenshot]

https://gist.github.com/pawbz/36a915406266df540187049c1e0720b4

@maleadt (Member) commented Apr 18, 2024

@AlexLewandowski @pawbz Can you try the CUDA.jl master branch?

@pawbz commented Apr 18, 2024

I have tried; unfortunately, no change.
Thanks for the quick reply.
[screenshot]

@christiangnrd (Contributor)

Hey @pawbz, looking at your screenshot, I suspect your CUDA version did not update. Can you show the output of Pkg.status() in your notebook? Also restart the Pluto instance so that the correct version of CUDA actually gets loaded.

You might also want to do this in a temporary environment by adding Pkg.activate(temp=true) right after you import Pkg to avoid cluttering up your default environment.
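
Something along these lines (a sketch; the exact package list depends on what your notebook needs):

import Pkg
Pkg.activate(temp=true)                              # throwaway environment
Pkg.add(Pkg.PackageSpec(name="CUDA", rev="master"))  # CUDA.jl master branch
Pkg.add(["Flux", "BenchmarkTools"])
Pkg.status()                                         # confirm the CUDA.jl revision in use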

@jeremiedb

I just compared the original benchmark between v5.2.0 and current master:

@btime get_grads($m, $xs);
# v5.2.0:  230.077 μs (585 allocations: 26.28 KiB)
# master: 254.714 μs (889 allocations: 33.66 KiB)

The bulk of the regression is now gone. There remains a ~10% slowdown, consistent with @maleadt's result, along with increased allocations. Is that an expected impact of v5.3.0, or is it worth keeping the issue open?

@pawbz commented Apr 19, 2024

Pkg.activate(temp=true)

Here are updated screenshots, taken after restarting Pluto each time.
So basically, we get around 530 μs for both master and v5.2.0, and 1.2 ms for v5.3.0.
Thanks for the input earlier.

[screenshots: @btime output for master, v5.2.0, and v5.3.0]

@maleadt (Member) commented Apr 19, 2024

Thanks for confirming. So this was fixed by #2327.

There remains a ~10% slowdown, consistent with @maleadt's result, along with increased allocations. Is that an expected impact of v5.3.0, or is it worth keeping the issue open?

Unexpected, but probably not worth keeping the issue open over. If you can isolate this to the operation that has regressed, please open a new issue.

@maleadt closed this as completed on Apr 19, 2024