Performance regression on matrix multiplication between CUDA.jl 1.3.3 and 2.1.0/master #538
CUDA 11.2 has been released (and is supported by CUDA.jl#master), so it might be a good time to re-evaluate if the new GEMM APIs are still slower. If so, maybe we should consider using the old APIs again, but generally that's not going to be a lasting solution (I expect them to get deprecated in favor of the new APIs at some point in the future).
Maybe #671 also helps here.
Testing the dot product on CUDA.jl on Windows 10 for both CUDA 11.1 and CUDA 11.2 resulted in performance in line with what was observed on Ubuntu. That is, about 10% slower than on CUDA.jl v1.3.3, but still a great improvement over the 50%+ gap previously reported.
julia> @benchmark CUDA.@sync $x1 * $x2
BenchmarkTools.Trial:
memory estimate: 384 bytes
allocs estimate: 18
--------------
minimum time: 148.700 μs (0.00% GC)
median time: 153.900 μs (0.00% GC)
mean time: 162.191 μs (0.14% GC)
maximum time: 11.928 ms (19.38% GC)
--------------
samples: 10000
evals/sample: 1
I don't know if you feel it's worth pushing the investigation further for that remaining 10% gap. From my user perspective, that gap is not a material concern.
This issue is pretty stale, and measurements would need to be updated. If anything, the issue looked like an upstream CUDA one. If still relevant, feel free to open a new issue.
This issue is a follow-up to the discussion on Discourse.
Summary: performance regression between CUDA.jl v1.3.3 and the current/latest version, v2.1.0/master.
The performance difference is significant (~50%) on a Windows 10 machine with a GTX 1660 Ti.
The gap is also present, but much smaller (~5%-10%), on an Ubuntu 20.04 machine with an RTX 2080 Super.
No difference was observed on Windows between CUDA.jl v2.1.1 and current master (2020-11-10), other than CUDA moving from 11.1.0 to 11.1.1.
It should be noted as well that on the Ubuntu 20.04 machine, the CUDA version was 11.0 for both v1.3.3 and latest master (I don't know why master isn't running on CUDA 11.1).
@maleadt Looking back at your Discourse comment, it also appears that the test you performed on your side was slightly faster on v1.3.3 than on v2.1.0 (median time: 48.476 μs (0.00% GC) vs. median time: 53.785 μs (0.00% GC) for master), which is in line with the kind of gap I observed on my Ubuntu 2080 setup.
Maybe the gap is somehow exacerbated either on Windows or with the specific GTX 1660 Ti, which lacks tensor cores.
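For reference, a minimal sketch of the benchmark being compared across versions. The 512×512 Float32 matrix sizes here are assumptions for illustration; the original matrix dimensions are not shown in this thread. Run this under each CUDA.jl version's environment and compare the reported times.

```julia
# Hedged reproduction sketch; matrix sizes are assumed, not taken
# from the original reports.
using CUDA, BenchmarkTools

x1 = CUDA.rand(Float32, 512, 512)   # assumed size
x2 = CUDA.rand(Float32, 512, 512)

# CUDA.@sync blocks until the GPU has finished, so the timing covers
# the full GEMM execution rather than just the asynchronous launch.
@benchmark CUDA.@sync $x1 * $x2
```

Since `*` on `CuArray`s dispatches to CUBLAS, any regression here would reflect the GEMM API path chosen by the installed CUDA.jl/CUDA toolkit combination.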
Windows 10 machine
- Benchmark time: CUDA v1.3.3, CUDA v2.1.0
- Debug: CUDA v1.3.3, CUDA v2.1.0
- Profiler: v1.3.3, v2.1.0
Ubuntu 20.04
- Benchmark: CUDA v1.3.3, CUDA v2.1.0/master
- Debug: CUDA v1.3.3, CUDA v2.1.0