Performance regression on matrix multiplication between CUDA.jl 1.3.3 and 2.1.0/master #538
CUDA 11.2 has been released (and is supported by CUDA.jl#master), so it might be a good time to re-evaluate if the new GEMM APIs are still slower. If so, maybe we should consider using the old APIs again, but generally that's not going to be a lasting solution (I expect them to get deprecated in favor of the new APIs at some point in the future).
Maybe #671 also helps here.
Testing the dot product on CUDA.jl on Windows 10 for both CUDA 11.1 and CUDA 11.2 resulted in performance in line with what was observed on Ubuntu. That is, about 10% slower than on CUDA.jl v1.3.3, but still a great improvement over the 50%+ gap previously reported.
julia> @benchmark CUDA.@sync $x1 * $x2
BenchmarkTools.Trial:
memory estimate: 384 bytes
allocs estimate: 18
--------------
minimum time: 148.700 μs (0.00% GC)
median time: 153.900 μs (0.00% GC)
mean time: 162.191 μs (0.14% GC)
maximum time: 11.928 ms (19.38% GC)
--------------
samples: 10000
evals/sample: 1
I don't know if you feel it's worth pushing the investigation further for that remaining 10% gap. From my user perspective, that gap is not a material concern.
This issue is pretty stale, and measurements would need to be updated. If anything, the issue looked like an upstream CUDA one. If still relevant, feel free to open a new issue.
This issue is a follow-up to the discussion on Discourse.
Summary: performance regression between CUDA.jl v1.3.3 and the current/latest version, v2.1.0/master.
The performance difference is significant (~50%) on a Windows 10 machine with a GTX 1660 Ti.
The gap is also present, but much smaller (~5%-10%), on an Ubuntu 20.04 machine with an RTX 2080 Super.
No difference was observed on Windows between CUDA.jl v2.1.1 and current master (2020-11-10), other than CUDA moving from 11.1.0 to 11.1.1.
It should be noted as well that on the Ubuntu 20.04 machine, the CUDA version was 11.0 for both v1.3.3 and latest master (I don't know why master isn't running on CUDA 11.1).
@maleadt Looking back at your Discourse comment, it also appears that the test you performed on your side was slightly faster on v1.3.3 than on v2.1.0 (median time: 48.476 μs (0.00% GC) vs. median time: 53.785 μs (0.00% GC) for master), which is in line with the kind of gap I observed on my Ubuntu 2080 setup.
Maybe the gap is somehow exacerbated either on Windows or with the specific GTX 1660 Ti, which lacks tensor cores.
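For reference, a minimal sketch of the benchmark being compared across versions. The 512×512 Float32 matrix sizes here are assumptions for illustration; the original matrix dimensions are not shown in this thread. Run this under each CUDA.jl version's environment and compare the reported times.

```julia
# Hedged reproduction sketch; matrix sizes are assumed, not taken
# from the original reports.
using CUDA, BenchmarkTools

x1 = CUDA.rand(Float32, 512, 512)   # assumed size
x2 = CUDA.rand(Float32, 512, 512)

# CUDA.@sync blocks until the GPU has finished, so the timing covers
# the full GEMM execution rather than just the asynchronous launch.
@benchmark CUDA.@sync $x1 * $x2
```

Since `*` on `CuArray`s dispatches to CUBLAS, any regression here would reflect the GEMM API path chosen by the installed CUDA.jl/CUDA toolkit combination.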
Windows 10 machine
- Benchmark time: CUDA v1.3.3, CUDA v2.1.0
- Debug: CUDA v1.3.3, CUDA v2.1.0
- Profiler: v1.3.3, v2.1.0
Ubuntu 20.04
- Benchmark: CUDA v1.3.3, CUDA v2.1.0/master
- Debug: CUDA v1.3.3, CUDA v2.1.0