
Added kronecker product support for dense matrices #2177

Merged: 3 commits merged into JuliaGPU:master on Dec 21, 2023

Conversation

albertomercurio (Contributor)

No description provided.

@albertomercurio (Contributor, Author)

Is anyone able to review this?

@maleadt (Member) left a comment

Thanks, looks like a good kernel implementation. I added a few comments.

It probably would have been nice to implement this in GPUArrays instead, so that other back-ends also get to benefit, but a CUDA.jl-only implementation is fine for now.

Did you benchmark the implementation? Computing the Kronecker product is very much memory bound, so it should be possible to get close to peak bandwidth, assuming the memory accesses are ordered correctly (maximizing coalescing, etc.). I didn't pay close attention to that part; it would probably be useful to run it under NSight Compute for some insights.

Review threads on lib/cublas/linalg.jl (resolved)
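
For context, the sketch below shows one way a dense Kronecker-product kernel can be written in CUDA.jl with coalesced stores, the access-ordering concern raised above. It is illustrative only: the kernel name, launch configuration, and the matrix sizes are assumptions, not the PR's actual implementation.

```julia
using CUDA

# Illustrative kernel (hypothetical, not necessarily the PR's version):
# one thread per element of C = kron(A, B), where A is m×n, B is p×q,
# and C is (m*p)×(n*q). Decomposing the column-major linear index `idx`
# into indices of A and B means consecutive threads write consecutive
# elements of C, i.e. the stores are coalesced.
function _kron_kernel!(C, A, B)
    m = size(A, 1)
    p, q = size(B)
    idx = (blockIdx().x - 1) * blockDim().x + threadIdx().x
    if idx <= length(C)
        i = (idx - 1) % (m * p) + 1      # row of C
        j = (idx - 1) ÷ (m * p) + 1      # column of C
        iA, iB = (i - 1) ÷ p + 1, (i - 1) % p + 1
        jA, jB = (j - 1) ÷ q + 1, (j - 1) % q + 1
        @inbounds C[idx] = A[iA, jA] * B[iB, jB]
    end
    return nothing
end

# Sizes here are just for illustration.
A = CUDA.rand(Float32, 50, 80)
B = CUDA.rand(Float32, 64, 64)
C = CuArray{Float32}(undef, size(A, 1) * size(B, 1), size(A, 2) * size(B, 2))
threads = 256
@cuda threads=threads blocks=cld(length(C), threads) _kron_kernel!(C, A, B)
```
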
@albertomercurio (Contributor, Author)

The benchmark shows the computation running about four times faster on my NVIDIA 4090 than on my CPU (an AMD Ryzen Threadripper PRO 3975WX). I used the same matrices as in the tests.

kron!(C, A, B)

@benchmark kron!($C, $A, $B)

BenchmarkTools.Trial: 275 samples with 6 evaluations.
 Range (min … max):  5.480 μs … 3.549 ms  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     3.520 ms             ┊ GC (median):    0.00%
 Time  (mean ± σ):   3.030 ms ± 1.216 ms  ┊ GC (mean ± σ):  0.00% ± 0.00%

  ▃                                                       █  
  █▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▄▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁█ ▅
  5.48 μs     Histogram: log(frequency) by time     3.52 ms <

 Memory estimate: 1.58 KiB, allocs estimate: 26.

I was hoping for better results, but perhaps that is because of the relatively small dimensions of A (50 × 80 = 4000 elements).
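
For reference, a sketch of how such a benchmark can be set up (A's 50×80 size is as stated above; B's size is a guess, and `CUDA.@sync` is added so the timing waits for the asynchronous kernel rather than only its launch):

```julia
using CUDA, BenchmarkTools, LinearAlgebra

A = CUDA.rand(Float32, 50, 80)      # 50×80, as above
B = CUDA.rand(Float32, 64, 64)      # hypothetical size for B
C = CuArray{Float32}(undef, size(A, 1) * size(B, 1), size(A, 2) * size(B, 2))

# Synchronize inside the benchmark so the measured time includes
# the kernel's execution on the device, not just the launch.
@benchmark CUDA.@sync kron!($C, $A, $B)
```
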

@maleadt (Member) commented Dec 4, 2023

> I was hoping for better results

Yeah, those are pretty small inputs. Try CUDA.@profile or CUDA.@bprofile for some more insights; it'll likely show that the actual kernel execution is much faster.
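
For example, with the same `C`, `A`, and `B` as in the benchmark above:

```julia
CUDA.@profile kron!(C, A, B)     # trace a single call
CUDA.@bprofile kron!(C, A, B)    # run the call repeatedly to gather statistics
```
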

@albertomercurio (Contributor, Author)

With CUDA.@profile:

Profiler ran for 2.33 ms, capturing 6 events.

Host-side activity: calling CUDA APIs took 208.85 µs (8.95% of the trace)
┌──────────┬────────────┬───────┬────────────────┐
│ Time (%) │ Total time │ Calls │ Name           │
├──────────┼────────────┼───────┼────────────────┤
│    1.64% │   38.15 µs │     1 │ cuLaunchKernel │
└──────────┴────────────┴───────┴────────────────┘

Device-side activity: GPU was busy for 2.07 ms (88.76% of the trace)
┌──────────┬────────────┬───────┬──────────────────────────────────────────────────────────────────────────────────────────────────────┐
│ Time (%) │ Total time │ Calls │ Name                                                                                                 │
├──────────┼────────────┼───────┼──────────────────────────────────────────────────────────────────────────────────────────────────────┤
│   88.76% │    2.07 ms │     1 │ _Z18_kron_mat_kernelA_13CuDeviceArrayI7Float32Li2ELi1EES_IS0_Li2ELi1EES_IS0_Li2ELi1EE5Int64S1_S1_S1_ │
└──────────┴────────────┴───────┴──────────────────────────────────────────────────────────────────────────────────────────────────────┘

With CUDA.@bprofile:

Profiler ran for 1.0 s, capturing 3494 events.

Host-side activity: calling CUDA APIs took 998.05 ms (99.77% of the trace)
┌──────────┬────────────┬───────┬──────────────────────────────────────┬──────────────────┐
│ Time (%) │ Total time │ Calls │ Time distribution                    │ Name             │
├──────────┼────────────┼───────┼──────────────────────────────────────┼──────────────────┤
│   99.49% │   995.3 ms │   499 │   1.99 ms ± 0.05   (  1.96 ‥ 2.08)   │ cuCtxSynchronize │
│    0.21% │    2.05 ms │   499 │   4.11 µs ± 13.02  (  2.62 ‥ 224.59) │ cuLaunchKernel   │
└──────────┴────────────┴───────┴──────────────────────────────────────┴──────────────────┘

Device-side activity: GPU was busy for 994.16 ms (99.38% of the trace)
┌──────────┬────────────┬───────┬────────────────────────────────────┬──────────────────────────────────────────────────────────────────────────────────────────────────────┐
│ Time (%) │ Total time │ Calls │ Time distribution                  │ Name                                                                                                 │
├──────────┼────────────┼───────┼────────────────────────────────────┼──────────────────────────────────────────────────────────────────────────────────────────────────────┤
│   99.38% │  994.16 ms │   499 │   1.99 ms ± 0.05   (  1.96 ‥ 2.07) │ _Z18_kron_mat_kernelA_13CuDeviceArrayI7Float32Li2ELi1EES_IS0_Li2ELi1EES_IS0_Li2ELi1EE5Int64S1_S1_S1_ │
└──────────┴────────────┴───────┴────────────────────────────────────┴──────────────────────────────────────────────────────────────────────────────────────────────────────┘

NVTX ranges:
┌──────────┬────────────┬───────┬───────────────────────────────────┬─────────────────────┐
│ Time (%) │ Total time │ Calls │ Time distribution                 │ Name                │
├──────────┼────────────┼───────┼───────────────────────────────────┼─────────────────────┤
│   99.98% │      1.0 s │   499 │    2.0 ms ± 0.05   (  1.97 ‥ 2.3) │ @bprofile.iteration │
└──────────┴────────────┴───────┴───────────────────────────────────┴─────────────────────┘

@maleadt (Member) commented Dec 11, 2023

Ah interesting, so it is kernel bound. It would probably be good to run under NSight Compute then, to see if we aren't missing anything (or whether we're just properly memory bound here).
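
A sketch of one way to do that: put the call in a small script (the file name and B's size below are made up) and launch Julia under the Nsight Compute CLI, for example `ncu --target-processes all julia kron_bench.jl`; Nsight Compute then reports per-kernel metrics such as achieved memory throughput and occupancy.

```julia
# kron_bench.jl: hypothetical script to be profiled under Nsight Compute
using CUDA, LinearAlgebra

A = CUDA.rand(Float32, 50, 80)
B = CUDA.rand(Float32, 64, 64)
C = CuArray{Float32}(undef, size(A, 1) * size(B, 1), size(A, 2) * size(B, 2))

kron!(C, A, B)               # first call triggers compilation
CUDA.@sync kron!(C, A, B)    # the launch to inspect in the profiler
```
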

@maleadt (Member) commented Dec 21, 2023

Sorry, I won't have the time to optimize this, so I guess we can merge this as it is.

@maleadt merged commit ed30f60 into JuliaGPU:master on Dec 21, 2023. 1 check passed.