
CuSparseArrayCSR (N dim array) with batched matmatmul (bmm) #1944

Merged (25 commits), Jan 17, 2024

Conversation


@nikopj nikopj commented Jun 10, 2023

I've implemented a small working example of a 3-dimensional sparse array, CuSparseArrayCSR, which can be thought of as multiple CuSparseMatrixCSR matrices stacked along a 3rd (batch) dimension. The only restriction I'm aware of is that the number of non-zeros of each matrix slice (batch element) must be the same. The benefit of this representation is that we can more easily make use of CUDA's batched sparse matmuls, e.g. Ci = Ai * Bi, which I've implemented for sparse * dense batched matmul in lib/cusparse/generic.jl (see bmm!). Example uses are in test/cusparse/bmm.jl.
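To make the semantics concrete, here is a plain-Julia CPU reference for the batched product Ci = Ai * Bi under the same storage convention (one rowPtr/colVal/nzVal column per batch slice, shared nnz count). The name bmm_ref is illustrative only, not part of the PR's API:

```julia
# CPU reference for batched sparse (CSR) * dense matmul, Cᵢ = Aᵢ * Bᵢ.
# rowPtr is (m+1, b), colVal and nzVal are (nnz, b), B is (k, n, b);
# every batch slice shares the same nnz count, 1-based indices.
function bmm_ref(rowPtr::Matrix{Int}, colVal::Matrix{Int}, nzVal::Matrix{T},
                 B::Array{T,3}) where {T}
    m = size(rowPtr, 1) - 1
    k, n, b = size(B)
    C = zeros(T, m, n, b)
    for i in 1:b, row in 1:m
        # nonzeros of row `row` in batch slice `i`
        for p in rowPtr[row, i]:(rowPtr[row+1, i] - 1)
            col = colVal[p, i]
            for j in 1:n
                C[row, j, i] += nzVal[p, i] * B[col, j, i]
            end
        end
    end
    return C
end
```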

I followed the cuSPARSE docs and the NVIDIA sample code for batched SpMM. Based on the docs, I think this can be extended to the CSC and COO representations, as well as to other matmul cases. First, I'd like to get some feedback to see if this implementation is sensible.

This would be helpful for some neural-network training cases. Similar capabilities are available in PyTorch, though I think their implementation is more restrictive in that it does not allow different sparsity patterns between batch elements.

Thanks,
Nikola

@maleadt added the "enhancement" and "cuda array" labels on Jun 13, 2023

maleadt commented Jun 13, 2023

Interesting! cc @amontoison

The only restriction I'm aware of is that the number of non-zeros of each matrix slice (batch element) must be the same. The benefit of this representation is that we can more easily make use of CUDA's batched sparse matmuls, e.g. Ci = Ai * Bi, which I've implemented for sparse * dense batched matmul in lib/cusparse/generic.jl (see bmm!).

If the representation isn't a fully generic 3D SparseArray, and the main use case is batched operations, why not a hypothetical Batched{CuSparseMatrix}? I'm concerned that users may think they can do more with a CuSparseArray type than they actually can.


nikopj commented Jun 13, 2023

If the representation isn't a fully generic 3D SparseArray, and the main use case is batched operations, why not a hypothetical Batched{CuSparseMatrix}? I'm concerned that users may think they can do more with a CuSparseArray type than they actually can.

Sure, that makes sense to me. Let me know if I'm understanding you correctly with the sketch below:

mutable struct Batched{A<:CuSparseMatrixCSR, Tv, Ti} <: AbstractCuSparseArray{Tv, Ti, 3}
    rowPtr::CuMatrix{Ti}
    colVal::CuMatrix{Ti}
    nzVal::CuMatrix{Tv}
    dims::NTuple{3,Int}
    nnz::Ti

    function Batched{A, Tv, Ti}(rowPtr::CuMatrix{<:Integer}, colVal::CuMatrix{<:Integer},
                                nzVal::CuMatrix, dims::NTuple{3,<:Integer}) where {A<:CuSparseMatrixCSR, Tv, Ti<:Integer}
        new{A, Tv, Ti}(rowPtr, colVal, nzVal, dims, length(nzVal))
    end
end


amontoison commented Jun 19, 2023

@nikopj Nice!
I like your proposal of the Batched structure.
We should be able to use it with different sparse matrix formats.

@nikopj nikopj changed the title CuSparseArrayCSR (3 dim array) with batched matmatmul (bmm) CuSparseArrayCSR (N dim array) with batched matmatmul (bmm) Nov 10, 2023

nikopj commented Nov 10, 2023

Getting this going again.

I've made things more general by allowing the "batch" dimension of CuSparseArrayCSR{Tv,Ti,N} to span several dimensions (N-2 batch dims). I'm motivated to do this by some deep-learning sparse-attention use cases, where we might have different sparse attention matrices per mini-batch element, and each mini-batch element might want to make use of several separate attention matrices (e.g. multi-head self-attention).

So N=3 is the same case as before, but N>3 is now also possible. I've made it such that batched sparse-dense matmul also works for N>3 when the sizes make sense. See the end of test/libraries/cusparse/bmm.jl for an example with a size-(m,n,2,3) CuSparseArrayCSR.
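The N>3 semantics can be illustrated with a dense CPU sketch: all trailing dimensions beyond the first two are treated as batch dims, and each batch slice is multiplied independently. The name batched_mul_ref is hypothetical; the actual kernel works on the CSR fields directly:

```julia
# Dense reference for the N=4 case: trailing dims are batch dims,
# e.g. (m, k, heads, batch) * (k, n, heads, batch) -> (m, n, heads, batch).
function batched_mul_ref(A::Array{T,4}, B::Array{T,4}) where {T}
    m, k, h, b = size(A)
    k2, n, h2, b2 = size(B)
    @assert (k, h, b) == (k2, h2, b2) "batch and inner dims must match"
    C = similar(A, m, n, h, b)
    for j in 1:b, i in 1:h
        # each (head, batch) slice is an independent matmul
        C[:, :, i, j] = A[:, :, i, j] * B[:, :, i, j]
    end
    return C
end
```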

Because of the extended dimensions, I'm in favor of keeping the original naming convention. It also makes sense to keep things specific to the sparsity type, as that dictates the fields of the struct. If we end up implementing a similar type for CSC, COO, etc., we could add a "Batched" union type over all of them.

The printing / showing / indexing of CuSparseArrayCSR is also working better, making testing in the REPL easier.

Base.cat and Base.reshape are also working for sensible arguments.

@nikopj nikopj marked this pull request as ready for review November 10, 2023 03:56

maleadt commented Jan 2, 2024

CI failures look related.


nikopj commented Jan 3, 2024

@maleadt

CI failures look related.

It's failing on CUDA 11.4 and 11.5, plus the Julia nightly build (the latter unrelated to cuSPARSE).
I'm not sure where the error lies in these versions, as the API documentation says they can handle batched sparse-dense multiplication.

A quick fix would be to only define bmm! for versions 11.6 and higher.


maleadt commented Jan 4, 2024

A quick fix would be to only define bmm! for versions 11.6 and higher.

At the very least, yes. You could make the tests check for the cuSPARSE version (see the start of the CI logs), and maybe even add an error to the function itself.

The nightly issue is unrelated indeed.

@maleadt maleadt merged commit 88ebe50 into JuliaGPU:master Jan 17, 2024
1 check passed

maleadt commented Jan 17, 2024

According to Aqua, this added a couple of ambiguities. Not sure why CI didn't spot those...

julia> ambs = Aqua.detect_ambiguities(CUDA; recursive=true)
 (kwcall(::NamedTuple, ::typeof(cat), As::CUDA.CUSPARSE.CuSparseMatrixCSR...) @ CUDA.CUSPARSE ~/Julia/pkg/CUDA/lib/cusparse/batched.jl:1, kwcall(::NamedTuple, ::typeof(cat), As::CUDA.CUSPARSE.CuSparseArrayCSR...) @ CUDA.CUSPARSE ~/Julia/pkg/CUDA/lib/cusparse/batched.jl:14)
 (cat(As::CUDA.CUSPARSE.CuSparseMatrixCSR...; dims) @ CUDA.CUSPARSE ~/Julia/pkg/CUDA/lib/cusparse/batched.jl:1, cat(As::CUDA.CUSPARSE.CuSparseArrayCSR...; dims) @ CUDA.CUSPARSE ~/Julia/pkg/CUDA/lib/cusparse/batched.jl:14)

Could you fix those?
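For context, the collision Aqua reports is the classic two-Vararg-method ambiguity: a zero-argument call matches both signatures. A toy reproduction with hypothetical functions (g shows the problem, h shows one standard fix, standing in for the two cat methods):

```julia
# Two Vararg methods over different element types:
g(xs::Int...) = 1
g(xs::String...) = 2
# g() matches both methods, so dispatch is ambiguous and throws.

# One standard fix: require a leading positional argument in each
# method, so no single call can match both signatures.
h(x::Int, xs::Int...) = 1
h(x::String, xs::String...) = 2
```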


nikopj commented Jan 17, 2024

OK, I'll take a look.
