
Further unification of CUBLAS.axpy! and LinearAlgebra.BLAS.axpy! #432

Closed
kbarros opened this issue Sep 16, 2020 · 8 comments · Fixed by #435
Labels
enhancement New feature or request

Comments

kbarros commented Sep 16, 2020

I'm interested in using axpy! with strides that differ from 1. Both BLAS and CUBLAS provide this functionality.

For CPU arrays, the method is defined in LinearAlgebra.BLAS.axpy! here

For CUDA arrays, the equivalent method is defined in CUBLAS.axpy! here

I'm wondering if the LinearAlgebra.BLAS.axpy! function should get methods for the CUBLAS.axpy! behavior along these lines:

for elty in (:Float64, :Float32, :ComplexF64, :ComplexF32)
    @eval begin
        function LinearAlgebra.BLAS.axpy!(n::Integer,
            alpha::($elty),
            dx::CuArray{$elty},
            incx::Integer,
            dy::CuArray{$elty},
            incy::Integer)

            CUBLAS.axpy!(n, alpha, dx, incx, dy, incy)
        end
    end
end
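For comparison, the stride semantics of the existing CPU method can be exercised directly on plain Arrays; a minimal sketch (values chosen purely for illustration):

```julia
using LinearAlgebra

# axpy!(n, alpha, x, incx, y, incy) computes
#   y[1 + (i-1)*incy] += alpha * x[1 + (i-1)*incx]  for i = 1:n.
# Here n = 2, incx = 1, incy = 2, so only y[1] and y[3] are updated.
x = [1.0, 2.0]
y = zeros(4)
LinearAlgebra.BLAS.axpy!(2, 2.0, x, 1, y, 2)
# y is now [2.0, 0.0, 4.0, 0.0]
```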
kbarros added the enhancement label on Sep 16, 2020

maleadt commented Sep 17, 2020

I've been wary of extending LinearAlgebra.BLAS methods; they aren't public API, and rather seem to be the low-level wrappers for the CPU BLAS libraries Julia ships. CUBLAS only implements a subset, and its behavior sometimes differs.

Instead, why not a method for LinearAlgebra.axpy! implemented using CUBLAS.axpy!?
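A minimal sketch of that suggestion, for the contiguous unit-stride case (a hypothetical method, not CUDA.jl's actual implementation; it assumes the `CUBLAS.axpy!(n, alpha, x, incx, y, incy)` wrapper quoted above and needs a GPU to run):

```julia
using LinearAlgebra, CUDA

# Hypothetical: dispatch the high-level LinearAlgebra.axpy! to cuBLAS for
# CuArrays of BLAS-supported element types, using unit strides over the
# full length of the arrays.
function LinearAlgebra.axpy!(alpha::Number, x::CuArray{T}, y::CuArray{T}) where
        {T<:Union{Float32,Float64,ComplexF32,ComplexF64}}
    length(x) == length(y) || throw(DimensionMismatch("arrays must have equal length"))
    CUBLAS.axpy!(length(x), T(alpha), x, 1, y, 1)
    return y
end
```

The strided-view case would additionally need to forward the view's element stride and base offset to cuBLAS.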


kbarros commented Sep 17, 2020

Makes sense, I understand the concern now.

My aim in using LinearAlgebra.BLAS.axpy!(n, alpha, dx, incx, dy, incy) is to specify strides incx and incy different from 1. Note that CUBLAS.axpy! provides an equivalent method.

Julia's LinearAlgebra module does not document such an axpy! method in its public API. This indicates that there's no "official" way to use axpy! with arbitrary strides. Curiously though, the stdlib does document LinearAlgebra.BLAS.scal(n, a, X, incx) with arbitrary stride incx here.

In any case, I agree this is a question for the Julia stdlib, rather than CUDA. Thank you.


maleadt commented Sep 18, 2020

What about just using broadcast on a strided array? It's (somewhat surprisingly) equally fast:

julia> X = CUDA.rand(1000,1000,1000);
julia> Y = similar(X);
julia> a = one(eltype(X));
julia> @benchmark CUDA.@sync blocking=false CUBLAS.axpy!($(length(Y)), $a, $X, 1, $Y, 1)
BenchmarkTools.Trial: 
  memory estimate:  224 bytes
  allocs estimate:  13
  --------------
  minimum time:     32.351 ms (0.00% GC)
  median time:      32.759 ms (0.00% GC)
  mean time:        32.773 ms (0.00% GC)
  maximum time:     34.986 ms (0.00% GC)
  --------------
  samples:          153
  evals/sample:     1
julia> @benchmark CUDA.@sync blocking=false $Y .= $X.*$a .+ $Y
BenchmarkTools.Trial: 
  memory estimate:  3.64 KiB
  allocs estimate:  67
  --------------
  minimum time:     32.273 ms (0.00% GC)
  median time:      32.678 ms (0.00% GC)
  mean time:        32.703 ms (0.00% GC)
  maximum time:     34.111 ms (0.00% GC)
  --------------
  samples:          153
  evals/sample:     1

Alternatively, we could implement LinearAlgebra.axpy! for a strided CuArray (without having to pass those explicitly) and have it dispatch to CUBLAS.


kbarros commented Sep 18, 2020

Thanks for your help with this. I think my confusion arose from the fact that LinearAlgebra.BLAS.axpy!(n, alpha, dx, incx, dy, incy) is not part of the public API, while many similar lower-level LinearAlgebra.BLAS routines are. Since that's a Julia stdlib question, I filed a separate issue.

kbarros closed this as completed on Sep 18, 2020

maleadt commented Sep 19, 2020

The issue you filed ignores my question though: why can't you use LinearAlgebra.axpy! with a StridedArray (which could then dispatch to any BLAS if possible)? No need for a strides argument.
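For reference, this already works on the CPU: LinearAlgebra.axpy! accepts a strided view, so no explicit stride arguments are needed. A small illustration on plain Arrays:

```julia
using LinearAlgebra

N = 4
Y = zeros(N, N)
X = ones(N)

# Increment the diagonal of Y through a strided view; the stride (N+1 in
# linear indexing) is carried by the view itself rather than an incy argument.
LinearAlgebra.axpy!(2.0, X, @view Y[1:N+1:N*N])
# Y is now 2.0 on the diagonal and 0.0 elsewhere
```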


kbarros commented Sep 20, 2020

Interesting idea. Here's what I've tried so far:

N = 1000
Y = CUDA.rand(N,N)
X = CUDA.rand(N)

# Different ways to increment the diagonal elements of matrix Y

# (v1) 118 allocations: 4.375 KiB
@time CUDA.@sync Y[1:N+1:N*N] .+= X

# (v2) 2 allocations: 32 bytes
@time CUDA.@sync CUBLAS.axpy!(N, 1f0, X, 1, Y, N+1)

# (v3) ArgumentError: cannot take the CPU address of a CuArray{Float32,1}
LinearAlgebra.axpy!(1f0, X, @view Y[1:N+1:N*N])

(v1) It's awesome that we can just write the naive broadcasting code, and it runs fast!

(v2) I prefer the axpy!(n, alpha, dx, incx, dy, incy) variant because it allocates less, and is closest to the actual axpy routine in BLAS.

(v3) The final version with strided views works with CPU Arrays, but not currently with CuArrays. Is this what you're proposing to add to CUDA.jl? That would be cool if it's not a maintenance burden!

For my project, this feature isn't a blocker, obviously. But resolving the associated stdlib issue may eliminate a possible source of confusion.


kbarros commented Sep 20, 2020

I am reopening because, on further reflection, I think @maleadt's point is correct: ideally, LinearAlgebra.axpy! would accept strided views of CuArrays. This also seems to be the intention behind the design of Julia's stdlib.


maleadt commented Sep 21, 2020

See #435.
