
Further unification of CUBLAS.axpy! and LinearAlgebra.BLAS.axpy! #432

Closed
kbarros opened this issue Sep 16, 2020 · 8 comments · Fixed by #435
Labels
enhancement New feature or request

Comments

kbarros commented Sep 16, 2020

I'm interested in using axpy! with strides that differ from 1. Both BLAS and CUBLAS provide this functionality.

For CPU arrays, the method is defined in LinearAlgebra.BLAS.axpy! here

For CUDA arrays, the equivalent method is defined in CUBLAS.axpy! here

I'm wondering if the LinearAlgebra.BLAS.axpy! function should get methods for the CUBLAS.axpy! behavior along these lines:

for elty in (:Float64, :Float32, :ComplexF64, :ComplexF32)
    @eval begin
        function LinearAlgebra.BLAS.axpy!(n::Integer,
            alpha::($elty),
            dx::CuArray{$elty},
            incx::Integer,
            dy::CuArray{$elty},
            incy::Integer)

            CUBLAS.axpy!(n, alpha, dx, incx, dy, incy)
        end
    end
end
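For comparison, the stride semantics of the existing CPU method can be exercised directly on plain Arrays; a minimal sketch (values chosen purely for illustration):

```julia
using LinearAlgebra

# axpy!(n, alpha, x, incx, y, incy) computes
#   y[1 + (i-1)*incy] += alpha * x[1 + (i-1)*incx]  for i = 1:n.
# Here n = 2, incx = 1, incy = 2, so only y[1] and y[3] are updated.
x = [1.0, 2.0]
y = zeros(4)
LinearAlgebra.BLAS.axpy!(2, 2.0, x, 1, y, 2)
# y is now [2.0, 0.0, 4.0, 0.0]
```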
kbarros added the enhancement label on Sep 16, 2020

maleadt commented Sep 17, 2020

I've been wary of extending LinearAlgebra.BLAS methods; they aren't public API, and rather seem to be the low-level wrappers for the CPU BLAS libraries Julia ships. CUBLAS only implements a subset, and its behavior sometimes differs.

Instead, why not a method for LinearAlgebra.axpy! implemented using CUBLAS.axpy!?
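A minimal sketch of that suggestion, for the contiguous unit-stride case (a hypothetical method, not CUDA.jl's actual implementation; it assumes the `CUBLAS.axpy!(n, alpha, x, incx, y, incy)` wrapper quoted above and needs a GPU to run):

```julia
using LinearAlgebra, CUDA

# Hypothetical: dispatch the high-level LinearAlgebra.axpy! to cuBLAS for
# CuArrays of BLAS-supported element types, using unit strides over the
# full length of the arrays.
function LinearAlgebra.axpy!(alpha::Number, x::CuArray{T}, y::CuArray{T}) where
        {T<:Union{Float32,Float64,ComplexF32,ComplexF64}}
    length(x) == length(y) || throw(DimensionMismatch("arrays must have equal length"))
    CUBLAS.axpy!(length(x), T(alpha), x, 1, y, 1)
    return y
end
```

The strided-view case would additionally need to forward the view's element stride and base offset to cuBLAS.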


kbarros commented Sep 17, 2020

Makes sense, I understand the concern now.

My aim in using LinearAlgebra.BLAS.axpy!(n, alpha, dx, incx, dy, incy) is to specify strides incx and incy different from 1. Note that CUBLAS.axpy! provides an equivalent method.

Julia's LinearAlgebra module does not document such an axpy! method in its public API. This indicates that there's no "official" way to use axpy! with arbitrary strides. Curiously though, the stdlib does document LinearAlgebra.BLAS.scal(n, a, X, incx) with arbitrary stride incx here.

In any case, I agree this is a question for the Julia stdlib, rather than CUDA. Thank you.


maleadt commented Sep 18, 2020

What about just using broadcast on a strided array? It's (somewhat surprisingly) equally fast:

julia> X = CUDA.rand(1000,1000,1000);
julia> Y = similar(X);
julia> a = one(eltype(X));
julia> @benchmark CUDA.@sync blocking=false CUBLAS.axpy!($(length(Y)), $a, $X, 1, $Y, 1)
BenchmarkTools.Trial: 
  memory estimate:  224 bytes
  allocs estimate:  13
  --------------
  minimum time:     32.351 ms (0.00% GC)
  median time:      32.759 ms (0.00% GC)
  mean time:        32.773 ms (0.00% GC)
  maximum time:     34.986 ms (0.00% GC)
  --------------
  samples:          153
  evals/sample:     1
julia> @benchmark CUDA.@sync blocking=false $Y .= $X.*$a .+ $Y
BenchmarkTools.Trial: 
  memory estimate:  3.64 KiB
  allocs estimate:  67
  --------------
  minimum time:     32.273 ms (0.00% GC)
  median time:      32.678 ms (0.00% GC)
  mean time:        32.703 ms (0.00% GC)
  maximum time:     34.111 ms (0.00% GC)
  --------------
  samples:          153
  evals/sample:     1

Alternatively, we could implement LinearAlgebra.axpy! for a strided CuArray (without having to pass those explicitly) and have it dispatch to CUBLAS.


kbarros commented Sep 18, 2020

Thanks for your help with this. I think my confusion arose from the fact that LinearAlgebra.BLAS.axpy!(n, alpha, dx, incx, dy, incy) is not part of the public API, while many similar lower-level LinearAlgebra.BLAS routines are. Since that's a Julia stdlib question, I filed a separate issue.

kbarros closed this as completed on Sep 18, 2020

maleadt commented Sep 19, 2020

The issue you filed ignores my question though: why can't you use LinearAlgebra.axpy! with a StridedArray (which could then dispatch to any BLAS if possible)? No need for a strides argument.
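For reference, this already works on the CPU: LinearAlgebra.axpy! accepts a strided view, so no explicit stride arguments are needed. A small illustration on plain Arrays:

```julia
using LinearAlgebra

N = 4
Y = zeros(N, N)
X = ones(N)

# Increment the diagonal of Y through a strided view; the stride (N+1 in
# linear indexing) is carried by the view itself rather than an incy argument.
LinearAlgebra.axpy!(2.0, X, @view Y[1:N+1:N*N])
# Y is now 2.0 on the diagonal and 0.0 elsewhere
```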


kbarros commented Sep 20, 2020

Interesting idea. Here's what I've tried so far:

N = 1000
Y = CUDA.rand(N,N)
X = CUDA.rand(N)

# Different ways to increment the diagonal elements of matrix Y

# (v1) 118 allocations: 4.375 KiB
@time CUDA.@sync Y[1:N+1:N*N] .+= X

# (v2) 2 allocations: 32 bytes
@time CUDA.@sync CUBLAS.axpy!(N, 1f0, X, 1, Y, N+1)

# (v3) ArgumentError: cannot take the CPU address of a CuArray{Float32,1}
LinearAlgebra.axpy!(1f0, X, @view Y[1:N+1:N*N])

(v1) It's awesome that we can just write the naive broadcasting code, and it runs fast!

(v2) I prefer the axpy!(n, alpha, dx, incx, dy, incy) variant because it allocates less, and is closest to the actual axpy routine in BLAS.

(v3) The final version with strided views works with CPU Arrays, but not currently with CuArrays. Is this what you're proposing to add to CUDA.jl? That would be cool if it's not a maintenance burden!

For my project, this feature isn't a blocker, obviously. But resolving the associated stdlib issue may eliminate a possible source of confusion.


kbarros commented Sep 20, 2020

I am reopening because, on further reflection, I think @maleadt's point is correct: ideally, LinearAlgebra.axpy! would accept strided views of CuArrays. This also seems to be the intention behind the design of Julia's stdlib.


maleadt commented Sep 21, 2020

See #435.
