
Reshape of a view breaks batched_mul on GPU (scalar indexing happens) #466

Open
reachtarunhere opened this issue Jan 26, 2023 · 3 comments

@reachtarunhere

Useful Info:

NNlib v0.8.16

I actually need this to implement efficient multi-head attention. A minimal example to reproduce the problem is below:

using Flux # to get batched_mul / ⊠

using CUDA
CUDA.allowscalar(false)

A3 = rand(3, 3, 8) |> gpu
v3 = rand(4, 2, 3, 8) |> gpu
v4, v5 = eachslice(v3, dims=2)  # each slice is a 4×3×8 view into v3
v6 = reshape(v4, size(v4)...)   # same size, but now a ReshapedArray wrapping the view
v4 ⊠ A3 # works
v6 ⊠ A3 # breaks throwing a scalar indexing error

Error for reference:

Scalar indexing is disallowed.
Invocation of getindex resulted in scalar indexing of a GPU array.
This is typically caused by calling an iterating implementation of a method.
Such implementations *do not* execute on the GPU, but very slowly on the CPU,
and therefore are only permitted from the REPL for prototyping purposes.
If you did intend to index this array, annotate the caller with @allowscalar.

error(s::String) at error.jl:33
assertscalar(op::String) at GPUArraysCore.jl:100
getindex(::CUDA.CuArray{Float32, 4, CUDA.Mem.DeviceBuffer}, ::Int64, ::Int64, ::Int64, ::Int64) at indexing.jl:9
getindex at subarray.jl:276 [inlined]
_unsafe_getindex_rs at reshapedarray.jl:249 [inlined]
_unsafe_getindex at reshapedarray.jl:246 [inlined]
getindex at reshapedarray.jl:234 [inlined]
getindex at subarray.jl:276 [inlined]
_generic_matmatmul!(C::CUDA.CuArray{Float32, 2, CUDA.Mem.DeviceBuffer}, tA::Char, tB::Char, A::SubArray{Float32, 2, Base.ReshapedArray{Float32, 3, SubArray{Float32, 3, CUDA.CuArray{Float32, 4, CUDA.Mem.DeviceBuffer}, Tuple{Base.Slice{Base.OneTo{Int64}}, Int64, Base.Slice{Base.OneTo{Int64}}, Base.Slice{Base.OneTo{Int64}}}, false}, Tuple{Base.MultiplicativeInverses.SignedMultiplicativeInverse{Int64}, Base.MultiplicativeInverses.SignedMultiplicativeInverse{Int64}}}, Tuple{Base.Slice{Base.OneTo{Int64}}, Base.Slice{Base.OneTo{Int64}}, Int64}, false}, B::CUDA.CuArray{Float32, 2, CUDA.Mem.DeviceBuffer}, _add::LinearAlgebra.MulAddMul{true, true, Float32, Float32}) at matmul.jl:830
generic_matmatmul!(C::CUDA.CuArray{Float32, 2, CUDA.Mem.DeviceBuffer}, tA::Char, tB::Char, A::SubArray{Float32, 2, Base.ReshapedArray{Float32, 3, SubArray{Float32, 3, CUDA.CuArray{Float32, 4, CUDA.Mem.DeviceBuffer}, Tuple{Base.Slice{Base.OneTo{Int64}}, Int64, Base.Slice{Base.OneTo{Int64}}, Base.Slice{Base.OneTo{Int64}}}, false}, Tuple{Base.MultiplicativeInverses.SignedMultiplicativeInverse{Int64}, Base.MultiplicativeInverses.SignedMultiplicativeInverse{Int64}}}, Tuple{Base.Slice{Base.OneTo{Int64}}, Base.Slice{Base.OneTo{Int64}}, Int64}, false}, B::CUDA.CuArray{Float32, 2, CUDA.Mem.DeviceBuffer}, _add::LinearAlgebra.MulAddMul{true, true, Float32, Float32}) at matmul.jl:798
mul! at matmul.jl:302 [inlined]
batched_mul_generic!(C::CUDA.CuArray{Float32, 3, CUDA.Mem.DeviceBuffer}, A::Base.ReshapedArray{Float32, 3, SubArray{Float32, 3, CUDA.CuArray{Float32, 4, CUDA.Mem.DeviceBuffer}, Tuple{Base.Slice{Base.OneTo{Int64}}, Int64, Base.Slice{Base.OneTo{Int64}}, Base.Slice{Base.OneTo{Int64}}}, false}, Tuple{Base.MultiplicativeInverses.SignedMultiplicativeInverse{Int64}, Base.MultiplicativeInverses.SignedMultiplicativeInverse{Int64}}}, B::CUDA.CuArray{Float32, 3, CUDA.Mem.DeviceBuffer}, α::Float32, β::Float32) at batchedmul.jl:274
_batched_try_gemm! at batchedmul.jl:219 [inlined]
_batched_mul! at batchedmul.jl:211 [inlined]
batched_mul! at batchedmul.jl:205 [inlined]
batched_mul! at batchedmul.jl:205 [inlined]
_batched_mul at batchedmul.jl:61 [inlined]
batched_mul(A::Base.ReshapedArray{Float32, 3, SubArray{Float32, 3, CUDA.CuArray{Float32, 4, CUDA.Mem.DeviceBuffer}, Tuple{Base.Slice{Base.OneTo{Int64}}, Int64, Base.Slice{Base.OneTo{Int64}}, Base.Slice{Base.OneTo{Int64}}}, false}, Tuple{Base.MultiplicativeInverses.SignedMultiplicativeInverse{Int64}, Base.MultiplicativeInverses.SignedMultiplicativeInverse{Int64}}}, B::CUDA.CuArray{Float32, 3, CUDA.Mem.DeviceBuffer}) at batchedmul.jl:48
top-level scope at model.jl:135
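
A possible workaround, sketched below under the assumption that an extra copy is acceptable: materialize the slice into a plain CuArray before reshaping, so the reshape also returns a CuArray and batched_mul stays off the generic fallback.

# Workaround sketch (assumes an extra copy is acceptable): copy the view first,
# so the subsequent reshape returns a plain CuArray instead of a ReshapedArray
# wrapping a SubArray, and batched_mul can dispatch to CUBLAS.
v4c = copy(v4)                    # same contents as v4 (equivalently v3[:, 1, :, :])
v6c = reshape(v4c, size(v4c)...)  # still a CuArray, not a lazy wrapper
v6c ⊠ A3                          # runs on the GPU without scalar indexing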
@mcabbott
Member

I think it's taking the generic path here:

are_strided(_unbatch(A), _unbatch(B)) || return batched_mul_generic!(C, A, B, α, β)

because is_strided on this array says it's not strided:

julia> summary(v6)
"4×3×8 reshape(view(::CuArray{Float32, 4, CUDA.Mem.DeviceBuffer}, :, 1, :, :), 4, 3, 8) with eltype Float32"

julia> NNlib.is_strided(v6)  # NNlib's test for whether to use BLAS
false

julia> strides(v6)  # maybe this is OK?
(1, 8, 24)

julia> strides(copy(v6))
(1, 4, 12)

That test is intended to match what CUBLAS accepts here:

https://github.com/FluxML/NNlibCUDA.jl/blob/master/src/batchedmul.jl#L3-L4

and in this case, it doesn't seem to accept v6:

julia> CUDA.CUBLAS.gemm_strided_batched!('N', 'N', true, v4, A3, false, v4 ⊠ A3) |> summary
"4×3×8 CuArray{Float32, 3, CUDA.Mem.DeviceBuffer}"

julia> CUDA.CUBLAS.gemm_strided_batched!('N', 'N', true, v6, A3, false, v4 ⊠ A3) |> summary
ERROR: conversion to pointer not defined for Base.ReshapedArray{Float32, 3, SubArray{Float32, 3, CuArray{Float32, 4, CUDA.Mem.DeviceBuffer}, Tuple{Base.Slice{Base.OneTo{Int64}}, Int64, Base.Slice{Base.OneTo{Int64}}, Base.Slice{Base.OneTo{Int64}}}, false}, Tuple{Base.MultiplicativeInverses.SignedMultiplicativeInverse{Int64}, Base.MultiplicativeInverses.SignedMultiplicativeInverse{Int64}}}
Stacktrace:
 [1] error(s::String)
   @ Base ./error.jl:35
 [2] unsafe_convert(#unused#::Type{CuPtr{Float32}}, a::Base.ReshapedArray{Float32, 3, SubArray{Float32, 3, CuArray{Float32, 4, CUDA.Mem.DeviceBuffer}, Tuple{Base.Slice{Base.OneTo{Int64}}, Int64, Base.Slice{Base.OneTo{Int64}}, Base.Slice{Base.OneTo{Int64}}}, false}, Tuple{Base.MultiplicativeInverses.SignedMultiplicativeInverse{Int64}, Base.MultiplicativeInverses.SignedMultiplicativeInverse{Int64}}})
   @ CUDA ~/.julia/packages/CUDA/Ey3w2/src/pointer.jl:64

It might be possible to widen this... a very quick hack seems to run:

julia> Base.unsafe_convert(::Type{CuPtr{Float32}}, a::typeof(v6)) = Base.unsafe_convert(CuPtr{Float32}, parent(parent(a)))

julia> CUDA.CUBLAS.gemm_strided_batched!('N', 'N', true, v6, A3, false, v4 ⊠ A3) |> summary
"4×3×8 CuArray{Float32, 3, CUDA.Mem.DeviceBuffer}"

But I have not checked correctness, and in reality the method would need to work only for reshapes which are in fact sufficiently simple. After updating CUDA to allow this, the test used by NNlib could be widened to match.
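
For reference, a hypothetical sketch of what a widened method might look like; the signature and the offset computation are my guesses and untested, and a real method would also have to verify that the strides reported by the reshape describe the actual memory layout:

# Hypothetical sketch only (untested): forward pointer conversion through a
# reshape of a view of a CuArray, shifting the parent's pointer to the view's
# first element. Correctness still depends on strides(a) matching the real layout.
function Base.unsafe_convert(::Type{CuPtr{T}},
        a::Base.ReshapedArray{T, N, <:SubArray{T, M, <:CuArray{T}}}) where {T, N, M}
    v = parent(a)  # the SubArray
    p = parent(v)  # the underlying CuArray
    # 0-based linear offset of the view's first element inside the parent buffer
    offset = LinearIndices(p)[first.(parentindices(v))...] - 1
    return Base.unsafe_convert(CuPtr{T}, p) + offset * sizeof(T)
end

The result of the gemm_strided_batched! call could then be compared against copy(v4) ⊠ A3 (v6 is a same-size reshape of v4, so the contents agree) to check correctness.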

@ToucheSir
Member

I think CUDA.jl is a little gun-shy about accommodating too many wrappers after JuliaGPU/CUDA.jl#453. Worth a shot though.

@mcabbott
Member

This multi-wrapper case is a little easier, in that creating a pointer must either succeed or fail. CUDA certainly owns unsafe_convert(::Type{CuPtr{Float32}}, a::Any).

The worse case for multiple wrappers is where Base has fun(A::AbstractArray, B::AbstractArray), and trying to make dispatch catch things like fun(A::ReshapedArray{... CuArray}, B::Adjoint{..., CuArray}) leads to an explosion of methods. NNlib's dispatch on storage_type here is an attempt to avoid that (a possibly too-complicated one).
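
For illustration, the idea is roughly the following (a paraphrase with a made-up name, not NNlib's exact code): strip wrappers with parent until the type stops changing, then dispatch on the innermost array type rather than on the combination of wrappers.

# Rough paraphrase of the storage_type idea (hypothetical name, not NNlib's code):
# unwrap with `parent` until the type stops changing, so dispatch can look at the
# array that actually owns the memory instead of at the stack of wrappers.
unwrap_storage_type(A::AbstractArray) =
    typeof(parent(A)) === typeof(A) ? typeof(A) : unwrap_storage_type(parent(A))

# e.g. unwrap_storage_type(v6) is a CuArray type, even though v6 is a
# ReshapedArray wrapping a SubArray wrapping a CuArray.

Combined with the are_strided check above, this lets batched_mul pick the BLAS / CUBLAS path per storage type instead of per wrapper combination.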

@mcabbott added the CUDA label on Mar 14, 2024