
Reshape of a view breaks batched_mul on GPU (scalar indexing happens) #466

Open
reachtarunhere opened this issue Jan 26, 2023 · 3 comments

@reachtarunhere

Useful Info:

NNlib v0.8.16

I actually need this to implement efficient multi-head attention. A minimal example to reproduce the problem is below:

using Flux # to get batched_mul / ⊠

using CUDA
CUDA.allowscalar(false)

A3 = rand(3, 3, 8) |> gpu
v3 = rand(4, 2, 3, 8) |> gpu
v4, v5 = eachslice(v3, dims=2)  # each slice is a 4×3×8 view into v3
v6 = reshape(v4, size(v4)...)   # same size, but now a ReshapedArray wrapping the view
v4 ⊠ A3 # works
v6 ⊠ A3 # breaks throwing a scalar indexing error

Error for reference:

Scalar indexing is disallowed.
Invocation of getindex resulted in scalar indexing of a GPU array.
This is typically caused by calling an iterating implementation of a method.
Such implementations *do not* execute on the GPU, but very slowly on the CPU,
and therefore are only permitted from the REPL for prototyping purposes.
If you did intend to index this array, annotate the caller with @allowscalar.

error(s::String) at error.jl:33
assertscalar(op::String) at GPUArraysCore.jl:100
getindex(::CUDA.CuArray{Float32, 4, CUDA.Mem.DeviceBuffer}, ::Int64, ::Int64, ::Int64, ::Int64) at indexing.jl:9
getindex at subarray.jl:276 [inlined]
_unsafe_getindex_rs at reshapedarray.jl:249 [inlined]
_unsafe_getindex at reshapedarray.jl:246 [inlined]
getindex at reshapedarray.jl:234 [inlined]
getindex at subarray.jl:276 [inlined]
_generic_matmatmul!(C::CUDA.CuArray{Float32, 2, CUDA.Mem.DeviceBuffer}, tA::Char, tB::Char, A::SubArray{Float32, 2, Base.ReshapedArray{Float32, 3, SubArray{Float32, 3, CUDA.CuArray{Float32, 4, CUDA.Mem.DeviceBuffer}, Tuple{Base.Slice{Base.OneTo{Int64}}, Int64, Base.Slice{Base.OneTo{Int64}}, Base.Slice{Base.OneTo{Int64}}}, false}, Tuple{Base.MultiplicativeInverses.SignedMultiplicativeInverse{Int64}, Base.MultiplicativeInverses.SignedMultiplicativeInverse{Int64}}}, Tuple{Base.Slice{Base.OneTo{Int64}}, Base.Slice{Base.OneTo{Int64}}, Int64}, false}, B::CUDA.CuArray{Float32, 2, CUDA.Mem.DeviceBuffer}, _add::LinearAlgebra.MulAddMul{true, true, Float32, Float32}) at matmul.jl:830
generic_matmatmul!(C::CUDA.CuArray{Float32, 2, CUDA.Mem.DeviceBuffer}, tA::Char, tB::Char, A::SubArray{Float32, 2, Base.ReshapedArray{Float32, 3, SubArray{Float32, 3, CUDA.CuArray{Float32, 4, CUDA.Mem.DeviceBuffer}, Tuple{Base.Slice{Base.OneTo{Int64}}, Int64, Base.Slice{Base.OneTo{Int64}}, Base.Slice{Base.OneTo{Int64}}}, false}, Tuple{Base.MultiplicativeInverses.SignedMultiplicativeInverse{Int64}, Base.MultiplicativeInverses.SignedMultiplicativeInverse{Int64}}}, Tuple{Base.Slice{Base.OneTo{Int64}}, Base.Slice{Base.OneTo{Int64}}, Int64}, false}, B::CUDA.CuArray{Float32, 2, CUDA.Mem.DeviceBuffer}, _add::LinearAlgebra.MulAddMul{true, true, Float32, Float32}) at matmul.jl:798
mul! at matmul.jl:302 [inlined]
batched_mul_generic!(C::CUDA.CuArray{Float32, 3, CUDA.Mem.DeviceBuffer}, A::Base.ReshapedArray{Float32, 3, SubArray{Float32, 3, CUDA.CuArray{Float32, 4, CUDA.Mem.DeviceBuffer}, Tuple{Base.Slice{Base.OneTo{Int64}}, Int64, Base.Slice{Base.OneTo{Int64}}, Base.Slice{Base.OneTo{Int64}}}, false}, Tuple{Base.MultiplicativeInverses.SignedMultiplicativeInverse{Int64}, Base.MultiplicativeInverses.SignedMultiplicativeInverse{Int64}}}, B::CUDA.CuArray{Float32, 3, CUDA.Mem.DeviceBuffer}, α::Float32, β::Float32) at batchedmul.jl:274
_batched_try_gemm! at batchedmul.jl:219 [inlined]
_batched_mul! at batchedmul.jl:211 [inlined]
batched_mul! at batchedmul.jl:205 [inlined]
batched_mul! at batchedmul.jl:205 [inlined]
_batched_mul at batchedmul.jl:61 [inlined]
batched_mul(A::Base.ReshapedArray{Float32, 3, SubArray{Float32, 3, CUDA.CuArray{Float32, 4, CUDA.Mem.DeviceBuffer}, Tuple{Base.Slice{Base.OneTo{Int64}}, Int64, Base.Slice{Base.OneTo{Int64}}, Base.Slice{Base.OneTo{Int64}}}, false}, Tuple{Base.MultiplicativeInverses.SignedMultiplicativeInverse{Int64}, Base.MultiplicativeInverses.SignedMultiplicativeInverse{Int64}}}, B::CUDA.CuArray{Float32, 3, CUDA.Mem.DeviceBuffer}) at batchedmul.jl:48
top-level scope at model.jl:135
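
A possible workaround, sketched below under the assumption that an extra copy is acceptable: materialize the slice into a plain CuArray before reshaping, so the reshape also returns a CuArray and batched_mul stays off the generic fallback.

# Workaround sketch (assumes an extra copy is acceptable): copy the view first,
# so the subsequent reshape returns a plain CuArray instead of a ReshapedArray
# wrapping a SubArray, and batched_mul can dispatch to CUBLAS.
v4c = copy(v4)                    # same contents as v4 (equivalently v3[:, 1, :, :])
v6c = reshape(v4c, size(v4c)...)  # still a CuArray, not a lazy wrapper
v6c ⊠ A3                          # runs on the GPU without scalar indexing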
@mcabbott
Member

I think it's taking the generic path here:

are_strided(_unbatch(A), _unbatch(B)) || return batched_mul_generic!(C, A, B, α, β)

because is_strided on this array says it's not strided:

julia> summary(v6)
"4×3×8 reshape(view(::CuArray{Float32, 4, CUDA.Mem.DeviceBuffer}, :, 1, :, :), 4, 3, 8) with eltype Float32"

julia> NNlib.is_strided(v6)  # NNlib's test for whether to use BLAS
false

julia> strides(v6)  # maybe this is OK?
(1, 8, 24)

julia> strides(copy(v6))
(1, 4, 12)

That test is intended to match what CUBLAS accepts here:

https://github.com/FluxML/NNlibCUDA.jl/blob/master/src/batchedmul.jl#L3-L4

and in this case, it doesn't seem to accept v6:

julia> CUDA.CUBLAS.gemm_strided_batched!('N', 'N', true, v4, A3, false, v4 ⊠ A3) |> summary
"4×3×8 CuArray{Float32, 3, CUDA.Mem.DeviceBuffer}"

julia> CUDA.CUBLAS.gemm_strided_batched!('N', 'N', true, v6, A3, false, v4 ⊠ A3) |> summary
ERROR: conversion to pointer not defined for Base.ReshapedArray{Float32, 3, SubArray{Float32, 3, CuArray{Float32, 4, CUDA.Mem.DeviceBuffer}, Tuple{Base.Slice{Base.OneTo{Int64}}, Int64, Base.Slice{Base.OneTo{Int64}}, Base.Slice{Base.OneTo{Int64}}}, false}, Tuple{Base.MultiplicativeInverses.SignedMultiplicativeInverse{Int64}, Base.MultiplicativeInverses.SignedMultiplicativeInverse{Int64}}}
Stacktrace:
 [1] error(s::String)
   @ Base ./error.jl:35
 [2] unsafe_convert(#unused#::Type{CuPtr{Float32}}, a::Base.ReshapedArray{Float32, 3, SubArray{Float32, 3, CuArray{Float32, 4, CUDA.Mem.DeviceBuffer}, Tuple{Base.Slice{Base.OneTo{Int64}}, Int64, Base.Slice{Base.OneTo{Int64}}, Base.Slice{Base.OneTo{Int64}}}, false}, Tuple{Base.MultiplicativeInverses.SignedMultiplicativeInverse{Int64}, Base.MultiplicativeInverses.SignedMultiplicativeInverse{Int64}}})
   @ CUDA ~/.julia/packages/CUDA/Ey3w2/src/pointer.jl:64

It might be possible to widen this... a very quick hack seems to run:

julia> Base.unsafe_convert(::Type{CuPtr{Float32}}, a::typeof(v6)) = Base.unsafe_convert(CuPtr{Float32}, parent(parent(a)))

julia> CUDA.CUBLAS.gemm_strided_batched!('N', 'N', true, v6, A3, false, v4 ⊠ A3) |> summary
"4×3×8 CuArray{Float32, 3, CUDA.Mem.DeviceBuffer}"

But I have not checked correctness, and in reality the method would need to work only for reshapes which are in fact sufficiently simple. After updating CUDA to allow this, the test used by NNlib could be widened to match.
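
For reference, a hypothetical sketch of what a widened method might look like; the signature and the offset computation are my guesses and untested, and a real method would also have to verify that the strides reported by the reshape describe the actual memory layout:

# Hypothetical sketch only (untested): forward pointer conversion through a
# reshape of a view of a CuArray, shifting the parent's pointer to the view's
# first element. Correctness still depends on strides(a) matching the real layout.
function Base.unsafe_convert(::Type{CuPtr{T}},
        a::Base.ReshapedArray{T, N, <:SubArray{T, M, <:CuArray{T}}}) where {T, N, M}
    v = parent(a)  # the SubArray
    p = parent(v)  # the underlying CuArray
    # 0-based linear offset of the view's first element inside the parent buffer
    offset = LinearIndices(p)[first.(parentindices(v))...] - 1
    return Base.unsafe_convert(CuPtr{T}, p) + offset * sizeof(T)
end

The result of the gemm_strided_batched! call could then be compared against copy(v4) ⊠ A3 (v6 is a same-size reshape of v4, so the contents agree) to check correctness.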

@ToucheSir
Member

I think CUDA.jl is a little gun-shy about accommodating too many wrappers after JuliaGPU/CUDA.jl#453. Worth a shot though.

@mcabbott
Member

This multi-wrapper case is a little easier, in that creating a pointer must either succeed or fail. CUDA certainly owns unsafe_convert(::Type{CuPtr{Float32}}, a::Any).

The worse case for multiple wrappers is where Base has fun(A::AbstractArray, B::AbstractArray), and trying to make dispatch catch things like fun(A::ReshapedArray{... CuArray}, B::Adjoint{..., CuArray}) leads to an explosion of methods. NNlib's dispatch on storage_type here is an attempt to avoid that (a possibly too-complicated one).
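
For illustration, the idea is roughly the following (a paraphrase with a made-up name, not NNlib's exact code): strip wrappers with parent until the type stops changing, then dispatch on the innermost array type rather than on the combination of wrappers.

# Rough paraphrase of the storage_type idea (hypothetical name, not NNlib's code):
# unwrap with `parent` until the type stops changing, so dispatch can look at the
# array that actually owns the memory instead of at the stack of wrappers.
unwrap_storage_type(A::AbstractArray) =
    typeof(parent(A)) === typeof(A) ? typeof(A) : unwrap_storage_type(parent(A))

# e.g. unwrap_storage_type(v6) is a CuArray type, even though v6 is a
# ReshapedArray wrapping a SubArray wrapping a CuArray.

Combined with the are_strided check above, this lets batched_mul pick the BLAS / CUBLAS path per storage type instead of per wrapper combination.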

@mcabbott added the CUDA label on Mar 14, 2024