Move `reinterpret`-based optimization for complex matrix * real vec/mat to lower level. #44052

N5N3 · 2022-02-06T14:36:41Z

Inspired by #44011.
This has 2 advantages:

reinterpret-based optimiztion is faster only for BLAS-compatible cases. This PR ensures that it does not cause regression.
reinterpret(T, A::StridedArray) might not be a StridedArray. But we can still call BLAS api as we have strides defined.

~~The current test should cover all cases. (see #29246)~~
Edit: Test size is enlarged to 5x5 to enable BLAS fallback.

dkarrasch · 2022-02-07T15:13:49Z

LGTM. The linux32 test failure is concerning, as it seems to hang in the early linalg tests, one of which is the matmul tests touched here. I'll restart CI.

Moelf · 2022-02-07T17:45:27Z

https://discourse.julialang.org/t/in-place-multiplication-is-too-much-slower-for-complexf64/75887/

while you're at it do you want to fix this too?

dkarrasch · 2022-02-07T18:05:29Z

https://discourse.julialang.org/t/in-place-multiplication-is-too-much-slower-for-complexf64/75887/

while you're at it do you want to fix this too?

I guess that's taken care of by #44011?

N5N3 · 2022-02-07T18:06:17Z

In general we don't want extra allocation when use mul!(C,A,B). Thus the convert-based optimization is applied only to A*B.
Unfortunately, BLAS's gemm only support inputs with dense column. (Thus #44011 only handles "mat * vec" case)
MKL has a extended api cblas_?gemv_batch_strided. Looks like we can use it to do gemm with non-dense column.
But I never used it. Edit: have some try

using MKL_jll
const Lib = MKL_jll.libmkl_rt
import LinearAlgebra.BlasInt
import LinearAlgebra.BLAS: @blasfunc
for (gemv_batch_strided, elty) in ((:(:dgemv_batch_strided),:Float64))
    @eval function gemv_batch_strided!(transA::AbstractChar,
                alpha::Union{($elty), Bool},
                A::AbstractMatrix{$elty}, B::AbstractVecOrMat{$elty},
                beta::Union{($elty), Bool},
                C::AbstractVecOrMat{$elty})
        m, n = size(A)
        batch = size(C, 2)
        ccall(($gemv_batch_strided, Lib), Cvoid,
            (Ref{UInt8}, Ref{BlasInt}, Ref{BlasInt}, Ref{$elty}, Ptr{$elty}, Ref{BlasInt}, Ref{BlasInt},
            Ptr{$elty}, Ref{BlasInt}, Ref{BlasInt}, Ref{$elty}, Ptr{$elty}, Ref{BlasInt}, Ref{BlasInt},Ref{BlasInt}),
            transA, m, n, alpha, A, max(1,stride(A,2)), 0, 
            B, stride(B, 1), stride(B, 2), beta, C, stride(C, 1), stride(C, 2), batch)
        C
    end
end

The performance is not good though:

julia> a = randn(1000, 1000);
julia> b = view(randn(2000, 1000), 1:2:2000, :);
julia> c = zeros(1000,1000);

julia> gemv_batch_strided!('N', true, a, b, false, c) ≈ a * b
true

julia> @btime gemv_batch_strided!('N', true, $a, $b, false, $c);
  39.001 ms (0 allocations: 0 bytes) # MKL use 6 threads

julia> @btime mul!($c,$a,$b)
 687.478 ms (3 allocations: 30.75 KiB) # general matmul only use 1 thread

No more test needed

1. Enlarge size from 2x2 to 5x5 to enable BLAS fallback. 2. Test more strides. 1, 2, and -1 (-1 only for vector) 3. Only test Complex * Real.

…at to lower level. (JuliaLang#44052)

N5N3 requested a review from dkarrasch February 6, 2022 14:36

N5N3 added domain:linear algebra Linear algebra performance Must go faster labels Feb 6, 2022

dkarrasch closed this Feb 7, 2022

dkarrasch reopened this Feb 7, 2022

dkarrasch approved these changes Feb 8, 2022

View reviewed changes

N5N3 added 3 commits February 8, 2022 20:56

Limit complex * real optimization. (gemv)

0a5f799

Limit complex * real optimization. (gemm)

f153d26

No more test needed

Test fix

e583a60

1. Enlarge size from 2x2 to 5x5 to enable BLAS fallback. 2. Test more strides. 1, 2, and -1 (-1 only for vector) 3. Only test Complex * Real.

N5N3 force-pushed the more_limitation branch from 2e0912d to e583a60 Compare February 8, 2022 12:58

dkarrasch merged commit 0bdf24f into JuliaLang:master Feb 9, 2022

N5N3 deleted the more_limitation branch February 10, 2022 08:51

antoine-levitt pushed a commit to antoine-levitt/julia that referenced this pull request Feb 17, 2022

Move reinterpret-based optimization for complex matrix * real vec/m…

31774ab

…at to lower level. (JuliaLang#44052)

LilithHafner pushed a commit to LilithHafner/julia that referenced this pull request Feb 22, 2022

Move reinterpret-based optimization for complex matrix * real vec/m…

042e7b5

…at to lower level. (JuliaLang#44052)

LilithHafner pushed a commit to LilithHafner/julia that referenced this pull request Mar 8, 2022

Move reinterpret-based optimization for complex matrix * real vec/m…

f8489e9

…at to lower level. (JuliaLang#44052)

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Move `reinterpret`-based optimization for complex matrix * real vec/mat to lower level. #44052

Move `reinterpret`-based optimization for complex matrix * real vec/mat to lower level. #44052

N5N3 commented Feb 6, 2022 •

edited

Loading

dkarrasch commented Feb 7, 2022

Moelf commented Feb 7, 2022 •

edited

Loading

dkarrasch commented Feb 7, 2022

N5N3 commented Feb 7, 2022 •

edited

Loading

Move reinterpret-based optimization for complex matrix * real vec/mat to lower level. #44052

Move reinterpret-based optimization for complex matrix * real vec/mat to lower level. #44052

Conversation

N5N3 commented Feb 6, 2022 • edited Loading

dkarrasch commented Feb 7, 2022

Moelf commented Feb 7, 2022 • edited Loading

dkarrasch commented Feb 7, 2022

N5N3 commented Feb 7, 2022 • edited Loading

Move `reinterpret`-based optimization for complex matrix * real vec/mat to lower level. #44052

Move `reinterpret`-based optimization for complex matrix * real vec/mat to lower level. #44052

N5N3 commented Feb 6, 2022 •

edited

Loading

Moelf commented Feb 7, 2022 •

edited

Loading

N5N3 commented Feb 7, 2022 •

edited

Loading