Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Move reinterpret-based optimization for complex matrix * real vec/mat to lower level. #44052

Merged
merged 3 commits into from
Feb 9, 2022

Conversation

N5N3
Copy link
Member

@N5N3 N5N3 commented Feb 6, 2022

Inspired by #44011.
This has 2 advantages:

  1. reinterpret-based optimiztion is faster only for BLAS-compatible cases. This PR ensures that it does not cause regression.
  2. reinterpret(T, A::StridedArray) might not be a StridedArray. But we can still call BLAS api as we have strides defined.

The current test should cover all cases. (see #29246)
Edit: Test size is enlarged to 5x5 to enable BLAS fallback.

@N5N3 N5N3 requested a review from dkarrasch February 6, 2022 14:36
@N5N3 N5N3 added domain:linear algebra Linear algebra performance Must go faster labels Feb 6, 2022
@dkarrasch
Copy link
Member

LGTM. The linux32 test failure is concerning, as it seems to hang in the early linalg tests, one of which is the matmul tests touched here. I'll restart CI.

@dkarrasch dkarrasch closed this Feb 7, 2022
@dkarrasch dkarrasch reopened this Feb 7, 2022
@Moelf
Copy link
Sponsor Contributor

Moelf commented Feb 7, 2022

@dkarrasch
Copy link
Member

https://discourse.julialang.org/t/in-place-multiplication-is-too-much-slower-for-complexf64/75887/

while you're at it do you want to fix this too?

I guess that's taken care of by #44011?

@N5N3
Copy link
Member Author

N5N3 commented Feb 7, 2022

In general we don't want extra allocation when use mul!(C,A,B). Thus the convert-based optimization is applied only to A*B.
Unfortunately, BLAS's gemm only support inputs with dense column. (Thus #44011 only handles "mat * vec" case)
MKL has a extended api cblas_?gemv_batch_strided. Looks like we can use it to do gemm with non-dense column.
But I never used it.
Edit: have some try

using MKL_jll
const Lib = MKL_jll.libmkl_rt
import LinearAlgebra.BlasInt
import LinearAlgebra.BLAS: @blasfunc
for (gemv_batch_strided, elty) in ((:(:dgemv_batch_strided),:Float64))
    @eval function gemv_batch_strided!(transA::AbstractChar,
                alpha::Union{($elty), Bool},
                A::AbstractMatrix{$elty}, B::AbstractVecOrMat{$elty},
                beta::Union{($elty), Bool},
                C::AbstractVecOrMat{$elty})
        m, n = size(A)
        batch = size(C, 2)
        ccall(($gemv_batch_strided, Lib), Cvoid,
            (Ref{UInt8}, Ref{BlasInt}, Ref{BlasInt}, Ref{$elty}, Ptr{$elty}, Ref{BlasInt}, Ref{BlasInt},
            Ptr{$elty}, Ref{BlasInt}, Ref{BlasInt}, Ref{$elty}, Ptr{$elty}, Ref{BlasInt}, Ref{BlasInt},Ref{BlasInt}),
            transA, m, n, alpha, A, max(1,stride(A,2)), 0, 
            B, stride(B, 1), stride(B, 2), beta, C, stride(C, 1), stride(C, 2), batch)
        C
    end
end

The performance is not good though:

julia> a = randn(1000, 1000);
julia> b = view(randn(2000, 1000), 1:2:2000, :);
julia> c = zeros(1000,1000);

julia> gemv_batch_strided!('N', true, a, b, false, c)  a * b
true

julia> @btime gemv_batch_strided!('N', true, $a, $b, false, $c);
  39.001 ms (0 allocations: 0 bytes) # MKL use 6 threads

julia> @btime mul!($c,$a,$b)
 687.478 ms (3 allocations: 30.75 KiB) # general matmul only use 1 thread

1. Enlarge size from 2x2 to 5x5 to enable BLAS fallback.
2. Test more strides. 1, 2, and -1 (-1 only for vector)
3. Only test Complex * Real.
@dkarrasch dkarrasch merged commit 0bdf24f into JuliaLang:master Feb 9, 2022
@N5N3 N5N3 deleted the more_limitation branch February 10, 2022 08:51
antoine-levitt pushed a commit to antoine-levitt/julia that referenced this pull request Feb 17, 2022
LilithHafner pushed a commit to LilithHafner/julia that referenced this pull request Feb 22, 2022
LilithHafner pushed a commit to LilithHafner/julia that referenced this pull request Mar 8, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
domain:linear algebra Linear algebra performance Must go faster
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants