
mul! performance regression on master #34013

Closed
daviehh opened this issue Dec 3, 2019 · 19 comments · Fixed by #34601
Labels
linear algebra (Linear algebra) · performance (Must go faster) · regression (Regression in behavior compared to a previous version)
Milestone
1.4

Comments

@daviehh
Contributor

daviehh commented Dec 3, 2019

On the current master branch, mul! of small matrices can be 10x slower than 1.3 (and also allocates)

julia> versioninfo()
Julia Version 1.4.0-DEV.556
Commit 4800158ef5* (2019-12-03 21:18 UTC)
Platform Info:
  OS: macOS (x86_64-apple-darwin19.0.0)
  CPU: Intel(R) Core(TM) i5-8259U CPU @ 2.30GHz
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-8.0.1 (ORCJIT, skylake)

test code:

using LinearAlgebra
using BenchmarkTools

ndim = 3

m1 = rand(ComplexF64,ndim,ndim);
m2 = rand(ComplexF64,ndim,ndim);
ou = rand(ComplexF64,ndim,ndim);

@btime mul!($ou, $m1, $m2);

With the release-1.3 version,

33.471 ns (0 allocations: 0 bytes)

on master:

394.428 ns (1 allocation: 16 bytes)

While the difference is only nanoseconds, when mul! sits in the innermost/hot loop this easily translates into a large performance degradation for workloads dominated by such small matrix products. Larger matrices (e.g. ndim = 30) appear unaffected (dispatched to BLAS?): 2.732 μs (0 allocations: 0 bytes) on 1.3 vs. 2.774 μs (0 allocations: 0 bytes) on master.
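For reference, the allocation can also be checked without BenchmarkTools via @allocated after a warm-up call (a minimal sketch; the exact byte count depends on the Julia build, the numbers in the comments are the ones from this report):

```julia
using LinearAlgebra

A = rand(ComplexF64, 3, 3)
B = rand(ComplexF64, 3, 3)
C = similar(A)

mul!(C, A, B)                     # warm up so compilation is not measured
bytes = @allocated mul!(C, A, B)  # 0 bytes on 1.3, 16 bytes on the regressed master
@assert C ≈ A * B                 # the result itself is correct either way
```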

@daviehh
Contributor Author

daviehh commented Dec 4, 2019

The slowdown likely happened somewhere between mul!/gemm_wrapper! and matmul3x3!:

t = 'N'
@btime LinearAlgebra.gemm_wrapper!($ou, $t, $t, $m1, $m2, 1.2, .5im)

388.228 ns (3 allocations: 80 bytes)

ad = LinearAlgebra.MulAddMul(1.2, 0.5im)
@btime LinearAlgebra.matmul3x3!($ou, $t, $t, $m1, $m2, $ad)

48.857 ns (0 allocations: 0 bytes)

But there's not much code in between:

function gemm_wrapper!(C::StridedVecOrMat{T}, tA::AbstractChar, tB::AbstractChar,
                       A::StridedVecOrMat{T}, B::StridedVecOrMat{T},
                       α::Number=true, β::Number=false) where {T<:BlasFloat}
    mA, nA = lapack_size(tA, A)
    mB, nB = lapack_size(tB, B)
    if nA != mB
        throw(DimensionMismatch("A has dimensions ($mA,$nA) but B has dimensions ($mB,$nB)"))
    end
    if C === A || B === C
        throw(ArgumentError("output matrix must not be aliased with input matrix"))
    end
    if mA == 0 || nA == 0 || nB == 0 || iszero(α)
        if size(C) != (mA, nB)
            throw(DimensionMismatch("C has dimensions $(size(C)), should have ($mA,$nB)"))
        end
        return _rmul_or_fill!(C, β)
    end
    if mA == 2 && nA == 2 && nB == 2
        return matmul2x2!(C, tA, tB, A, B, MulAddMul(α, β))
    end
    if mA == 3 && nA == 3 && nB == 3
        return matmul3x3!(C, tA, tB, A, B, MulAddMul(α, β))
    end


t = 'N'
al = 1.2
bt = .5im
@btime LinearAlgebra.matmul3x3!($ou, $t, $t, $m1, $m2, LinearAlgebra.MulAddMul($al, $bt))
@btime LinearAlgebra.matmul3x3!($ou, $t, $t, $m1, $m2, LinearAlgebra.MulAddMul(1.2, .5im))

returns:

359.817 ns (3 allocations: 80 bytes)
48.784 ns (0 allocations: 0 bytes)

@fredrikekre fredrikekre added linear algebra Linear algebra performance Must go faster labels Dec 4, 2019
@KristofferC KristofferC added this to the 1.4 milestone Dec 4, 2019
@KristofferC KristofferC added the regression Regression in behavior compared to a previous version label Dec 4, 2019
@dkarrasch
Member

So the issue seems to come from when the constructor MulAddMul is called, see #33743 (comment). In the release version, calling mul!(C, A, B) calls mul!(C, A, B, true, false), which does

alpha, beta = promote(α, β, zero(T))
if alpha isa T && beta isa T
    return gemm_wrapper!(C, 'N', 'N', A, B, MulAddMul(alpha, beta))
else
    return generic_matmatmul!(C, 'N', 'N', A, B, MulAddMul(α, β))
end

In #33743, we construct the MulAddMul one layer below, i.e., inside gemm_wrapper!, as suggested by @tkf. Unfortunately, the compiler does not propagate the constants, so the construction of MulAddMul(true, false) does not happen at compile time, which badly hurts performance, as was already noticed by @Jutho in #29634 (comment). Still, #33743 addressed other performance issues with adjoints/transposes and with number types different from Bool and the matrix eltype. Is there any way to help constant propagation here?
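The value-dependence can be seen directly (a small sketch; MulAddMul is internal and unexported, and its exact type parameters may differ between Julia versions):

```julia
using LinearAlgebra

# MulAddMul's type parameters encode isone(alpha) and iszero(beta), so the
# concrete type depends on the argument *values*, not just on their types:
t_default = typeof(LinearAlgebra.MulAddMul(true, false))  # alpha = 1, beta = 0
t_generic = typeof(LinearAlgebra.MulAddMul(1.2, 0.5im))
@assert t_default != t_generic
```

Without constant propagation the compiler therefore cannot know the return type of MulAddMul(α, β) at the call site.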

@daviehh
Contributor Author

daviehh commented Dec 4, 2019

@dkarrasch Thanks for the comment. It looks like Julia considers LinearAlgebra.MulAddMul(α, β) type-unstable; test with:

using LinearAlgebra

ndim = 3

a = rand(ComplexF64,ndim,ndim);
b = rand(ComplexF64,ndim,ndim);
c = rand(ComplexF64,ndim,ndim);

t = 'N'
al = 1.2
bt = .5im

@code_warntype LinearAlgebra.gemm_wrapper!(c, t, t, a, b, al, bt)

@dkarrasch
Member

Now that you mention it, it's not surprising: the exact type of MulAddMul (including its parameters) is constructed based on the values of α and β, via iszero and isone. The return values of those two functions are, AFAIK, never determined by the type alone. So maybe the issue is that by shifting the construction one layer down, we are missing a function barrier that we had in the v1.3 release?
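A trivial illustration of why this matters for inference: iszero and isone are value predicates, so two values of the same type can give different answers, and the compiler cannot in general infer their results from types alone.

```julia
# Same type (Float64), different answers — the result lives in the value
# domain, not the type domain:
@assert iszero(0.0) && !iszero(2.0)
@assert isone(1.0) && !isone(0.5)
```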

@dkarrasch
Member

No, actually, I think we do have function barriers, matmul2x2! and matmul3x3!, so it really seems to be the constant propagation issue, i.e., whether or not the MulAddMul object can be constructed at compile time.

@dkarrasch
Member

@daviehh Have you considered using StaticArrays.jl for the small matrix case?

using LinearAlgebra
using BenchmarkTools
using StaticArrays

ndim = 3

m1 = SMatrix{ndim,ndim}(rand(ComplexF64,ndim,ndim))
m2 = SMatrix{ndim,ndim}(rand(ComplexF64,ndim,ndim))
ou = MMatrix{ndim,ndim}(rand(ComplexF64,ndim,ndim))

@btime mul!($ou, $m1, $m2)

@daviehh
Contributor Author

daviehh commented Dec 7, 2019

That would be one way to work around this issue... but performance is the reason Julia has special code for 2x2 and 3x3 matrix multiplication in the first place, due to the overhead of sending small matrices to BLAS.gemm!(). Maybe rethink when to construct MulAddMul, or deal with constant propagation / function barriers for now?

I tried StaticArrays before; for one particular project of mine, the code needs to handle both large and small matrices, and it can be messy to maintain separate code paths for standard and static arrays and keep them in sync.

@KristofferC
Member

Bump, this needs to be fixed or worked around with some priority.

@tkf
Member

tkf commented Jan 15, 2020

It looks like a single @inline fixes it (by helping const prop, I guess). See #34384.

@daviehh
Contributor Author

daviehh commented Jan 15, 2020

Thanks for the fix! Since 2x2 and 3x3 matrix multiplication is handled by dedicated code paths, can they be covered by Nanosoldier's benchmark script? Possibly append 2 or 3 to the SIZES at

https://github.com/JuliaCI/BaseBenchmarks.jl/blob/406d5b1d11b498884e58e111d5745b29c9471d40/src/linalg/LinAlgBenchmarks.jl#L13

@KristofferC KristofferC added the triage This should be discussed on a triage call label Jan 28, 2020
@KristofferC
Member

Putting on triage since we need to decide between #34384 and #34394 for 1.4. What do people think?

@tkf
Member

tkf commented Jan 29, 2020

As I said in #34394 (comment), #34394 does not quite make sense to me, since MulAddMul has the information about 0/1 in the type domain precisely to hoist the if-branch out of the loop. (OK, to be fair, there is _modify! for supporting BigFloat, so this is not really true anymore.) If we go with #34394, I think it's better to just remove MulAddMul entirely (which, of course, can happen after #34394 so as not to block the release).
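A minimal sketch of what "information in the type domain" buys (the Scale type and scale_add! function are hypothetical illustrations, not the actual MulAddMul code): when the 0-ness of beta is a type parameter, the branch inside the loop is a compile-time constant and can be hoisted out.

```julia
# Hypothetical illustration: `bis0` is a type parameter, so the ternary below
# is resolved at compile time for each concrete Scale type, and the branch
# can be eliminated from the hot loop.
struct Scale{bis0} end

function scale_add!(C, X, ::Scale{bis0}) where {bis0}
    @inbounds for i in eachindex(C, X)
        C[i] = bis0 ? X[i] : X[i] + C[i]
    end
    return C
end
```

With Scale{true}(), C is overwritten without ever being read, e.g. scale_add!(fill(NaN, 3), ones(3), Scale{true}()) yields ones, mirroring the 5-arg mul! contract for beta == 0.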

@tkf
Member

tkf commented Jan 29, 2020

As a longer term solution, I think it's better to defer bringing 0'ness to the type domain as late as possible (which is the opposite of what we are doing ATM). I think something like this might work:

struct Zero <: Number end
Base.:*(::Zero, x) = zero(x)
Base.:*(x, ::Zero) = zero(x)

function mul_impl!(C, A, B, alpha, beta)
    iszero(beta) && !(beta isa Zero) && return mul_impl!(C, A, B, alpha, Zero())
    ...
end
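A runnable scalar sketch of this idea (Zero and addmul are illustrative names, not LinearAlgebra API; note the isa check is done before iszero so that iszero is never called on Zero itself, for which no generic zero is defined here):

```julia
struct Zero <: Number end
Base.:*(::Zero, x::Number) = zero(x)
Base.:*(x::Number, ::Zero) = zero(x)

# Scalar stand-in for C = A*B*alpha + C*beta: once a runtime-zero beta is
# rewritten to Zero(), `c * beta` folds to zero(c), so the old value of c
# is provably never read.
addmul(c, a, b, alpha, beta) =
    !(beta isa Zero) && iszero(beta) ? addmul(c, a, b, alpha, Zero()) :
    a * b * alpha + c * beta
```

With this, addmul(NaN, 2.0, 3.0, 1.0, 0.0) returns 6.0 rather than NaN, which is exactly the beta == 0 semantics 5-arg mul! wants.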

@dkarrasch
Member

It seems that #34384 together with #29634 (comment) is the better solution. This should make everybody happy for the moment (i.e., for v1.4), since it makes 3- and 5-arg mul! for small matrices and multiplication of structured matrices performant. From there, we can think about further improvements. I think @daviehh suggested to add the 2x2 and 3x3 multiplication tests into the benchmark suite. That would help the future development.

@tkf
Member

tkf commented Jan 29, 2020

I included #29634 (comment) in #34384

@daviehh
Contributor Author

daviehh commented Jan 29, 2020

Yes, #34384 together with #29634 (comment) is the better solution, thanks a lot! I've also added tests for 3- and 5-arg mul! with 2x2 and 3x3 matrices in PR JuliaCI/BaseBenchmarks.jl#255.

@tkf
Member

tkf commented Jan 31, 2020

#34601 is yet another PR to fix this. It doesn't add the @inline.

@daviehh
Contributor Author

daviehh commented Feb 2, 2020

@KristofferC Thanks! Sorry for pinging/nagging, but can the corresponding Nanosoldier script be updated to include the benchmark tests for 2x2 and 3x3 matrix products? I've submitted a PR at JuliaCI/BaseBenchmarks.jl#255.

@KristofferC
Member

Merged, but the Nanosoldier bot also needs to be updated to use the latest BaseBenchmarks for it to take effect.

6 participants