
mul! performance regression on master #34013

Closed
daviehh opened this issue Dec 3, 2019 · 19 comments · Fixed by #34601
Labels
linear algebra (Linear algebra) · performance (Must go faster) · regression (Regression in behavior compared to a previous version)
Milestone
1.4

Comments

@daviehh
Contributor

daviehh commented Dec 3, 2019

On the current master branch, mul! of small matrices can be 10x slower than 1.3 (and also allocates)

julia> versioninfo()
Julia Version 1.4.0-DEV.556
Commit 4800158ef5* (2019-12-03 21:18 UTC)
Platform Info:
  OS: macOS (x86_64-apple-darwin19.0.0)
  CPU: Intel(R) Core(TM) i5-8259U CPU @ 2.30GHz
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-8.0.1 (ORCJIT, skylake)

test code:

using LinearAlgebra
using BenchmarkTools

ndim = 3

m1 = rand(ComplexF64,ndim,ndim);
m2 = rand(ComplexF64,ndim,ndim);
ou = rand(ComplexF64,ndim,ndim);

@btime mul!($ou, $m1, $m2);

With the release-1.3 version,

33.471 ns (0 allocations: 0 bytes)

on master:

394.428 ns (1 allocation: 16 bytes)

While the difference is only nanoseconds, when mul! sits in the innermost/hot loop this easily translates into a large performance degradation for workloads dominated by such small matrix products. Larger matrices (e.g. ndim = 30) appear unaffected (dispatched to BLAS?): 2.732 μs (0 allocations: 0 bytes) on 1.3 vs. 2.774 μs (0 allocations: 0 bytes) on master.
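For reference, the allocation can also be checked without BenchmarkTools via @allocated after a warm-up call (a minimal sketch; the exact byte count depends on the Julia build, the numbers in the comments are the ones from this report):

```julia
using LinearAlgebra

A = rand(ComplexF64, 3, 3)
B = rand(ComplexF64, 3, 3)
C = similar(A)

mul!(C, A, B)                     # warm up so compilation is not measured
bytes = @allocated mul!(C, A, B)  # 0 bytes on 1.3, 16 bytes on the regressed master
@assert C ≈ A * B                 # the result itself is correct either way
```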

@daviehh
Contributor Author

daviehh commented Dec 4, 2019

The slowdown likely happened somewhere between mul!/gemm_wrapper! and matmul3x3!:

t = 'N'
@btime LinearAlgebra.gemm_wrapper!($ou, $t, $t, $m1, $m2, 1.2, .5im)

388.228 ns (3 allocations: 80 bytes)

ad = LinearAlgebra.MulAddMul(1.2, 0.5im)
@btime LinearAlgebra.matmul3x3!($ou, $t, $t, $m1, $m2, $ad)

48.857 ns (0 allocations: 0 bytes)

But there's not much code in between:

function gemm_wrapper!(C::StridedVecOrMat{T}, tA::AbstractChar, tB::AbstractChar,
                       A::StridedVecOrMat{T}, B::StridedVecOrMat{T},
                       α::Number=true, β::Number=false) where {T<:BlasFloat}
    mA, nA = lapack_size(tA, A)
    mB, nB = lapack_size(tB, B)
    if nA != mB
        throw(DimensionMismatch("A has dimensions ($mA,$nA) but B has dimensions ($mB,$nB)"))
    end
    if C === A || B === C
        throw(ArgumentError("output matrix must not be aliased with input matrix"))
    end
    if mA == 0 || nA == 0 || nB == 0 || iszero(α)
        if size(C) != (mA, nB)
            throw(DimensionMismatch("C has dimensions $(size(C)), should have ($mA,$nB)"))
        end
        return _rmul_or_fill!(C, β)
    end
    if mA == 2 && nA == 2 && nB == 2
        return matmul2x2!(C, tA, tB, A, B, MulAddMul(α, β))
    end
    if mA == 3 && nA == 3 && nB == 3
        return matmul3x3!(C, tA, tB, A, B, MulAddMul(α, β))
    end


t = 'N'
al = 1.2
bt = .5im
@btime LinearAlgebra.matmul3x3!($ou, $t, $t, $m1, $m2, LinearAlgebra.MulAddMul($al, $bt))
@btime LinearAlgebra.matmul3x3!($ou, $t, $t, $m1, $m2, LinearAlgebra.MulAddMul(1.2, .5im))

returns:

359.817 ns (3 allocations: 80 bytes)
48.784 ns (0 allocations: 0 bytes)

@fredrikekre fredrikekre added linear algebra Linear algebra performance Must go faster labels Dec 4, 2019
@KristofferC KristofferC added this to the 1.4 milestone Dec 4, 2019
@KristofferC KristofferC added the regression Regression in behavior compared to a previous version label Dec 4, 2019
@dkarrasch
Member

So the issue seems to come from when the constructor MulAddMul is called, see #33743 (comment). In the release version, calling mul!(C, A, B) calls mul!(C, A, B, true, false), which does

alpha, beta = promote(α, β, zero(T))
if alpha isa T && beta isa T
    return gemm_wrapper!(C, 'N', 'N', A, B, MulAddMul(alpha, beta))
else
    return generic_matmatmul!(C, 'N', 'N', A, B, MulAddMul(α, β))
end

In #33743, we construct the MulAddMul one layer below, i.e., inside gemm_wrapper!, as suggested by @tkf. Unfortunately, the compiler does not propagate the constants, so the construction of MulAddMul(true, false) does not happen at compile time, which badly hurts performance, as was already noticed by @Jutho in #29634 (comment). Still, #33743 addressed other performance issues with adjoints/transposes and with number types different from Bool and the matrix eltype. Is there any way to help constant propagation here?
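The value-dependence can be seen directly (a small sketch; MulAddMul is internal and unexported, and its exact type parameters may differ between Julia versions):

```julia
using LinearAlgebra

# MulAddMul's type parameters encode isone(alpha) and iszero(beta), so the
# concrete type depends on the argument *values*, not just on their types:
t_default = typeof(LinearAlgebra.MulAddMul(true, false))  # alpha = 1, beta = 0
t_generic = typeof(LinearAlgebra.MulAddMul(1.2, 0.5im))
@assert t_default != t_generic
```

Without constant propagation the compiler therefore cannot know the return type of MulAddMul(α, β) at the call site.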

@daviehh
Contributor Author

daviehh commented Dec 4, 2019

@dkarrasch Thanks for the comment. It looks like Julia considers LinearAlgebra.MulAddMul(α, β) type-unstable; test with:

using LinearAlgebra

ndim = 3

a = rand(ComplexF64,ndim,ndim);
b = rand(ComplexF64,ndim,ndim);
c = rand(ComplexF64,ndim,ndim);

t = 'N'
al = 1.2
bt = .5im

@code_warntype LinearAlgebra.gemm_wrapper!(c, t, t, a, b, al, bt)

@dkarrasch
Member

Now that you mention it, it's not surprising: the exact type of MulAddMul (including its parameters) is constructed based on the values of α and β, via iszero and isone. The return values of those two functions are, AFAIK, never determined by the type alone. So maybe the issue is that by shifting the construction one layer down, we are missing a function barrier that we had in the v1.3 release?
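A trivial illustration of why this matters for inference: iszero and isone are value predicates, so two values of the same type can give different answers, and the compiler cannot in general infer their results from types alone.

```julia
# Same type (Float64), different answers — the result lives in the value
# domain, not the type domain:
@assert iszero(0.0) && !iszero(2.0)
@assert isone(1.0) && !isone(0.5)
```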

@dkarrasch
Member

No, actually, I think we do have function barriers, matmul2x2! and matmul3x3!, so it really seems to be the constant propagation issue, i.e., whether or not the MulAddMul object can be constructed at compile time.

@dkarrasch
Member

@daviehh Have you considered using StaticArrays.jl for the small matrix case?

using LinearAlgebra
using BenchmarkTools
using StaticArrays

ndim = 3

m1 = SMatrix{ndim,ndim}(rand(ComplexF64,ndim,ndim))
m2 = SMatrix{ndim,ndim}(rand(ComplexF64,ndim,ndim))
ou = MMatrix{ndim,ndim}(rand(ComplexF64,ndim,ndim))

@btime mul!($ou, $m1, $m2)

@daviehh
Contributor Author

daviehh commented Dec 7, 2019

That would be one way to work around this issue... but performance is the reason Julia has special code for 2x2 and 3x3 matrix multiplication in the first place, due to the overhead of sending small matrices to BLAS.gemm!(). Maybe rethink when to construct MulAddMul, or deal with constant propagation / function barriers for now?

I tried StaticArrays before; for one particular project of mine, the code needs to handle both large and small matrices, and it can be messy to maintain separate code paths for standard and static arrays and keep them in sync.

@KristofferC
Member

Bump, this needs to be fixed or worked around with some priority.

@tkf
Member

tkf commented Jan 15, 2020

It looks like a single @inline fixes it (by helping const prop, I guess). See #34384.

@daviehh
Contributor Author

daviehh commented Jan 15, 2020

Thanks for the fix! Since 2x2 and 3x3 matrix multiplication is handled by dedicated code paths, can they be covered by Nanosoldier's benchmark script? Possibly append 2 or 3 to the SIZES at

https://github.com/JuliaCI/BaseBenchmarks.jl/blob/406d5b1d11b498884e58e111d5745b29c9471d40/src/linalg/LinAlgBenchmarks.jl#L13

@KristofferC KristofferC added the triage This should be discussed on a triage call label Jan 28, 2020
@KristofferC
Member

Putting on triage since we need to decide between #34384 and #34394 for 1.4. What do people think?

@tkf
Member

tkf commented Jan 29, 2020

As I said in #34394 (comment), #34394 does not quite make sense to me, since MulAddMul has the information about 0/1 in the type domain precisely to hoist the if-branch out of the loop. (OK, to be fair, there is _modify! for supporting BigFloat, so this is not really true anymore.) If we go with #34394, I think it's better to just remove MulAddMul entirely (which, of course, can happen after #34394 so as not to block the release).
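A minimal sketch of what "information in the type domain" buys (the Scale type and scale_add! function are hypothetical illustrations, not the actual MulAddMul code): when the 0-ness of beta is a type parameter, the branch inside the loop is a compile-time constant and can be hoisted out.

```julia
# Hypothetical illustration: `bis0` is a type parameter, so the ternary below
# is resolved at compile time for each concrete Scale type, and the branch
# can be eliminated from the hot loop.
struct Scale{bis0} end

function scale_add!(C, X, ::Scale{bis0}) where {bis0}
    @inbounds for i in eachindex(C, X)
        C[i] = bis0 ? X[i] : X[i] + C[i]
    end
    return C
end
```

With Scale{true}(), C is overwritten without ever being read, e.g. scale_add!(fill(NaN, 3), ones(3), Scale{true}()) yields ones, mirroring the 5-arg mul! contract for beta == 0.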

@tkf
Member

tkf commented Jan 29, 2020

As a longer term solution, I think it's better to defer bringing 0'ness to the type domain as late as possible (which is the opposite of what we are doing ATM). I think something like this might work:

struct Zero <: Number end
Base.:*(::Zero, x) = zero(x)
Base.:*(x, ::Zero) = zero(x)

function mul_impl!(C, A, B, alpha, beta)
    iszero(beta) && !(beta isa Zero) && return mul_impl!(C, A, B, alpha, Zero())
    ...
end
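A runnable scalar sketch of this idea (Zero and addmul are illustrative names, not LinearAlgebra API; note the isa check is done before iszero so that iszero is never called on Zero itself, for which no generic zero is defined here):

```julia
struct Zero <: Number end
Base.:*(::Zero, x::Number) = zero(x)
Base.:*(x::Number, ::Zero) = zero(x)

# Scalar stand-in for C = A*B*alpha + C*beta: once a runtime-zero beta is
# rewritten to Zero(), `c * beta` folds to zero(c), so the old value of c
# is provably never read.
addmul(c, a, b, alpha, beta) =
    !(beta isa Zero) && iszero(beta) ? addmul(c, a, b, alpha, Zero()) :
    a * b * alpha + c * beta
```

With this, addmul(NaN, 2.0, 3.0, 1.0, 0.0) returns 6.0 rather than NaN, which is exactly the beta == 0 semantics 5-arg mul! wants.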

@dkarrasch
Member

It seems that #34384 together with #29634 (comment) is the better solution. This should make everybody happy for the moment (i.e., for v1.4), since it makes 3- and 5-arg mul! for small matrices and multiplication of structured matrices performant. From there, we can think about further improvements. I think @daviehh suggested to add the 2x2 and 3x3 multiplication tests into the benchmark suite. That would help the future development.

@tkf
Member

tkf commented Jan 29, 2020

I included #29634 (comment) in #34384

@daviehh
Contributor Author

daviehh commented Jan 29, 2020

Yes, #34384 together with #29634 (comment) is the better solution, thanks a lot! I've also added tests for 3- and 5-arg mul! with 2x2 and 3x3 matrices in PR JuliaCI/BaseBenchmarks.jl#255.

@tkf
Member

tkf commented Jan 31, 2020

#34601 is yet another PR to fix this. It doesn't add the @inline.

@daviehh
Contributor Author

daviehh commented Feb 2, 2020

@KristofferC Thanks! Sorry for pinging/nagging, but can the corresponding Nanosoldier script be updated to include the benchmark tests for 2x2 and 3x3 matrix products? I've submitted a PR at JuliaCI/BaseBenchmarks.jl#255.

@KristofferC
Member

Merged, but the Nanosoldier bot also needs to be updated to use the latest BaseBenchmarks for it to take effect.

6 participants