
Add simple 3- and 4-arg * methods #919

Open
wants to merge 2 commits into base: master

Conversation

mcabbott
Collaborator

@mcabbott mcabbott commented Jun 7, 2021

This PR repairs a small slowdown introduced by JuliaLang/julia#37898. When all matrices are the same size, the generic method wastes a few ns deciding on a multiplication order; this PR restores plain left-to-right evaluation:

julia> s1,s2,s3,s4,s5 = fill(3,5);

julia> a=@SMatrix rand(s1,s2); b=@SMatrix rand(s2,s3); c=@SMatrix rand(s3,s4); d=@SMatrix rand(s4,s5);

julia> @btime *($(Ref(a))[],$(Ref(b))[],$(Ref(c))[],$(Ref(d))[]);  # after 37898, this thinks about the sizes
  9.291 ns (0 allocations: 0 bytes)

julia> @btime (($(Ref(a))[] * $(Ref(b))[]) * $(Ref(c))[]) * $(Ref(d))[];  # simply left to right
  5.208 ns (0 allocations: 0 bytes)

When the matrices are of different sizes, things aren't so clear-cut. Sometimes the re-ordering does speed things up, sometimes it doesn't.
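For context, here is a back-of-the-envelope sketch (plain Julia, not from the PR; the dimensions are illustrative) of why the ordering check can pay off for non-square chains, counting scalar multiplications for A (n1×n2), B (n2×n3), C (n3×n4), D (n4×n5):

```julia
# Illustrative chain that ends in a column, where ordering matters a lot.
n1, n2, n3, n4, n5 = 10, 10, 10, 10, 1

# ((A*B)*C)*D — strict left-to-right
left  = n1*n2*n3 + n1*n3*n4 + n1*n4*n5   # 1000 + 1000 + 100 = 2100

# A*(B*(C*D)) — right-to-left, much cheaper since D is a column
right = n3*n4*n5 + n2*n3*n5 + n1*n2*n5   # 100 + 100 + 100 = 300
```

For square chains both sums coincide, which is why the PR can skip the calculation entirely in that case.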

@mateuszbaran
Collaborator

I think that, ideally, the 3- and 4-arg methods of * would decide at compile time what the best order is, using generated functions and your cost model from that PR. Could you add that, or do you need some help?
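A self-contained sketch of that idea, using a toy size-in-the-type matrix as a stand-in for SMatrix (the names SizedMat and mul3 are hypothetical, and the cost model here is just the naive multiplication count, not the one from the PR):

```julia
# Toy stand-in for a static matrix: the size lives in the type parameters,
# as it does for StaticArrays.SMatrix. Hypothetical, for illustration only.
struct SizedMat{M,N} <: AbstractMatrix{Float64}
    data::Matrix{Float64}
end
SizedMat(A::Matrix{Float64}) = SizedMat{size(A,1),size(A,2)}(A)
Base.size(::SizedMat{M,N}) where {M,N} = (M, N)
Base.getindex(A::SizedMat, i::Int, j::Int) = A.data[i, j]
Base.:*(A::SizedMat{M,K}, B::SizedMat{K,N}) where {M,K,N} = SizedMat{M,N}(A.data * B.data)

# Pick the association at compile time from the static sizes,
# using the naive scalar-multiplication count as the cost model.
@generated function mul3(A::SizedMat{N1,N2}, B::SizedMat{N2,N3},
                         C::SizedMat{N3,N4}) where {N1,N2,N3,N4}
    left  = N1*N2*N3 + N1*N3*N4   # cost of (A*B)*C
    right = N2*N3*N4 + N1*N2*N4   # cost of A*(B*C)
    left <= right ? :((A * B) * C) : :(A * (B * C))
end
```

Because the comparison happens inside the @generated body, the chosen association is baked into the compiled method and no size arithmetic survives to run time.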

@@ -15,11 +15,15 @@ import LinearAlgebra: BlasFloat, matprod, mul!
@inline *(A::StaticArray{Tuple{N,1},<:Any,2}, B::Adjoint{<:Any,<:StaticVector}) where {N} = vec(A) * B
@inline *(A::StaticArray{Tuple{N,1},<:Any,2}, B::Transpose{<:Any,<:StaticVector}) where {N} = vec(A) * B

# Avoid LinearAlgebra._quad_matmul's order calculation on equal sizes
@inline *(A::StaticMatrix{N,N}, B::StaticMatrix{N,N}, C::StaticMatrix{N,N}) where {N} = (A*B)*C
Collaborator

You could also use StaticMatMulLike here to be a bit more generic, although currently StaticMatMulLike doesn't swap sizes for adjoints and transposes, so that would have to be fixed when working on the cost model.

Collaborator Author

I see that StaticArrays.StaticMatMulLike{3,10} is a union of three different sizes; is that deliberate or an oversight? It seems like it really ought to be StaticSquareLike{3} for all the square ones, and StaticMatLike{3,10} allowing non-square but always size(A) == (3,10).

Collaborator

Ultimately I ended up not using the size part of StaticMatMulLike so the fact that s1 and s2 don't correspond to sizes is essentially an oversight.

@mateuszbaran
Collaborator

BTW, I think you should reduce the cost of multiplying by diagonal matrices (and probably also things like bidiagonal or tridiagonal) in that PR to Base.

@mcabbott
Collaborator Author

mcabbott commented Jun 8, 2021

The base PR doesn't do anything for special matrices, since they are all square. But the cost of checking their sizes unnecessarily is usually negligible -- only tiny SMatrices are fast enough that it matters, which is the case this PR wants to fix. I agree that things like Diagonal * Diagonal * Diagonal could be further optimised; there's quite a large set of possibilities, and I haven't given any thought to how that might work.

I didn't know about this, but StaticArrays.StaticMatMulLike{N,N} looks like a union of static square things which could be used in place of StaticMatrix{N,N} here.

For non-square matrices, I agree you could pull the cost calculation into a generated function. I think I'm going to leave that for another PR, or for someone else if they need it. This one just wants to avoid slowing down a common case.

@dkarrasch
Contributor

I'm a bit surprised the cost computation isn't compiled away, actually, given that sizes of SMatrices are known at compile time and that this is all that is required to compute the costs.

@mateuszbaran
Collaborator

Yes, right: optimizing the product of two small matrices was complicated enough, and developing a reliable set of heuristics for products of more matrices would be even harder.

Anyway, it's indeed interesting that constant propagation didn't manage to simplify the size checks here.

@mcabbott
Collaborator Author

mcabbott commented Jun 8, 2021

I was also surprised it didn't get compiled away. Maybe it just needs more things inlined: if I try @eval LinearAlgebra @inline function _quad_matmul(A,B,C,D) ... then the time difference in the first message above goes away. Maybe that's a better idea than this PR!

Even with that change, the re-ordering doesn't always speed up SMatrix chains. Here's my test; sometimes one order is faster, sometimes the other. Unless the cost calculation is taking as long as the matrix multiplication itself, this suggests that the "optimal" order isn't in fact optimal -- so a different cost model would be better for SMatrices?

julia> using StaticArrays, Random

julia> s1,s2,s3,s4,s5 = shuffle([2,3,5,7,11])
5-element Vector{Int64}:
  2
  3
  7
  5
 11

julia> a=@SMatrix rand(s1,s2); b=@SMatrix rand(s2,s3); c=@SMatrix rand(s3,s4); d=@SMatrix rand(s4,s5);

julia> @btime *($(Ref(a))[],$(Ref(b))[],$(Ref(c))[],$(Ref(d))[]);  # after 37898, this thinks about the sizes
  29.648 ns (0 allocations: 0 bytes)

julia> @btime (($(Ref(a))[] * $(Ref(b))[]) * $(Ref(c))[]) * $(Ref(d))[];  # simply left to right
  14.640 ns (0 allocations: 0 bytes)

@mateuszbaran
Collaborator

So maybe let's not worry too much for now about reordering multiplication for static matrices? Just eliminating the run-time size check would definitely be enough.

if I try @eval LinearAlgebra @inline function _quad_matmul(A,B,C,D) ... then the time difference in the first message above goes away. Maybe that's a better idea than this PR!

That indicates that the overhead may be just from the single additional call; it's hard to tell without looking at @code_typed. I think _quad_matmul can safely be @inline, since it's only called in one place, in a non-@inline function.

@mcabbott
Collaborator Author

mcabbott commented Jun 8, 2021

The @inline seems safe, but besides fixing the benchmark above, I wonder whether it would fix real-world equivalents, or whether * itself not inlining would reproduce the problem.

Just eliminating run-time size check

Whether the existing size check is a good idea overall will depend on the distribution of non-square cases "overall". My shuffle([2,3,5,7,11]) above is sort of designed to produce marginal cases (and I picked a bad run); if we think the real world has more 10×1 matrices etc., then the check will probably pay for itself. (It would of course be easy to remove it entirely, e.g. by overloading _quad_matmul.)
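A hedged sketch of that last option, demonstrated on a toy AbstractMatrix wrapper (MyMat is hypothetical; for StaticArrays the methods would take StaticMatrix instead). Note that _quad_matmul is internal to LinearAlgebra, so overloading it is fragile across Julia versions:

```julia
using LinearAlgebra

# Toy stand-in for a static matrix type (hypothetical, for illustration).
struct MyMat <: AbstractMatrix{Float64}
    data::Matrix{Float64}
end
Base.size(A::MyMat) = size(A.data)
Base.getindex(A::MyMat, i::Int, j::Int) = A.data[i, j]
Base.:*(A::MyMat, B::MyMat) = MyMat(A.data * B.data)

# Skip the size-based order calculation entirely: the 4-arg * for
# AbstractMatrix routes through _quad_matmul, so this overload makes
# chains of MyMat always multiply left-to-right.
LinearAlgebra._quad_matmul(A::MyMat, B::MyMat, C::MyMat, D::MyMat) = ((A*B)*C)*D
```

The same one-liner written against StaticMatrix would remove the run-time check for SMatrix chains without touching Base.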


3 participants