Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Adding inplace multiplication for (unit)triangular matrices #36972

Merged
merged 8 commits into from
Nov 17, 2020

Conversation

mcognetta
Copy link
Contributor

@mcognetta mcognetta commented Aug 8, 2020

This PR closes #36828.

Adding in place mul! for (ε|Unit)(Upper|Lower)Triangular matrices.

Before, something like

julia> x = UpperTriangular(rand(3, 3))
3×3 UpperTriangular{Float64,Matrix{Float64}}:
 0.896774  0.0032823  0.133023
          0.842607   0.414514
                    0.98005

julia> A = UpperTriangular(rand(3, 3))
3×3 UpperTriangular{Float64,Matrix{Float64}}:
 0.42575  0.567843  0.19379
         0.137596  0.636028
                  0.959794

julia> mul!(x, A, A)

would fail with a method error, while

julia> mul!(rand(3, 3), A, A)

would work but destroy the underlying triangular structure.

Now,

julia> mul!(x, A, A)

works and maintains the triangular structure.

reverting an accidental doc change

reverting an accidental doc change

removing old version of function
@mcognetta
Copy link
Contributor Author

Gentle bump

@mcognetta
Copy link
Contributor Author

I was inspired by @bjack205 's JuliaCon talk (sorry for the random mention), so I changed a slice to a view to reduce allocations. It now allocates half has much as using a slice. Making the code

        view(C, :, i) = A*view(B, :, i)

seems to still allocate as much as the slicing. Hopefully someone with more experience with views will see this and explain.

@mcognetta
Copy link
Contributor Author

mcognetta commented Aug 31, 2020

I was able to reduce it to only 1 allocation using a preallocated vector (a la https://discourse.julialang.org/t/inplace-multiplication-by-a-square-matrix/1702/4), but in benchmarking I found the original version + @views to be the most performant.

For the original change:

julia> @btime mul!(x, A, A)
  10.779 ms (1000 allocations: 3.97 MiB)

For the first view change:

julia> @btime mul!(x, A, A)
  11.684 ms (1000 allocations: 3.97 MiB)

For the preallocated view change:

julia> @btime mul!(x, A, A)
  49.991 ms (1 allocation: 4.06 KiB)

For the @views change:

julia> @btime mul!(x, A, A)
  10.594 ms (500 allocations: 1.98 MiB)

@mcognetta
Copy link
Contributor Author

mcognetta commented Aug 31, 2020

Sorry, I am using this thread as my personal notebook. I can get it slightly faster (seems to scale at about 10% faster from my benchmarks) and with zero allocations using:

julia> function mul!(C::UpperTriangular, A::UpperTriangular, B::UpperTriangular)
           m, n = size(B, 1), size(B, 2)
           if m != size(A, 1)
               throw(DimensionMismatch("right hand side B needs first dimension of size $(size(A,1)), has size $m"))
           end
           if C === A || C === B
               throw(ArgumentError("output matrix must not be aliased with input matrix"))
           end
           @views for i in 1:n
               mul!(parent(C)[:, i], A, B[:, i])
           end
           return C
       end

Current version (slicing):

julia> @btime mul!(x, A, A)
  11.551 ms (1000 allocations: 3.97 MiB)

The above version:

julia> @btime mul!(x, A, A)
  9.076 ms (0 allocations: 0 bytes)

But this strikes me as hacky and highly dependant on triangular matrices being backed by a dense matrix and parent being an O(1) time operation. Without parent(C) I get a method error:

ERROR: MethodError: no method matching lmul!(::UpperTriangular{Float64,Matrix{Float64}}, ::SubArray{Float64,1,UpperTriangular{Float64,Matrix{Float64}},Tuple{Base.Slice{Base.OneTo{Int64}},Int64},false})

which stems from

mul!(C::AbstractVector , A::AbstractTriangular, B::AbstractVector) = lmul!(A, copyto!(C, B))

This can probably be fixed, but seems like more work than what should be done here.

Pinging @dkarrasch since you seem to be active in this area. Any thoughts?

@oxinabox
Copy link
Contributor

The docstring of mul! says that normally one should extend 5 arg mul!, and 3 arg mul! will redispatch to that.

It seems like it would not be too hard to bring this PR inline with that.

@mcognetta
Copy link
Contributor Author

Not ready for review anymore. Adding the 5-argument method causes a big slowdown due to allocations. I am working on it now.

Comment on lines 750 to 752
@views for i in 1:n
C[:, i] = alpha*A*B[:, i] + beta*C[:, i]
end
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What about

Suggested change
@views for i in 1:n
C[:, i] = alpha*A*B[:, i] + beta*C[:, i]
end
for (Bi, Ci) in zip(eachcol(B), eachcol(C))
mul!(Ci, A, Bi, alpha, beta)
end

eachcol, zip, and 5-arg mul! should all be essentially allocation free.

stdlib/LinearAlgebra/src/triangular.jl Outdated Show resolved Hide resolved
@testset "inplace mul of appropriate types should preserve triagular structure" begin
for elty1 in (Float32, Float64, ComplexF32, ComplexF64, Int)
for elty2 in (Float32, Float64, ComplexF32, ComplexF64, Int)
A = UpperTriangular(rand(elty1, 5, 5))
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

rand(Int,...) may yield very large integers, which may overflow upon multiplication and addition.

@dkarrasch
Copy link
Member

dkarrasch commented Oct 12, 2020

For 3-arg mul!, there is another option to fix this:

function lmul!(A::UpperTriangular, B::UpperTriangular)
    lmul!(A, parent(B))
    return B
end

etc. This overwrites also the hidden part of parent(B), but hey, it's a mutating function.

EDIT: I forgot to erase the hidden terms, so one would have to run tril! or triu! on parent(B) first.

@dkarrasch
Copy link
Member

Ok, so I did some more testing.

  1. So my eachcol proposal did improve the performance significantly, but implementing the 5-arg mul! columnwise is not a good idea. For both small and large matrices, this is outperformed by the currently called generic_matmat_mul!. I'd suggest to simply remove those methods.

  2. For the 3-arg mul!, I strongly suggest to do (in the big loop)

@eval mul!(C::$cty, A::$aty, B::$bty) = (lmul!(A, copyto!(parent(C), B)); C)

because this has then the chance to call BLAS triangular multiplication. That (a) solves the issue, and (b) is faster than A*B, because you avoid the allocation of the array behind C.

@dkarrasch dkarrasch added the domain:linear algebra Linear algebra label Oct 16, 2020
@dkarrasch
Copy link
Member

Sorry, @mcognetta, I couldn't withstand pushing some changes and simplifying the tests. They were pretty heavy. Now, the tests also check for the correct output type.

@dkarrasch
Copy link
Member

Bump?

@mcognetta
Copy link
Contributor Author

@dkarrasch sorry for the delay. Thanks for your changes. It seems they are substantially faster than what I tried earlier. I don't think there is much else to do?

@mcognetta
Copy link
Contributor Author

Added a few tests for the 5-arg version.

@dkarrasch
Copy link
Member

There is the gc-failure, which I'm afraid of ignoring. I thought rebasing onto current master should help.

@oxinabox
Copy link
Contributor

oxinabox commented Nov 2, 2020

Since this is changing 3 arg mul! and not 5 arg mul! do we need to add a condition into the 5-arg mul! to redispatch mul!(C, A, B, false, true) to use it?

@dkarrasch
Copy link
Member

There are two strange errors for which tests pass, though, and two clearly unrelated test failures that are known and tracked. I suggest to finally merge this.

@mcognetta
Copy link
Contributor Author

Shall we close this since it's merged?

@dkarrasch
Copy link
Member

Wait, no, it's not merged yet. I was seeing if anyone raised objections.

@dkarrasch dkarrasch merged commit ef1b6d3 into JuliaLang:master Nov 17, 2020
@mcognetta mcognetta deleted the inplace_triangular_mul branch November 17, 2020 21:18
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

No inplace multiplication of upper triangulars
3 participants