Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

add generalized dot product #32739

Merged
merged 35 commits into from
Sep 5, 2019
Merged

add generalized dot product #32739

merged 35 commits into from
Sep 5, 2019

Conversation

dkarrasch
Copy link
Member

@dkarrasch dkarrasch commented Jul 30, 2019

I went ahead and added a generalized dot product dot(x, A, y), as discussed in #16573. I shamelessly copied (and slightly adjusted) @simonbyrne's code for the completely sparse case #16573 (comment) and the dense case #16573 (comment), and added methods for structured matrices and sparse matrices/generic vectors. There are presumably a couple of things to do:

  • add tests

  • NEWS entry (I'm not sure, but I guess we would want to announce this feature)

  • further optimizations (edit: I tried to reduce memory access similar to the corresponding multiplication functions, such that I hope that avoiding allocating the intermediate multiplication result will pay off)

I'm not an expert in @simd and these kind of enhancements, so I'm happy to learn whether one would expect improvements by using such features.

This is written with the option of block matrices in mind, and that should be tested obviously.Since we need zero of eltypes for type promotion, how would one do that? We don't have StaticArrays to our disposal.

If I'm not mistaken, then this would close #16573. I don't see how one would leverage symmetry of A, or the fact that x === y.

@ararslan ararslan added the domain:linear algebra Linear algebra label Jul 30, 2019
@chethega
Copy link
Contributor

chethega commented Jul 31, 2019

I think we also want specializations for Transpose and Adjoint for better memory access patterns.

For example dot(x, A::Adjoint, y) = dot(y, A', x)' and something for dot(x, a::Transpose, y). I think we should write out the code instead of trying to be clever like above, especially since we probably need to write something for the transpose case anyway. I have not thought about possible ambiguity errors yet, so don't trust the above signature.

@stevengj
Copy link
Member

stevengj commented Jul 31, 2019

I think we also want specializations for Transpose and Adjoint for better memory access patterns.

This wouldn't hurt, but I'm not clear on how useful it would be. For any dot(x, A, y) that defines a true inner product you should have A == A', so there's not much point in doing dot(x, A', y).

@dkarrasch
Copy link
Member Author

@stevengj I had true inner products in mind when I had the square assumption earlier, but I guess we are implementing here the machinery for ternary multiplication in general, in which there is no square and no symmetry assumption?

@stevengj
Copy link
Member

Sure, the dot(x, A, y) function should work for general A. I'm just not sure how much trouble we should take to optimize nonsymmetric cases.

@stevengj
Copy link
Member

Probably should also have a specialized implementation for UniformScaling so that dot(x, I, y) calls dot(x,y)

@thchr
Copy link
Contributor

thchr commented Jul 31, 2019

Given the proposed function signature, this should probably use eachindex and axes rather than 1:n and size; otherwise, access will in fact not be inbounds if supplied with e.g. an OffsetArray, leading to arbitrary answers/segfaults.
For the general signature, the fix could be something like:

function dot(x::AbstractVector, A::AbstractMatrix, y::AbstractVector)
    (axes(x)..., axes(y)...) == axes(A) || throw(DimensionMismatch())

    T = promote_type(promote_type(eltype(x), eltype(A)), eltype(y))
    s = zero(T)
    @inbounds for j in eachindex(y)
        yj = y[j]
        temp = zero(T)
        @simd for i in eachindex(x)
            temp += adjoint(x[i]) * A[i,j]
        end
        s += temp * yj
    end
    return s
end

(The above also removes some of the early-termination calls; they don't really seem warranted to me)

Irregardless of the above, I noticed that this is not doing better than ordinary x'*A*y, time-wise, for large input. Specifically, if

x′ = rand(N); A′ = rand(N,M); y′ = rand(M)

Then, for N = M = 1000

@btime adjoint($x′)*$A′*$y′  # 193.201 μs (1 allocation: 7.94 KiB)
@btime dot($x′,$A′,$y′)      # 312.699 μs (0 allocations: 0 bytes)

Though it does better for (N,M) less than ~700. Still, there might be room for improvement.

More timings ...

For N = M = 2000:

@btime adjoint($x′)*$A′*$y′  # 1.340 ms (1 allocation: 15.75 KiB)
@btime dot($x′,$A′,$y′)      # 1.615 ms (0 allocations: 0 bytes)

For N = M = 4000

@btime adjoint($x′)*$A′*$y′  # 6.418 ms (2 allocations: 31.33 KiB)
@btime dot($x′,$A′,$y′)      # 7.083 ms (0 allocations: 0 bytes)

@dkarrasch
Copy link
Member Author

@thchr Thanks for your comments. You're right, the generic one will need careful optimization, because it competes with BLAS multiplications, which has some overhead due to the allocation of the intermediate result. For very large problems, as you show, this does not outweigh the performance advantage of BLAS multiplication. So, we will need to make sure we catch up with the pure Julia multiplication.

@dkarrasch
Copy link
Member Author

I got started with some tests for the generic case. More tests will follow, and maybe the structured versions also need a little update. I was including the above comments into the generic version. This is how it looks like now:

function dot(x::AbstractVector, A::AbstractMatrix, y::AbstractVector)
    (axes(x)..., axes(y)...) == axes(A) || throw(DimensionMismatch())
    T = typeof(dot(first(x), first(A), first(y)))
    zeroxA = zero(adjoint(first(x))*first(A))
    s = zero(T)
    @inbounds for j in eachindex(y)
        yj = y[j]
        if !iszero(yj)
            temp = zeroxA
            @simd for i in eachindex(x)
                temp += adjoint(x[i]) * A[i,j]
            end
            s += temp * yj
        end
    end
    return s
end

There is a little issue with this. I wrote it to make it work for both number and matrix eltypes. In the block matrix case, the generic 3-arg dot allocates each time it adds something to temp. I thought I'll add a corresponding version where the eltypes of x, A, and y are Numbers, but this introduces ambiguity. So I could restrict to number types and simplify the type stuff, or need to find a way to distinguish these cases and have temp += adjoint(x[i]) * A[i,j] not allocate in the block matrix case.

dkarrasch referenced this pull request in JuliaManifolds/Manifolds.jl Aug 1, 2019
@dkarrasch
Copy link
Member Author

Tests are passing! I won't have much/any time the upcoming weeks (due to exams, holidays 🏖 etc), so I'd like to suggest to keep this PR at the current level (modulo minor improvements). The tests cover the number eltype cases and pass. I think I wrote the code such that it should work with block matrices/vectors. However, I'd like to keep testing (and potentially required adjustments/fixes) for an independent PR. Compared to my initial commits, I have also taken care of minimizing memory access whenever possible.

Regarding the fully dense version and the comparison to plain multiplication, I'm fine with putting x'A*y as a fallback for now, or a simple size-dependent distinction. The goal here was to introduce dot with the 3-arg signature, to direct *(::Adjoint,::Matrix,::Vector) to it, and to provide implementations for the sparse and structured cases, for which interest was expressed by different people. Optimizing the fully dense version may require thorough investigation and benchmarking, and seems to me like warrant a separate PR.

If there is no hurry, I'm also fine with developing the mentioned features in this PR further as time permits.

@StefanKarpinski StefanKarpinski added the status:forget me not PRs that one wants to make sure aren't forgotten label Aug 2, 2019
@stevengj
Copy link
Member

stevengj commented Aug 3, 2019

I think if the user calls dot(x, A, y), then we should assume that they want an allocation-free algorithm even if it is slightly slower.

function dot(x::AbstractVector, A::AbstractMatrix, y::AbstractVector)
(axes(x)..., axes(y)...) == axes(A) || throw(DimensionMismatch())
T = typeof(dot(first(x), first(A), first(y)))
zeroxA = zero(adjoint(first(x))*first(A))
Copy link
Contributor

@thchr thchr Aug 5, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Something like

T = promote_op(dot, eltype(x), eltype(A), eltype(y))

and similar for zeroxA might be better since that gets inferred (the current formulation isn't, at least on v1.1.1; appears to incur some array-size testing so far as I can tell from @code_warntype).

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, I know. These two lines are written such that the code can handle block matrices/vectors, so eltypes like Vector or Matrix. The issue is that calling one or zero on them fails, because the type does not contain any information on the size. And also, the same "first" approach is taken in the

function dot(x::AbstractArray, y::AbstractArray)

method a few lines above. Of course, this approach still breaks down when we have blocks of varying lengths, so this may not be a good idea either. To handle that, we would need to initialize temp in every loop separately. I'll push a suggestion for how to fix that.

@dkarrasch
Copy link
Member Author

The one failing test "wasn't me". So this seems to be in rather good shape, I'd say. Is it realistic to include this into Julia v1.3? I think it would be nice to have that feature (if it is complete), and optimize it later if necessary.

I have consistently interpreted dot(x, A, y) as dot(A'x, y), which in most cases is nothing but dot(x, A*y) anyway, except for non-associative numbers. I felt that the interpretation as dot(A'x, y) really expresses what we're doing: (i) computing some dot, which acts recursively, in contrast to ternary *, (ii) we first multiply x and A, and only then with elements from y, (iii) but makes obvious what the implementation difference must be to dot(A'x, y), i.e., not storing A'x.

I have covered all special matrices (bidiag, tridiags, diagonal, uniformscaling, symmetric/hermitian, triangular), what is missing is (Upper)Hessenberg, if I haven't overlooked anything else. Tests cover the Number eltype cases, the generic version (but no other) is also tested for block matrices. The hermitian/symmetric case is slightly suboptimal in that we go over the diagonal twice too often, but I felt that writing out all 4 cases with :U, :L, x===y and x !== y would lead to "code bloat", as @stevengj would say.

@dkarrasch dkarrasch changed the title WIP: add generalized dot product add generalized dot product Aug 8, 2019
@dkarrasch
Copy link
Member Author

I'm close to pulling the plug for a while, so don't be surprised to see no response. UpperHessenberg now also has its own ternary dot method.

I'd be very happy to find the feature in Julia v1.3, of which feature freeze is in one week, or so?

@StefanKarpinski
Copy link
Sponsor Member

Correct: feature freeze is slated for Aug 15th.

@antoine-levitt
Copy link
Contributor

I think if the user calls dot(x, A, y), then we should assume that they want an allocation-free algorithm even if it is slightly slower.

Not sure this is a correct assumption; If I see that there's a specialized implementation for this, I'm going to assume that it's faster than the standard ones, or people wouldn't have bothered writing a specialized implementation. Especially in the common case where x and y are vectors and A is a dense matrix, the allocation of a temporary vector doesn't matter much. Also BLAS might use threading and therefore be (potentially much) faster. At least if this method has the potential to be slower than dot(x,A*y) it should be documented as such.

@stevengj stevengj added the needs news A NEWS entry is required for this change label Aug 16, 2019
@dkarrasch
Copy link
Member Author

I have tried to improve/fix a couple of things quickly via the web interface; thanks @stevengj for the review. I have also created a NEWS entry. Unfortunately, there is a little merge conflict which I don’t know how to resolve via the web interface. If someone could rebase, that would be much appreciated.

@andreasnoack andreasnoack merged commit 2425ae7 into JuliaLang:master Sep 5, 2019
@StefanKarpinski
Copy link
Sponsor Member

Man, who knew that was going to take so long? Thanks for sticking it out, @dkarrasch!

@pkofod
Copy link
Contributor

pkofod commented Sep 5, 2019

This request originally came from @cortner's work to add preconditioners to Optim.jl! Can't believe that was in 2016.

@dkarrasch
Copy link
Member Author

Well, thanks, and thanks also to the reviewers for their patience and care, especially @stevengj!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
domain:linear algebra Linear algebra needs news A NEWS entry is required for this change
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Extending inner products