
[WIP] Speed up dense-sparse matmul #38876

Merged: 5 commits merged into master from dk/sparsemul on Jan 12, 2021

Conversation

@dkarrasch (Member) commented Dec 14, 2020

This is a revival of the infamous #24045. Most of the multiplication code was already in good shape. Two of the features tried out in that PR were applying @simd to the innermost loop and using muladd. In a few cases I found an unfortunate memory access pattern, and in one or two cases multiplication by alpha from the wrong side (this matters only for non-commutative number types; see the short illustration at the end of this comment). Here's a quick benchmarking script:

using LinearAlgebra, SparseArrays, BenchmarkTools, Random
Random.seed!(1234)

A = randn(1000,1000);
B = sprandn(1000,1000, 0.05);
@btime $A*$B;
@btime $A'*$B;
@btime $A*$B';
@btime $A'*$B';
@btime $B*$A;
@btime $B'*$A;
@btime $B*$A';
@btime $B'*$A';

For each product, the first timing is nightly, the second is this PR:

julia> @btime $A*$B;
  82.415 ms (2 allocations: 7.63 MiB)
  16.712 ms (2 allocations: 7.63 MiB)

julia> @btime $A'*$B;
  67.930 ms (2 allocations: 7.63 MiB)
  69.376 ms (2 allocations: 7.63 MiB)

julia> @btime $A*$B';
  53.093 ms (2 allocations: 7.63 MiB)
  20.367 ms (2 allocations: 7.63 MiB)

julia> @btime $A'*$B';
  105.031 ms (2 allocations: 7.63 MiB)
  80.477 ms (2 allocations: 7.63 MiB)

julia> @btime $B*$A;
  56.173 ms (2 allocations: 7.63 MiB)
  51.509 ms (2 allocations: 7.63 MiB)

julia> @btime $B'*$A;
  51.708 ms (2 allocations: 7.63 MiB)
  33.794 ms (2 allocations: 7.63 MiB)

julia> @btime $B*$A';
  60.738 ms (2 allocations: 7.63 MiB)
  63.872 ms (2 allocations: 7.63 MiB)

julia> @btime $B'*$A';
  72.421 ms (2 allocations: 7.63 MiB)
  55.529 ms (2 allocations: 7.63 MiB)

@Sacha0's "old" tests seemed to indicate that, at the time, muladd was not helpful and perhaps @simd wasn't either. I believe, however, that the big boosts seen here in A*B are due to both the reordering of loops and @simd. Shall we have a nanosoldier run?

EDIT: updated timings. I see run-to-run variations of about ±2 ms, so the improvements here come from reordering loops and, in one case, a new @simd annotation. muladd doesn't seem to play any role here.
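
To make the point about multiplying by alpha "from the wrong side" concrete: the 5-argument mul!(C, A, B, α, β) computes A*B*α + C*β, and for non-commutative element types the side on which α is applied changes the result. A minimal illustration (not code from this PR), using 2×2 matrices as non-commuting "scalars":

using LinearAlgebra

x = [1 2; 3 4]   # a "number" that does not commute under multiplication
α = [0 1; 1 0]

x * α == α * x   # false: x*α == [2 1; 4 3], but α*x == [3 4; 1 2]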

@dkarrasch added the performance (Must go faster), domain:linear algebra (Linear algebra), and domain:arrays:sparse (Sparse arrays) labels on Dec 14, 2020
@ViralBShah (Member)

Yes, please do a nanosoldier run.

@dkarrasch (Member, Author)

This is now carefully checked for the quadratic cases. From a pure type point of view, this is as good as it gets, I believe. But maybe nanosoldier will tell us that we should also include size considerations; let's see. I don't think I can kick-start it myself. Can anybody help out here, please?

@oscardssmith (Member)

@nanosoldier runbenchmarks(ALL, vs=":master")
Let's find out if I have the perms to do this.

@dkarrasch (Member, Author)

Seems like it didn't work. Let me try:
@nanosoldier runbenchmarks(ALL, vs=":master")

@ViralBShah (Member)

I'm approving this since I had reviewed the original PR.

@nanosoldier (Collaborator)

Your benchmark job has completed - possible performance regressions were detected. A full report can be found here. cc @christopher-dG

@dkarrasch (this comment has been minimized)

@dkarrasch changed the title from "Speed up dense-sparse matmul" to "[WIP] Speed up dense-sparse matmul" on Dec 18, 2020
@dkarrasch (Member, Author)

I think my local benchmarks were flawed. I'm now learning how to do this properly with BaseBenchmarks.jl, and there seem to be some real regressions currently. Sorry for the noise. I'll need some time to try out a few things and to rebuild and rerun the benchmarks, so I've marked this as WIP.

@ViralBShah (Member)

Can always convert to a draft PR.

@ViralBShah (Member)

I wonder whether there's an update here, and whether this might be ready to make it into 1.6.

@dkarrasch (Member, Author)

Sorry, I confused/fooled myself a little bit, but now I think I see more clearly. The benchmarks show pretty clearly

  • that A_mul_B[q] for A dense and B sparse improved dramatically (by a factor of 2 up to 8),
  • that A[q]_mul_B[q] for A sparse and B dense is essentially unchanged,
  • that Aq_mul_Bq regressed.

The regression was probably due to my own modifications (I thought that rearranging the loops so that you walk through the dense factor in memory-optimal order would be beneficial, but it turns out that it's not), so I reverted that part. Let's have another (restricted) benchmark run.

By the way, this PR does not include the silent copy of transposed/adjoint dense factors. We could do that in the * methods, but I wouldn't do it in the mul! methods. That, however, would create the strange situation that * is noticeably faster than mul!, at the expense, obviously, of extra allocations. We should discuss elsewhere whether we want to do that.
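
For reference, a minimal sketch of what such a silent copy in a * method could look like; the function name is made up for illustration, and this is not part of the PR:

using LinearAlgebra, SparseArrays

# Hypothetical sketch: materialize the adjoint of the dense factor once, then hit
# the fast dense-times-sparse kernel. Faster wall time, but one extra dense allocation.
function adjdense_times_sparse(adjA::Adjoint{<:Any,<:AbstractMatrix}, B::SparseMatrixCSC)
    A = Matrix(adjA)   # densify A' into a plain, column-major Matrix
    return A * B
end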

@nanosoldier runbenchmarks("sparse", vs=":master")

@nanosoldier (Collaborator)

Your benchmark job has completed - possible performance regressions were detected. A full report can be found here. cc @christopher-dG

@dkarrasch (Member, Author)

This looks pretty good!!! The remaining few regressions are related to methods that haven't been changed functionally, unless I'm missing something.

@ViralBShah (Member)

> The regression was probably due to my own modifications (I thought that rearranging the loops so that you walk through the dense factor in memory-optimal order would be beneficial, but it turns out that it's not), so I reverted that part. Let's have another (restricted) benchmark run.

That is actually a bit surprising. I would have thought it would at least be as good.

@dkarrasch (Member, Author) left a comment

I pointed out the two sources of improvement. The other methods are just polished, but shouldn't change behavior at all.

Comment on lines +124 to +127
@inbounds for col in 1:size(A, 2), k in nzrange(A, col)
    Aiα = $t(nzv[k]) * α
    rvk = rv[k]
    @simd for multivec_col in 1:mX
@dkarrasch (Member, Author):

The loop order is the same as before, but the two lines computing Aiα and rvk have been hoisted out of the innermost loop, and the innermost loop is now annotated with @simd. According to nanosoldier, this results in a significant improvement.
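
For readers outside the diff context, here is a self-contained sketch of the same hoist-and-@simd pattern, written for the plain (non-transposed) dense-times-sparse case so that it runs outside the @eval block. The function name is invented and this is not the exact Base kernel:

using SparseArrays

function dense_sparse_muladd!(C::Matrix, X::Matrix, A::SparseMatrixCSC, α::Number)
    nzv = nonzeros(A)
    rv  = rowvals(A)
    mX  = size(X, 1)
    @inbounds for col in 1:size(A, 2), k in nzrange(A, col)
        Aiα = nzv[k] * α   # hoisted: does not depend on the inner loop index
        rvk = rv[k]        # hoisted: row index of the stored entry A[rvk, col]
        @simd for i in 1:mX
            C[i, col] += X[i, rvk] * Aiα   # stride-1 access in both X and C
        end
    end
    return C
end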

Comment on lines +92 to +99
if X isa StridedOrTriangularMatrix
    @inbounds for col in 1:size(A, 2), k in nzrange(A, col)
        Aiα = nzv[k] * α
        rvk = rv[k]
        @simd for multivec_row in 1:mX
            C[multivec_row, col] += X[multivec_row, rvk] * Aiα
        end
    end
@dkarrasch (Member, Author):

This branch is new: the loop order is rearranged for an optimal access pattern in both X and C, and the innermost loop is annotated with @simd. According to nanosoldier, this results in a significant improvement.
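
A quick sanity check for a kernel of this shape (using the hypothetical dense_sparse_muladd! sketch from the comment above) is to compare against multiplication by the densified sparse factor:

using LinearAlgebra, SparseArrays

X = randn(200, 300); A = sprandn(300, 100, 0.05)
C = zeros(200, 100)
dense_sparse_muladd!(C, X, A, 1.0)   # hypothetical sketch defined above
C ≈ X * Matrix(A)                    # expected to hold up to floating-point roundoff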

@dkarrasch (Member, Author)

I'll do a last nanosoldier run. Absent surprises, this is good to go.

@nanosoldier runbenchmarks("sparse", vs=":master")

@nanosoldier (Collaborator)

Your benchmark job has completed - possible performance regressions were detected. A full report can be found here. cc @christopher-dG

@dkarrasch (Member, Author) commented Jan 10, 2021

OK, nanosoldier, I want it all green, that's why I'm asking once more: last time, and then I'll let you go.

@nanosoldier runbenchmarks("sparse", vs=":master")

@dkarrasch (Member, Author)

@nanosoldier runbenchmarks("sparse", vs=":master")

@nanosoldier (Collaborator)

Your benchmark job has completed - possible performance regressions were detected. A full report can be found here. cc @christopher-dG

@dkarrasch (Member, Author)

OK, how do we proceed? Just merge? Are we going to include this in v1.6? I know it's very late in the release cycle, but this purely addresses performance: no new features, no change in behavior.

@ViralBShah (Member)

I think it is good to merge.

@KristofferC Is it ok to backport? I'm tagging it for now.

@ViralBShah added the backport 1.6 (Change should be backported to release-1.6) label on Jan 12, 2021
@ViralBShah merged commit a3369df into master on Jan 12, 2021
@ViralBShah deleted the dk/sparsemul branch on January 12, 2021 at 15:01
@KristofferC mentioned this pull request on Jan 19, 2021
KristofferC pushed a commit that referenced this pull request Jan 19, 2021
* Speed up dense-sparse matmul

* add one at-simd, minor edits

* improve A_mul_Bq for dense-sparse

* revert ineffective changes

* shift at-inbounds annotation

(cherry picked from commit a3369df)
@KristofferC removed the backport 1.6 (Change should be backported to release-1.6) label on Feb 1, 2021
KristofferC pushed a commit that referenced this pull request Feb 1, 2021
* Speed up dense-sparse matmul

* add one at-simd, minor edits

* improve A_mul_Bq for dense-sparse

* revert ineffective changes

* shift at-inbounds annotation

(cherry picked from commit a3369df)
ElOceanografo pushed a commit to ElOceanografo/julia that referenced this pull request May 4, 2021
* Speed up dense-sparse matmul

* add one at-simd, minor edits

* improve A_mul_Bq for dense-sparse

* revert ineffective changes

* shift at-inbounds annotation
staticfloat pushed a commit that referenced this pull request Dec 23, 2022
* Speed up dense-sparse matmul

* add one at-simd, minor edits

* improve A_mul_Bq for dense-sparse

* revert ineffective changes

* shift at-inbounds annotation

(cherry picked from commit a3369df)
Labels: domain:arrays:sparse (Sparse arrays), domain:linear algebra (Linear algebra), performance (Must go faster)
5 participants