
[WIP] Speed up dense-sparse matmul #38876

Merged: 5 commits merged into master from dk/sparsemul on Jan 12, 2021

Conversation

@dkarrasch (Member) commented Dec 14, 2020

This is a revival of the infamous #24045. Most of the multiplication code was already in good shape. Two of the features tried out in that PR were applying @simd to the innermost loop and using muladd. In a few cases I found an unfortunate memory access pattern, and in one or two cases multiplication by alpha from the wrong side (this matters only for non-commutative number types; see the short illustration at the end of this comment). Here's a quick benchmarking script:

using LinearAlgebra, SparseArrays, BenchmarkTools, Random
Random.seed!(1234)

A = randn(1000,1000);
B = sprandn(1000,1000, 0.05);
@btime $A*$B;
@btime $A'*$B;
@btime $A*$B';
@btime $A'*$B';
@btime $B*$A;
@btime $B'*$A;
@btime $B*$A';
@btime $B'*$A';

For each product, the first timing is nightly, the second is this PR:

julia> @btime $A*$B;
  82.415 ms (2 allocations: 7.63 MiB)
  16.712 ms (2 allocations: 7.63 MiB)

julia> @btime $A'*$B;
  67.930 ms (2 allocations: 7.63 MiB)
  69.376 ms (2 allocations: 7.63 MiB)

julia> @btime $A*$B';
  53.093 ms (2 allocations: 7.63 MiB)
  20.367 ms (2 allocations: 7.63 MiB)

julia> @btime $A'*$B';
  105.031 ms (2 allocations: 7.63 MiB)
  80.477 ms (2 allocations: 7.63 MiB)

julia> @btime $B*$A;
  56.173 ms (2 allocations: 7.63 MiB)
  51.509 ms (2 allocations: 7.63 MiB)

julia> @btime $B'*$A;
  51.708 ms (2 allocations: 7.63 MiB)
  33.794 ms (2 allocations: 7.63 MiB)

julia> @btime $B*$A';
  60.738 ms (2 allocations: 7.63 MiB)
  63.872 ms (2 allocations: 7.63 MiB)

julia> @btime $B'*$A';
  72.421 ms (2 allocations: 7.63 MiB)
  55.529 ms (2 allocations: 7.63 MiB)

@Sacha0's "old" tests seemed to indicate that, at the time, muladd was not helpful and perhaps @simd wasn't either. I believe, however, that the big boosts seen here in A*B are due to both the reordering of loops and @simd. Shall we have a nanosoldier run?

EDIT: updated timings. I see run-to-run variations of about ±2 ms, so the improvements here come from reordering loops and, in one case, a new @simd annotation. muladd doesn't seem to play any role here.
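
To make the point about multiplying by alpha "from the wrong side" concrete: the 5-argument mul!(C, A, B, α, β) computes A*B*α + C*β, and for non-commutative element types the side on which α is applied changes the result. A minimal illustration (not code from this PR), using 2×2 matrices as non-commuting "scalars":

using LinearAlgebra

x = [1 2; 3 4]   # a "number" that does not commute under multiplication
α = [0 1; 1 0]

x * α == α * x   # false: x*α == [2 1; 4 3], but α*x == [3 4; 1 2]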

@dkarrasch added the performance (Must go faster), domain:linear algebra (Linear algebra), and domain:arrays:sparse (Sparse arrays) labels on Dec 14, 2020
@ViralBShah (Member)

Yes, please do a nanosoldier run.

@dkarrasch (Member, Author)

This is now carefully checked for the quadratic cases. From a pure type point of view, this is as good as it gets, I believe. But maybe nanosoldier will tell us that we should also include size considerations; let's see. I don't think I can kick-start it myself. Can anybody help out here, please?

@oscardssmith (Member)

@nanosoldier runbenchmarks(ALL, vs=":master")
Let's find out if I have the perms to do this.

@dkarrasch (Member, Author)

Seems like it didn't work. Let me try:
@nanosoldier runbenchmarks(ALL, vs=":master")

@ViralBShah (Member)

I'm approving this since I had reviewed the original PR.

@nanosoldier (Collaborator)

Your benchmark job has completed - possible performance regressions were detected. A full report can be found here. cc @christopher-dG

@dkarrasch (this comment has been minimized)

@dkarrasch changed the title from "Speed up dense-sparse matmul" to "[WIP] Speed up dense-sparse matmul" on Dec 18, 2020
@dkarrasch (Member, Author)

I think my local benchmarks were flawed. I'm now learning how to do this properly with BaseBenchmarks.jl, and there seem to be some real regressions currently. Sorry for the noise. I'll need some time to try out a few things and to rebuild and rerun the benchmarks, so I've marked this as WIP.

@ViralBShah (Member)

Can always convert to a draft PR.

@ViralBShah (Member)

I wonder whether there's an update here, and whether this might be ready to make it into 1.6.

@dkarrasch (Member, Author)

Sorry, I confused/fooled myself a little bit, but now I think I see more clearly. The benchmarks show pretty clearly

  • that A_mul_B[q] for A dense and B sparse improved dramatically (by a factor of 2 up to 8),
  • that A[q]_mul_B[q] for A sparse and B dense is essentially unchanged,
  • that Aq_mul_Bq regressed.

The regression was probably due to my own modifications (I thought that rearranging the loops so that you walk through the dense factor in memory-optimal order would be beneficial, but it turns out that it's not), so I reverted that part. Let's have another (restricted) benchmark run.

By the way, this PR does not include the silent copy of transposed/adjoint dense factors. We could do that in the * methods, but I wouldn't do it in the mul! methods. That, however, would create the strange situation that * is noticeably faster than mul!, at the expense, obviously, of extra allocations. We should discuss elsewhere whether we want to do that.
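
For reference, a minimal sketch of what such a silent copy in a * method could look like; the function name is made up for illustration, and this is not part of the PR:

using LinearAlgebra, SparseArrays

# Hypothetical sketch: materialize the adjoint of the dense factor once, then hit
# the fast dense-times-sparse kernel. Faster wall time, but one extra dense allocation.
function adjdense_times_sparse(adjA::Adjoint{<:Any,<:AbstractMatrix}, B::SparseMatrixCSC)
    A = Matrix(adjA)   # densify A' into a plain, column-major Matrix
    return A * B
end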

@nanosoldier runbenchmarks("sparse", vs=":master")

@nanosoldier (Collaborator)

Your benchmark job has completed - possible performance regressions were detected. A full report can be found here. cc @christopher-dG

@dkarrasch (Member, Author)

This looks pretty good!!! The remaining few regressions are related to methods that haven't been changed functionally, unless I'm missing something.

@ViralBShah (Member)

> The regression was probably due to my own modifications (I thought that rearranging the loops so that you walk through the dense factor in memory-optimal order would be beneficial, but it turns out that it's not), so I reverted that part. Let's have another (restricted) benchmark run.

That is actually a bit surprising. I would have thought it would at least be as good.

@dkarrasch (Member, Author) left a comment

I pointed out the two sources of improvement. The other methods are just polished, but shouldn't change behavior at all.

Comment on lines +124 to +127
@inbounds for col in 1:size(A, 2), k in nzrange(A, col)
    Aiα = $t(nzv[k]) * α
    rvk = rv[k]
    @simd for multivec_col in 1:mX
@dkarrasch (Member, Author):

The loop order is the same as before, but the two lines computing Aiα and rvk have been hoisted out of the innermost loop, and the innermost loop is now annotated with @simd. According to nanosoldier, this results in a significant improvement.
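
For readers outside the diff context, here is a self-contained sketch of the same hoist-and-@simd pattern, written for the plain (non-transposed) dense-times-sparse case so that it runs outside the @eval block. The function name is invented and this is not the exact Base kernel:

using SparseArrays

function dense_sparse_muladd!(C::Matrix, X::Matrix, A::SparseMatrixCSC, α::Number)
    nzv = nonzeros(A)
    rv  = rowvals(A)
    mX  = size(X, 1)
    @inbounds for col in 1:size(A, 2), k in nzrange(A, col)
        Aiα = nzv[k] * α   # hoisted: does not depend on the inner loop index
        rvk = rv[k]        # hoisted: row index of the stored entry A[rvk, col]
        @simd for i in 1:mX
            C[i, col] += X[i, rvk] * Aiα   # stride-1 access in both X and C
        end
    end
    return C
end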

Comment on lines +92 to +99
if X isa StridedOrTriangularMatrix
    @inbounds for col in 1:size(A, 2), k in nzrange(A, col)
        Aiα = nzv[k] * α
        rvk = rv[k]
        @simd for multivec_row in 1:mX
            C[multivec_row, col] += X[multivec_row, rvk] * Aiα
        end
    end
@dkarrasch (Member, Author):

This branch is new: the loop order is rearranged for an optimal access pattern in both X and C, and the innermost loop is annotated with @simd. According to nanosoldier, this results in a significant improvement.
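
A quick sanity check for a kernel of this shape (using the hypothetical dense_sparse_muladd! sketch from the comment above) is to compare against multiplication by the densified sparse factor:

using LinearAlgebra, SparseArrays

X = randn(200, 300); A = sprandn(300, 100, 0.05)
C = zeros(200, 100)
dense_sparse_muladd!(C, X, A, 1.0)   # hypothetical sketch defined above
C ≈ X * Matrix(A)                    # expected to hold up to floating-point roundoff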

@dkarrasch (Member, Author)

I'll do a last nanosoldier run. Absent surprises, this is good to go.

@nanosoldier runbenchmarks("sparse", vs=":master")

@nanosoldier (Collaborator)

Your benchmark job has completed - possible performance regressions were detected. A full report can be found here. cc @christopher-dG

@dkarrasch (Member, Author) commented Jan 10, 2021

OK, nanosoldier, I want it all green, that's why I'm asking once more: last time, and then I'll let you go.

@nanosoldier runbenchmarks("sparse", vs=":master")

@dkarrasch (Member, Author)

@nanosoldier runbenchmarks("sparse", vs=":master")

@nanosoldier (Collaborator)

Your benchmark job has completed - possible performance regressions were detected. A full report can be found here. cc @christopher-dG

@dkarrasch (Member, Author)

OK, how do we proceed? Just merge? Are we going to include this in v1.6? I know it's very late in the release cycle, but this purely addresses performance: no new features, no change in behavior.

@ViralBShah (Member)

I think it is good to merge.

@KristofferC Is it ok to backport? I'm tagging it for now.

@ViralBShah added the backport 1.6 (Change should be backported to release-1.6) label on Jan 12, 2021
@ViralBShah merged commit a3369df into master on Jan 12, 2021
@ViralBShah deleted the dk/sparsemul branch on January 12, 2021 at 15:01
@KristofferC mentioned this pull request on Jan 19, 2021
KristofferC pushed a commit that referenced this pull request Jan 19, 2021
* Speed up dense-sparse matmul

* add one at-simd, minor edits

* improve A_mul_Bq for dense-sparse

* revert ineffective changes

* shift at-inbounds annotation

(cherry picked from commit a3369df)
@KristofferC removed the backport 1.6 (Change should be backported to release-1.6) label on Feb 1, 2021
KristofferC pushed a commit that referenced this pull request Feb 1, 2021
* Speed up dense-sparse matmul

* add one at-simd, minor edits

* improve A_mul_Bq for dense-sparse

* revert ineffective changes

* shift at-inbounds annotation

(cherry picked from commit a3369df)
ElOceanografo pushed a commit to ElOceanografo/julia that referenced this pull request May 4, 2021
* Speed up dense-sparse matmul

* add one at-simd, minor edits

* improve A_mul_Bq for dense-sparse

* revert ineffective changes

* shift at-inbounds annotation
staticfloat pushed a commit that referenced this pull request Dec 23, 2022
* Speed up dense-sparse matmul

* add one at-simd, minor edits

* improve A_mul_Bq for dense-sparse

* revert ineffective changes

* shift at-inbounds annotation

(cherry picked from commit a3369df)
Labels: domain:arrays:sparse (Sparse arrays), domain:linear algebra (Linear algebra), performance (Must go faster)
5 participants