Use @simd in _vecdot #603

Merged: tkoolen merged 4 commits into JuliaArrays:master from tkoolen:tk/dot-simdloop on May 1, 2019

Conversation

@tkoolen (Contributor) commented Apr 30, 2019

Ref #512. Use @simd in _vecdot instead of manually unrolling and relying on the SLP vectorizer.

I noticed that the current implementation never uses 256-bit-wide instructions (no ymm registers are used), while the @simd version does. Note that because @simd is combined with the statically known size of the array, the loop is fully unrolled up to a reasonable threshold even without using a generated function.
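
For reference, a minimal sketch of the two strategies (not the actual src/linalg.jl code; the names `unrolled_dot` and `simd_dot` are made up here, and edge cases such as empty vectors, mismatched element types, and inlining annotations are omitted):

```julia
using StaticArrays
using LinearAlgebra

# Before (simplified): a @generated function builds one fully unrolled
# expression and relies on LLVM's SLP vectorizer to combine the scalar ops.
@generated function unrolled_dot(a::StaticVector{N}, b::StaticVector{N}) where {N}
    expr = :(adjoint(a[1]) * b[1])
    for j = 2:N
        expr = :($expr + adjoint(a[$j]) * b[$j])
    end
    return :(@inbounds $expr)
end

# After (simplified): an explicit zero accumulator plus an @simd loop. Since N
# is known at compile time, the loop is still fully unrolled up to a threshold,
# and the compiler is free to use 256-bit (ymm) instructions.
function simd_dot(a::StaticVector{N}, b::StaticVector{N}) where {N}
    s = zero(adjoint(a[1]) * b[1])
    @inbounds @simd for j = 1:N
        s += adjoint(a[j]) * b[j]
    end
    return s
end

x = rand(SVector{8, Float64}); y = rand(SVector{8, Float64})
unrolled_dot(x, y) ≈ simd_dot(x, y) ≈ dot(x, y)   # true
```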

Benchmark 1

The naive benchmark, though I'm doubtful that it gives accurate results:

```julia
using StaticArrays
using LinearAlgebra
using BenchmarkTools

for n in [0 : 32; 100]
    # println(n)
    b = @benchmark dot(x, y) setup = begin
        x = rand(SVector{$n, Float64})
        y = rand(SVector{$n, Float64})
        # x = rand($n)
        # y = rand($n)
    end
    println(minimum(b.times))
end
```

Results:

| n | Vector (ns) | before (ns) | after (ns) | Improvement |
|-----|--------|--------|--------|------|
| 0 | 6.143 | 0.033 | 0.033 | 0% |
| 1 | 10.051 | 1.504 | 1.548 | -3% |
| 2 | 10.531 | 1.744 | 1.549 | 11% |
| 3 | 10.767 | 1.694 | 1.504 | 11% |
| 4 | 11.516 | 1.651 | 1.663 | -1% |
| 5 | 11.728 | 1.654 | 1.700 | -3% |
| 6 | 12.198 | 1.943 | 1.804 | 7% |
| 7 | 12.441 | 2.034 | 1.925 | 5% |
| 8 | 12.914 | 2.358 | 1.734 | 26% |
| 9 | 13.383 | 2.830 | 1.818 | 36% |
| 10 | 13.868 | 3.257 | 1.925 | 41% |
| 11 | 14.119 | 3.567 | 2.312 | 35% |
| 12 | 14.601 | 4.034 | 2.164 | 46% |
| 13 | 15.093 | 4.507 | 2.261 | 50% |
| 14 | 15.765 | 4.912 | 2.484 | 49% |
| 15 | 16.209 | 5.330 | 2.859 | 46% |
| 16 | 12.557 | 5.743 | 2.190 | 62% |
| 17 | 13.299 | 6.618 | 2.230 | 66% |
| 18 | 14.031 | 7.334 | 2.759 | 62% |
| 19 | 14.530 | 8.087 | 2.963 | 63% |
| 20 | 15.095 | 8.212 | 2.887 | 65% |
| 21 | 16.024 | 8.853 | 2.706 | 69% |
| 22 | 16.920 | 9.457 | 3.354 | 65% |
| 23 | 17.576 | 9.724 | 3.687 | 62% |
| 24 | 18.084 | 10.118 | 3.150 | 69% |
| 25 | 19.034 | 10.666 | 2.946 | 72% |
| 26 | 19.785 | 11.177 | 3.728 | 67% |
| 27 | 20.078 | 11.632 | 3.906 | 66% |
| 28 | 20.881 | 12.372 | 4.025 | 67% |
| 29 | 21.931 | 13.560 | 3.736 | 72% |
| 30 | 22.603 | 14.221 | 4.060 | 71% |
| 31 | 23.542 | 15.352 | 5.525 | 64% |
| 32 | 13.158 | 16.121 | 5.030 | 69% |
| 100 | 23.990 | 67.351 | 12.907 | 81% |

The Vector column corresponds to using plain Vectors as a baseline. Note that for n = 1 and n = 2, the code_native output is the same before and after.

Benchmark 2

Here's perhaps a more realistic usage scenario:

```julia
using StaticArrays
using LinearAlgebra
using BenchmarkTools

for n in [0 : 32; 100]
    # println(n)
    vec_length_max = 100_000
    m = vec_length_max ÷ max(1, n)
    xs = [rand(SVector{n, Float64}) for _ = 1 : m]
    ys = [rand(SVector{n, Float64}) for _ = 1 : m]
    # xs = [rand(n) for _ = 1 : m]
    # ys = [rand(n) for _ = 1 : m]
    zs = zeros(m)
    b = @benchmark begin
        @inbounds $zs .= dot.($xs, $ys)
    end evals=1
    println(minimum(b.times) / m)
end
```

Results:

| n | Vector (ns) | before (ns) | after (ns) | Improvement |
|-----|--------|--------|--------|------|
| 0 | 6.143 | 0.140 | 0.139 | 0% |
| 1 | 10.051 | 0.368 | 0.367 | 0% |
| 2 | 10.531 | 0.599 | 0.583 | 3% |
| 3 | 10.767 | 1.792 | 1.732 | 3% |
| 4 | 11.516 | 2.269 | 2.213 | 2% |
| 5 | 11.728 | 2.995 | 2.899 | 3% |
| 6 | 12.198 | 3.502 | 2.160 | 38% |
| 7 | 12.441 | 4.260 | 2.492 | 41% |
| 8 | 12.914 | 3.671 | 3.572 | 3% |
| 9 | 13.383 | 3.902 | 3.711 | 5% |
| 10 | 13.868 | 4.384 | 4.093 | 7% |
| 11 | 14.119 | 4.725 | 4.482 | 5% |
| 12 | 14.601 | 5.188 | 5.074 | 2% |
| 13 | 15.093 | 5.543 | 5.453 | 2% |
| 14 | 15.765 | 6.303 | 6.013 | 5% |
| 15 | 16.209 | 6.938 | 6.785 | 2% |
| 16 | 12.557 | 7.364 | 7.564 | -3% |
| 17 | 13.299 | 7.484 | 7.338 | 2% |
| 18 | 14.031 | 8.233 | 8.065 | 2% |
| 19 | 14.530 | 8.948 | 8.922 | 0% |
| 20 | 15.095 | 9.577 | 9.477 | 1% |
| 21 | 16.024 | 10.023 | 9.836 | 2% |
| 22 | 16.920 | 10.552 | 10.419 | 1% |
| 23 | 17.576 | 11.297 | 10.843 | 4% |
| 24 | 18.084 | 11.569 | 11.685 | -1% |
| 25 | 19.034 | 12.052 | 11.921 | 1% |
| 26 | 19.785 | 12.598 | 12.459 | 1% |
| 27 | 20.078 | 13.092 | 13.227 | -1% |
| 28 | 20.881 | 13.688 | 14.067 | -3% |
| 29 | 21.931 | 14.337 | 14.959 | -4% |
| 30 | 22.603 | 15.488 | 10.679 | 31% |
| 31 | 23.542 | 16.047 | 13.199 | 18% |
| 32 | 13.158 | 16.854 | 10.547 | 37% |
| 100 | 23.990 | 70.075 | 41.048 | 41% |

Note that n = 30 is the point at which the loop is no longer fully unrolled.

Example code_native difference (n = 6)

Before:

```

	.section	__TEXT,__text,regular,pure_instructions
; ┌ @ linalg.jl:217 within `dot'
; │┌ @ linalg.jl:219 within `_vecdot'
; ││┌ @ linalg.jl:230 within `macro expansion'
; │││┌ @ linalg.jl:217 within `*'
	vmovsd	(%edi), %xmm0           ## xmm0 = mem[0],zero
	vmulsd	(%esi), %xmm0, %xmm0
	vmovsd	8(%edi), %xmm1          ## xmm1 = mem[0],zero
	vmulsd	8(%esi), %xmm1, %xmm1
	vmovsd	16(%edi), %xmm2         ## xmm2 = mem[0],zero
	vmulsd	16(%esi), %xmm2, %xmm2
; ││└└
; ││┌ @ float.jl:395 within `macro expansion'
	vaddsd	%xmm1, %xmm0, %xmm0
	vaddsd	%xmm2, %xmm0, %xmm0
; ││└
; ││┌ @ linalg.jl:230 within `macro expansion'
; │││┌ @ float.jl:399 within `*'
	vmovsd	24(%edi), %xmm1         ## xmm1 = mem[0],zero
	vmulsd	24(%esi), %xmm1, %xmm1
; │││└
; │││┌ @ float.jl:395 within `+'
	vaddsd	%xmm1, %xmm0, %xmm0
; │││└
; │││┌ @ float.jl:399 within `*'
	vmovupd	32(%edi), %xmm1
	vmulpd	32(%esi), %xmm1, %xmm1
; │││└
; │││┌ @ float.jl:395 within `+'
	vaddsd	%xmm1, %xmm0, %xmm0
	vpermilpd	$1, %xmm1, %xmm1 ## xmm1 = xmm1[1,0]
	vaddsd	%xmm1, %xmm0, %xmm0
; │└└└
	retl
	nopl	(%eax,%eax)
;
```

After:

```

	.section	__TEXT,__text,regular,pure_instructions
; ┌ @ linalg.jl:217 within `dot'
; │┌ @ linalg.jl:222 within `_vecdot'
; ││┌ @ simdloop.jl:73 within `macro expansion' @ linalg.jl:223
; │││┌ @ linalg.jl:217 within `*'
	vmovupd	(%edi), %ymm0
	vmulpd	(%esi), %ymm0, %ymm0
; │││└
; │││┌ @ float.jl:395 within `+'
	vextractf128	$1, %ymm0, %xmm1
	vaddpd	%ymm1, %ymm0, %ymm0
	vhaddpd	%ymm0, %ymm0, %ymm0
; │││└
; │││ @ simdloop.jl:73 within `macro expansion' @ float.jl:399
	vmovsd	32(%edi), %xmm1         ## xmm1 = mem[0],zero
	vmovsd	40(%edi), %xmm2         ## xmm2 = mem[0],zero
	vfmadd231sd	32(%esi), %xmm1, %xmm0
	vfmadd231sd	40(%esi), %xmm2, %xmm0
; │└└
	vzeroupper
	retl
;
```

@coveralls commented Apr 30, 2019

Coverage decreased (-0.06%) to 81.114% when pulling 90ffc02 on tkoolen:tk/dot-simdloop into ec26f0a on JuliaArrays:master.

src/linalg.jl (outdated)

```julia
return quote
@_inline_meta
@inbounds return $expr
za = zero(a[1])
```
Member:

Have you checked if this still works for block vectors, i.e. an SVector with Vector{<:Number} elements? (Part of the advantage of unrolling in this case was that you don't need to start with a "zero" element.)

Contributor Author:

Yes, that case is already being tested for:

```julia
@test @inferred(dot(@SVector[[1,2],[3,4]], @SVector[[1,2],[3,4]])) === 30
```

Contributor Author:

Reverting to the `2:prod(S)` loop results in the same performance in benchmark 2 up to n = 30 (the point at which the loop is no longer unrolled). After that point, there's a significant performance drop compared to the PR branch (up to 2x for n = 32!).

@KristofferC (Contributor) commented Apr 30, 2019

Tangential ref: #494.

@c42f (Member) left a comment:

Looks good to me.

src/linalg.jl (outdated)

```julia
expr = :(adjoint(a[1]) * b[1])
for j = 2:prod(S)
expr = :($expr + adjoint(a[$j]) * b[$j])
return zero(promote_op(*, eltype(a), eltype(b)))
```
Member:

Would this be better as `zero(one(eltype(a)) * one(eltype(b)))`?
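
For context, a hypothetical illustration of the difference between the two accumulator constructions, using made-up example values rather than anything from the PR:

```julia
using StaticArrays

a = SVector(1, 2, 3)          # eltype Int
b = SVector(1.0, 2.0, 3.0)    # eltype Float64

# Existing approach: ask inference for the product type, then take its zero.
zero(Base.promote_op(*, eltype(a), eltype(b)))   # 0.0

# Suggested alternative: multiply the multiplicative identities and take the
# zero of the resulting value.
zero(one(eltype(a)) * one(eltype(b)))            # 0.0
```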

Contributor Author:

I just copied this line over unmodified, but I'm happy to change it. How about the changes I pushed?

@tkoolen (Contributor, Author) commented Apr 30, 2019

> Tangential ref: #494.

Ah right, I'd forgotten about that PR; it's a little more than tangentially related. Well, this one's up to date with master, and I think the loop-range change and getting rid of @generated are slight improvements.

@c42f (Member) commented Apr 30, 2019

Looks nice and clean to me, provided the optimizer is happy with that modification 👍

@tkoolen (Contributor, Author) commented Apr 30, 2019

I'll leave this open for comments for another day and then I'll merge. Thanks for the quick reviews, everybody!

After that, perhaps we can try the other reduction operations from #494 again using this approach.

@tkoolen (Contributor, Author) commented Apr 30, 2019

I decided to expand the scope of this PR just a little bit to include bilinear_vecdot, since it's so similar. There was in fact a lot of code duplication, so I merged _vecdot with _bilinear_vecdot. I checked that benchmark results for dot are still the same.

In doing so, I was wondering whether `(x, y) -> adjoint(x) * y` shouldn't just be replaced with `dot`, since that's what the generic `dot` implementation in Base does: https://github.com/JuliaLang/julia/blob/b4e1d0eef9c8f2fe3a6e43847bb96cceeac4f4fc/stdlib/LinearAlgebra/src/generic.jl#L758.
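
A hypothetical sketch of the shared-kernel idea (the names below are invented for illustration; the merged implementation in src/linalg.jl may differ in details):

```julia
using StaticArrays
using LinearAlgebra

# One @simd reduction kernel, parameterized by the pairwise product.
@inline function _vecdot_sketch(a::StaticVector{N}, b::StaticVector{N}, product) where {N}
    N == 0 && return zero(one(eltype(a)) * one(eltype(b)))
    s = zero(product(a[1], b[1]))
    @inbounds @simd for j = 1:N
        s += product(a[j], b[j])
    end
    return s
end

# dot conjugates the first argument; the bilinear variant does not.
dot_sketch(a::StaticVector, b::StaticVector)      = _vecdot_sketch(a, b, (x, y) -> adjoint(x) * y)
bilinear_sketch(a::StaticVector, b::StaticVector) = _vecdot_sketch(a, b, (x, y) -> x * y)
```

Passing the product in as an argument keeps the @simd loop in one place while letting the two entry points choose between `adjoint(x) * y` and `x * y`.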

@c42f (Member) commented May 1, 2019

Beautiful!

@c42f (Member) commented May 1, 2019

CI errors seem related to latest julia, not this PR?

@tkoolen (Contributor, Author) commented May 1, 2019

Yes, the cron job build of master also fails on Julia nightly.
https://travis-ci.org/JuliaArrays/StaticArrays.jl/builds/526388464

@tkoolen merged commit 407c65f into JuliaArrays:master on May 1, 2019
@tkoolen deleted the tk/dot-simdloop branch on May 1, 2019 at 19:39
@c42f mentioned this pull request on Jun 6, 2019
@c42f mentioned this pull request on Aug 17, 2019