
Finish removing the BigInts from * for FD{Int128}! #94

Open

NHDaly wants to merge 23 commits into nhd-overflow-Int128 from nhd-Int128-fastmul-noallocs

Conversation

@NHDaly (Member) commented Jun 13, 2024

Finally implements the fast-multiplication optimization from #45! :)

This PR optimizes FixedDecimal{Int64} and FixedDecimal{Int128}. In the process, we also undid an earlier optimization that is no longer needed after Julia 1.8, which makes multiplication about 2x faster for the smaller int types as well! 🎉

This is a follow-up to #93, which introduces an Int256 type for widemul. However, even after that PR, the fldmod still required 2 BigInt allocations.

Now, this PR uses a custom implementation of the LLVM div-by-const optimization for (U)Int128 and for (U)Int256, which briefly widens to Int512 (😅), to perform the fldmod by the constant 10^f coefficient.
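
For context, here's a rough sketch of the div-by-const idea for the unsigned case. This is illustrative only, not the PR's actual code; the helper name, the use of `Val`, and the fixed 256-bit shift are assumptions for the example:

using BitIntegers  # provides Int256/UInt256 and Int512/UInt512

# fld(x, C) == (x * magic) >> shift holds for every UInt128 x as long as
# magic = ceil(2^shift / C) is computed exactly and shift = 2 * nbits(UInt128).
function fldmod_by_const_sketch(x::UInt128, ::Val{C}) where {C}
    shift = 256                                   # 2 * nbits(UInt128)
    magic = cld(UInt512(1) << shift, UInt512(C))  # exact ceil(2^shift / C)
    q = UInt128((UInt512(x) * magic) >> shift)    # the quotient, fld(x, C)
    return q, x - q * UInt128(C)                  # (fld, mod) with no division
end

Because C is a compile-time constant (here via `Val`), the compiler can fold `magic` away, leaving just a widening multiply, a shift, and a multiply-subtract at runtime; e.g. fldmod_by_const_sketch(UInt128(12345), Val(1000)) gives (12, 345).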

After this PR, FD multiply performance scales linearly with the number of bits: FD{Int128} has no allocations and is only ~2x slower than 64-bit. :) It also makes all other multiplications ~2x faster.

master:

julia> using FixedPointDecimals, BenchmarkTools

julia> @btime for _ in 1:10000 fd = fd * fd end setup = (fd = FixedDecimal{Int16,3}(1.234))
  83.750 μs (0 allocations: 0 bytes)
FixedDecimal{Int16,3}(0.000)

julia> @btime for _ in 1:10000 fd = fd * fd end setup = (fd = FixedDecimal{Int32,3}(1.234))
  84.916 μs (0 allocations: 0 bytes)
FixedDecimal{Int32,3}(1700943.280)

julia> @btime for _ in 1:10000 fd = fd * fd end setup = (fd = FixedDecimal{Int64,3}(1.234))
  249.083 μs (0 allocations: 0 bytes)
FixedDecimal{Int64,3}(4230510070790917.029)

julia> @btime for _ in 1:10000 fd = fd * fd end setup = (fd = FixedDecimal{Int128,3}(1.234))
  4.660 ms (248829 allocations: 4.70 MiB)
FixedDecimal{Int128,3}(-66726338547984585007169386718143307.324)

# unsigned

julia> @btime for _ in 1:10000 fd = fd * fd end setup = (fd = FixedDecimal{UInt16,3}(1.234))
  56.791 μs (0 allocations: 0 bytes)
FixedDecimal{UInt16,3}(0.000)

julia> @btime for _ in 1:10000 fd = fd * fd end setup = (fd = FixedDecimal{UInt32,3}(1.234))
  60.000 μs (0 allocations: 0 bytes)
FixedDecimal{UInt32,3}(4191932.283)

julia> @btime for _ in 1:10000 fd = fd * fd end setup = (fd = FixedDecimal{UInt64,3}(1.234))
  172.958 μs (0 allocations: 0 bytes)
FixedDecimal{UInt64,3}(16576189118051436.703)

julia> @btime for _ in 1:10000 fd = fd * fd end setup = (fd = FixedDecimal{UInt128,3}(1.234))
  5.408 ms (308621 allocations: 6.14 MiB)
FixedDecimal{UInt128,3}(303384805088638153637410092093845905.434)

After PR #93:

julia> @btime for _ in 1:10000 fd = fd * fd end setup = (fd = FixedDecimal{Int16,3}(1.234))
  83.750 μs (0 allocations: 0 bytes)
FixedDecimal{Int16,3}(0.000)

julia> @btime for _ in 1:10000 fd = fd * fd end setup = (fd = FixedDecimal{Int32,3}(1.234))
  85.000 μs (0 allocations: 0 bytes)
FixedDecimal{Int32,3}(1700943.280)

julia> @btime for _ in 1:10000 fd = fd * fd end setup = (fd = FixedDecimal{Int64,3}(1.234))
  248.958 μs (0 allocations: 0 bytes)
FixedDecimal{Int64,3}(4230510070790917.029)

julia> @btime for _ in 1:10000 fd = fd * fd end setup = (fd = FixedDecimal{Int128,3}(1.234))
  4.673 ms (160798 allocations: 3.22 MiB)
FixedDecimal{Int128,3}(-66726338547984585007169386718143307.324)

# unsigned

julia> @btime for _ in 1:10000 fd = fd * fd end setup = (fd = FixedDecimal{UInt16,3}(1.234))
  56.791 μs (0 allocations: 0 bytes)
FixedDecimal{UInt16,3}(0.000)

julia> @btime for _ in 1:10000 fd = fd * fd end setup = (fd = FixedDecimal{UInt32,3}(1.234))
  60.041 μs (0 allocations: 0 bytes)
FixedDecimal{UInt32,3}(4191932.283)

julia> @btime for _ in 1:10000 fd = fd * fd end setup = (fd = FixedDecimal{UInt64,3}(1.234))
  173.000 μs (0 allocations: 0 bytes)
FixedDecimal{UInt64,3}(16576189118051436.703)

julia> @btime for _ in 1:10000 fd = fd * fd end setup = (fd = FixedDecimal{UInt128,3}(1.234))
  4.750 ms (190708 allocations: 4.82 MiB)
FixedDecimal{UInt128,3}(303384805088638153637410092093845905.434)

After this PR:

julia> @btime for _ in 1:10000 fd = fd * fd end setup = (fd = FixedDecimal{Int16,3}(1.234))
  48.458 μs (0 allocations: 0 bytes)
FixedDecimal{Int16,3}(0.000)

julia> @btime for _ in 1:10000 fd = fd * fd end setup = (fd = FixedDecimal{Int32,3}(1.234))
  57.000 μs (0 allocations: 0 bytes)
FixedDecimal{Int32,3}(1700943.280)

julia> @btime for _ in 1:10000 fd = fd * fd end setup = (fd = FixedDecimal{Int64,3}(1.234))
  90.708 μs (0 allocations: 0 bytes)
FixedDecimal{Int64,3}(4230510070790917.029)

julia> @btime for _ in 1:10000 fd = fd * fd end setup = (fd = FixedDecimal{Int128,3}(1.234))
  180.125 μs (0 allocations: 0 bytes)
FixedDecimal{Int128,3}(-66726338547984585007169386718143307.324)

# unsigned

julia> @btime for _ in 1:10000 fd = fd * fd end setup = (fd = FixedDecimal{UInt16,3}(1.234))
  42.708 μs (0 allocations: 0 bytes)
FixedDecimal{UInt16,3}(0.000)

julia> @btime for _ in 1:10000 fd = fd * fd end setup = (fd = FixedDecimal{UInt32,3}(1.234))
  51.250 μs (0 allocations: 0 bytes)
FixedDecimal{UInt32,3}(4191932.283)

julia> @btime for _ in 1:10000 fd = fd * fd end setup = (fd = FixedDecimal{UInt64,3}(1.234))
  80.042 μs (0 allocations: 0 bytes)
FixedDecimal{UInt64,3}(16576189118051436.703)

julia> @btime for _ in 1:10000 fd = fd * fd end setup = (fd = FixedDecimal{UInt128,3}(1.234))
  162.417 μs (0 allocations: 0 bytes)
FixedDecimal{UInt128,3}(303384805088638153637410092093845905.434)

Finally implements the fast-multiplication optimization from
#45, but this
time for 128-bit FixedDecimals! :)

This is a follow-up to
#93, which
introduces an Int256 type for widemul. However, the fldmod still
required 2 BigInt allocations.

Now, this PR uses a custom implementation of the LLVM div-by-const
optimization for (U)Int256, which briefly widens to Int512 (😅) to
perform the fldmod by the constant 10^f coefficient.

This brings 128-bit FD multiply to the same performance as 64-bit. :)
@NHDaly requested review from Drvi and TotalVerb on June 13, 2024 02:28
@NHDaly (Member, Author) commented Jun 13, 2024

Huzzah! I finally feel like we can bring the ideas from #45 to this package, and it's still valuable. It turns out that LLVM already applies this optimization automatically for Int128 div-by-const (thus * for FixedDecimal{Int64} is already fast), but this PR adds support for FixedDecimal{Int128}, which needed a custom function for Int256 div-by-const. :)

See my note on FixedDecimal{Int64} support here:
#45 (comment)
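
(For the curious, an illustrative way to see that strength reduction for yourself, not part of the PR, is to inspect a plain division by a constant; on typical targets the generated IR uses a widening multiply plus shifts rather than a division instruction:)

julia> using InteractiveUtils

julia> div_by_1000(x::Int64) = x ÷ 1000
div_by_1000 (generic function with 1 method)

julia> @code_llvm debuginfo=:none div_by_1000(123)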

@NHDaly (Member, Author) commented Jun 13, 2024

Now that Julia has an effects system, this doesn't need @pure, so it's actually possible to do this!
And now that we've introduced BitIntegers.jl, the code is way simpler. 😊 Woohoo!
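
(A minimal sketch of what that combination enables; the helper name is illustrative, not the PR's actual code:)

using BitIntegers  # Int256/UInt256, Int512/UInt512

# Marking this :foldable (Julia 1.8+) lets the compiler constant-fold the
# widened 10^f coefficient at the call site, which is what `@pure` used to
# be (ab)used for.
Base.@assume_effects :foldable widened_coefficient(f::Int) = Int256(10)^f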

Resolved (now-outdated) review threads on test/runtests.jl, src/FixedPointDecimals.jl, and src/fldmod-by-const.jl.
@NHDaly force-pushed the nhd-Int128-fastmul-noallocs branch from 82eb32d to f2958ba on June 13, 2024 19:59
@NHDaly requested a review from omus on June 13, 2024 20:06
@NHDaly (Member, Author) commented Jun 13, 2024

Update 1: It turned out that (of course) a single * is too fast for @btime to measure reliably on its own. I've adjusted the benchmark to perform repeated multiplies, and the true performance numbers are now included in the original comment.

Update 2: It turns out that this is actually faster for all int types (>2x faster for FD{Int64}), even though LLVM was already doing a version of this optimization on its own. I'm honestly pretty surprised by that... From what I can tell, for FD{Int32} the code is identical except for two instruction changes: one changing the order of the registers passed to `add` (which shouldn't matter), and one changing a comparison from `hi` to `gt`:
[Screenshot: native-code diff for FD{Int32}, 2024-06-13]
And yet somehow it's 25% faster (80µs -> 60µs).

@NHDaly (Member, Author) commented Jun 13, 2024

Aha, the native-code difference comes from the whole benchmark function. Before, the multiply was outlined and reached through a function call. Now it's inlined, and I think the whole loop is constant-folded away, so it's actually sort of cheating. 😅 But still, allowing that optimization seems like a positive thing! The inlining is the main change, I think, for FD{Int32}.

julia> function bench(fd) 
           for _ in 1:10000
               fd = fd * fd
           end
           fd
       end
bench (generic function with 1 method)

# Before:

julia> @code_native debuginfo=:none bench(FixedDecimal{UInt32,3}(1.234))
┌ Warning: /Users/nathandaly/.julia/dev/FixedPointDecimals/src/fldmod-by-const.jl no longer exists, deleted all methods
└ @ Revise ~/.julia/packages/Revise/bAgL0/src/packagedef.jl:666
        .section        __TEXT,__text,regular,pure_instructions
        .build_version macos, 14, 0
        .globl  _julia_bench_1163               ; -- Begin function julia_bench_1163
        .p2align        2
_julia_bench_1163:                      ; @julia_bench_1163
; %bb.0:                                ; %guard_exit4
        sub     sp, sp, #48
        stp     x20, x19, [sp, #16]             ; 16-byte Folded Spill
        stp     x29, x30, [sp, #32]             ; 16-byte Folded Spill
        ldr     w0, [x0]
        mov     w19, #10000
Lloh0:
        adrp    x20, "_j_*_1165"@GOTPAGE
Lloh1:
        ldr     x20, [x20, "_j_*_1165"@GOTPAGEOFF]
LBB0_1:                                 ; %L2
                                        ; =>This Inner Loop Header: Depth=1
        str     w0, [sp, #12]
        add     x0, sp, #12
        add     x1, sp, #12
        blr     x20
        subs    x19, x19, #1
        b.ne    LBB0_1
; %bb.2:                                ; %guard_exit16
        ldp     x29, x30, [sp, #32]             ; 16-byte Folded Reload
        ldp     x20, x19, [sp, #16]             ; 16-byte Folded Reload
        add     sp, sp, #48
        ret
        .loh AdrpLdrGot Lloh0, Lloh1
                                        ; -- End function
.subsections_via_symbols

After:

julia> @code_native debuginfo=:none bench(FixedDecimal{UInt32,3}(1.234))
        .section        __TEXT,__text,regular,pure_instructions
        .build_version macos, 14, 0
        .globl  _julia_bench_1161               ; -- Begin function julia_bench_1161
        .p2align        2
_julia_bench_1161:                      ; @julia_bench_1161
; %bb.0:                                ; %guard_exit7
        ldr     w0, [x0]
        mov     w8, #10000
        mov     x9, #57148
        movk    x9, #36175, lsl #16
        movk    x9, #28311, lsl #32
        movk    x9, #33554, lsl #48
        mov     x10, #-1000
LBB0_1:                                 ; %L2
                                        ; =>This Inner Loop Header: Depth=1
        umull   x11, w0, w0
        umulh   x11, x11, x9
        lsr     x12, x11, #9
        mul     x13, x12, x10
        umaddl  x13, w0, w0, x13
        ubfx    x11, x11, #9, #1
        cmp     x13, #500
        cset    w13, hi
        csel    w11, w11, w13, eq
        add     w0, w11, w12
        subs    x8, x8, #1
        b.ne    LBB0_1
; %bb.2:                                ; %guard_exit19
        ret
                                        ; -- End function
.subsections_via_symbols

EDIT: And even with a runtime variable for the loop count, so the loop can't be optimized away, the new code is still faster by about the same margin:

julia> @noinline function bench(fd, N) 
           for _ in 1:N
               fd = fd * fd
           end
           fd
       end
bench (generic function with 2 methods)

julia> @btime bench($(FixedDecimal{UInt32,3}(1.234)), $10000)
  48.416 μs (0 allocations: 0 bytes)
FixedDecimal{UInt32,3}(4191932.283)

@omus (Contributor) left a comment

Did another surface-level review. I can do a full review if no one else will, but I'll need to block off some time, as there's a bit going on in this PR.

Resolved (now-outdated) review thread on src/FixedPointDecimals.jl.
@NHDaly (Member, Author) commented Jun 17, 2024

I think Tomas should be able to review when he's back from vacation. He's out till Thursday. :)
Thanks for the offer, @omus! :)

…change

Tests all FD{(U)Int16} values.
Tests most corner cases for FD{(U)Int128} values.
@NHDaly force-pushed the nhd-Int128-fastmul-noallocs branch from 44686f3 to ac3302e on June 19, 2024 19:16
We must use the precise value of 2^nbits(T) in order to get the correct
division in all cases.

....... UGH except now the Int64 tests aren't passing.
@NHDaly force-pushed the nhd-Int128-fastmul-noallocs branch from ac3302e to e4cb73b on June 19, 2024 19:16
Drvi and others added 6 commits June 25, 2024 19:41
LLVM _does_ do this automatically if you directly call `fldmod`.

Improve the comments as well.
… not fld(x,y),mod(x,y), even if y is a constant! :)

Improved that for julia versions 1.8 - 1.9
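
(Side note on that last point, as an illustrative snippet rather than the PR's code: `fldmod(x, y)` returns both results from one divrem-style computation, whereas spelling it as `fld(x, y)` and `mod(x, y)` separately asks for the division twice unless the compiler merges them.)

julia> fldmod(1234567, 1000)
(1234, 567)

julia> (fld(1234567, 1000), mod(1234567, 1000))
(1234, 567)
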
@NHDaly (Member, Author) commented Jul 10, 2024

Okay, after reviewing and integrating all of Tomas' changes and cleaning things up, I think this is a great improvement and is ready to go! :) Tomas and I are going to look through it one more time, but otherwise I think this is fully ready for review. Thanks!

@NHDaly (Member, Author) commented Jul 10, 2024

I also updated the PR comment with the final perf numbers, which look 👌
