
faster hypot for Float32 and Float16 #42122

Merged 6 commits into JuliaLang:master · Nov 10, 2021

Conversation

@Moelf (Contributor) commented on Sep 5, 2021

Before:

julia> @benchmark hypot(x,y) setup=(begin x,y = rand(Float32, 2) end) evals=10000
BenchmarkTools.Trial: 10000 samples with 10000 evaluations.
 Range (min … max):  3.051 ns … 8.185 ns  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     5.700 ns             ┊ GC (median):    0.00%
 Time  (mean ± σ):   5.616 ns ± 0.208 ns  ┊ GC (mean ± σ):  0.00% ± 0.00%

                                          ▁▇▇  ▂▇█▅▁▂▂▂     ▂
  ▄▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁████▇█████████████ █
  3.05 ns     Histogram: log(frequency) by time     6.27 ns <

 Memory estimate: 0 bytes, allocs estimate: 0.

julia> @benchmark hypot(x,y) setup=(begin x,y = rand(Float16, 2) end) evals=10000
BenchmarkTools.Trial: 10000 samples with 10000 evaluations.
 Range (min … max):  3.133 ns … 10.518 ns  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     6.265 ns              ┊ GC (median):    0.00%
 Time  (mean ± σ):   6.229 ns ±  0.280 ns  ┊ GC (mean ± σ):  0.00% ± 0.00%

                                          ▇▆▃▆▁▁█▄▇▂▄▂▁▁▁    ▂
  ▆▅▁▁▁▁▁▄▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▄█████████████████▇ █
  3.13 ns      Histogram: log(frequency) by time     7.06 ns <

 Memory estimate: 0 bytes, allocs estimate: 0.

After:

julia> @benchmark _hypot(x,y) setup=(begin x,y = rand(Float32, 2) end) evals=10000
BenchmarkTools.Trial: 10000 samples with 10000 evaluations.
 Range (min … max):  2.161 ns … 5.114 ns  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     2.188 ns             ┊ GC (median):    0.00%
 Time  (mean ± σ):   2.197 ns ± 0.082 ns  ┊ GC (mean ± σ):  0.00% ± 0.00%

  ▂ ▂ ▇█ ▂                                                  ▁
  █▁█▃██▁█▄▃█▃▁▄▃▁▁▁▁▅▃▁▃▁▃▃▄▄▁▁▁▃▁▃▁▄▄▄▃▅▃▄▄▄▄▄▅▃▄▄▆▅▄▄▄▁▅ █
  2.16 ns     Histogram: log(frequency) by time     2.47 ns <

 Memory estimate: 0 bytes, allocs estimate: 0.

julia> @benchmark _hypot(x,y) setup=(begin x,y = rand(Float16, 2) end) evals=10000
BenchmarkTools.Trial: 10000 samples with 10000 evaluations.
 Range (min … max):  2.547 ns … 3.631 ns  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     2.620 ns             ┊ GC (median):    0.00%
 Time  (mean ± σ):   2.630 ns ± 0.070 ns  ┊ GC (mean ± σ):  0.00% ± 0.00%

     ▁ ▆ ▄ █ ▆   ▃                                          ▂
  ▇▇▃█▁█▆█▃█▄█▁▇▄██▅▄▅▄▄▃▄▅▅▅▅▄▄▅▅▅▅▆▃▆▅▆▇▇▇▆▆▆▃▅▃▅▅▅▅▄▅▅▄▆ █
  2.55 ns     Histogram: log(frequency) by time        3 ns <

 Memory estimate: 0 bytes, allocs estimate: 0.
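
For context, a minimal sketch of the widening idea behind these numbers (my illustration, using the made-up name _hypot_sketch; the code in the actual commits may differ): square in the next-wider type, where each product is exact and the sum cannot overflow, then convert the square root back down. Special-case handling for Inf and NaN, raised later in the review, is omitted here.

# Sketch only: widen, square, take the sqrt, narrow back.
function _hypot_sketch(x::T, y::T) where {T<:Union{Float16,Float32}}
    wx, wy = widen(x), widen(y)    # Float16 -> Float32, Float32 -> Float64
    return T(sqrt(wx * wx + wy * wy))
end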

@Moelf (Contributor, Author) commented on Sep 5, 2021

Closes #36353.

Now, I'm not sure the old _hypot is needed for real numbers, since 0, NaN, and Inf were already handled correctly, but I guess we will see from the test results.
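
For reference, the special cases in question follow the usual IEEE-754 conventions for hypot (illustrative expressions, not taken from the PR):

hypot(0.0f0, 0.0f0)   # == 0.0f0
hypot(Inf32, NaN32)   # == Inf32  (an infinite argument dominates NaN)
hypot(NaN32, 1.0f0)   # NaN32    (otherwise NaN propagates)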

@Moelf changed the title · Sep 5, 2021
@oscardssmith (Member) commented:

I think if you write @fastmath sqrt(x*x+y*y), it will let x^2+y^2 turn into fma(x, x, y*y) which should be faster.

@brett-mahar commented:

@oscardssmith there was talk of removing fastmath from Julia, apparently it's a bit shonky:

#36246

@Moelf (Contributor, Author) commented on Sep 5, 2021

I think FMA makes it much slower on machines without FMA? I don't remember the conclusion.

@oscardssmith (Member) commented:

@brett-mahar while we might eventually get rid of it, we won't until there is a good replacement. The reason we still have @fastmath is that you need it to get fast code in a bunch of circumstances.
@Moelf That's why I recommended using @fastmath rather than explicitly writing an fma in. This way, LLVM will be able to use fma on the CPUs where there is hardware support, and just do x*x+y*y on CPUs that don't have fma in hardware.

@Moelf (Contributor, Author) commented on Sep 5, 2021

@fastmath doesn't exist at this stage; during Base's bootstrap, math.jl is included before fastmath.jl:

include("fastmath.jl")

@dkarrasch (Member) left a review comment

Seems like you need explicit Inf handling. Since the Float16 case didn't fail in the tests, maybe we should add tests for it? Is Float16 tested at all?

@dkarrasch added the labels domain:maths (Mathematical functions) and performance (Must go faster) on Sep 6, 2021
@KristofferC (Member) commented on Sep 6, 2021

This way, LLVM will be able to use fma on the CPUs where there is hardware support, and just do x*x+y*y on CPUs that don't have fma in hardware.

Isn't that just muladd?

I don't think we should add a bunch of @fastmath in these types of functions because it is underspecified what transformations @fastmath actually allows. As it is right now, @fastmath is not used anywhere in Base.

@oscardssmith (Member) commented:

Oh yeah, muladd is probably what we want
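
To spell out the distinction (a sketch with the made-up helper name sumsq, not code from this PR): fma always requests a fused multiply-add, which is slow in software on CPUs without FMA hardware, while muladd leaves the choice to the compiler, so it can lower to a hardware fma where available and to a plain multiply-and-add elsewhere.

# Hypothetical helper: may compile to fma(x, x, y*y) or to x*x + y*y, whichever is faster.
sumsq(x, y) = muladd(x, x, y * y)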

@dkarrasch (Member) commented:

Shall we have tests for the Float16 case, including Infs and NaNs? Otherwise this is good to go, if there are no accuracy concerns left.

@KristofferC (Member) commented:

Saw this on Wikipedia on FMA (https://en.wikipedia.org/wiki/Multiply–accumulate_operation#Fused_multiply–add):

Fused multiply–add can usually be relied on to give more accurate results. However, William Kahan has pointed out that it can give problems if used unthinkingly. If x^2 − y^2 is evaluated as ((x × x) − y × y) (following Kahan's suggested notation in which redundant parentheses direct the compiler to round the (x × x) term first) using fused multiply–add, then the result may be negative even when x = y due to the first multiplication discarding low significance bits. This could then lead to an error if, for instance, the square root of the result is then evaluated.

Here we do that, except with x^2 + y^2. I guess that's fine then?
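
For what it's worth, Kahan's caveat is easy to reproduce for the subtraction case (an illustrative REPL check, not from the thread; the input is chosen so that x*x rounds upward):

julia> x = 1 + 3*2.0^-28;

julia> fma(x, x, -(x*x)) < 0   # exact x*x minus the pre-rounded x*x: negative even though x == y
true

With x^2 + y^2 this sign flip cannot happen, since both terms are non-negative.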

@oscardssmith (Member) commented:

The reason this works is that we are doing a higher-precision fma: the product of two Float32 values needs at most 48 bits of precision, so a Float64 can store y*y exactly.
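
As a quick check of that exactness claim (illustrative): a Float32 significand has 24 bits, so the product of two Float32 values needs at most 48 significant bits, well within Float64's 53.

julia> x = prevfloat(2.0f0);   # a Float32 with all 24 significand bits set

julia> Float64(x) * Float64(x) == big(x)^2   # the Float64 product is exact
true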

@ViralBShah (Member) commented:

Bump. What do we need to do to get this in?

@oscardssmith merged commit d279aed into JuliaLang:master on Nov 10, 2021
@oscardssmith (Member) commented:

Not very much.

LilithHafner pushed a commit to LilithHafner/julia that referenced this pull request Feb 22, 2022
* faster hypot for Float32 and Float16
LilithHafner pushed a commit to LilithHafner/julia that referenced this pull request Mar 8, 2022
* faster hypot for Float32 and Float16
@Moelf deleted the faster_hypot branch on March 13, 2022