
faster hypot for Float32 and Float16 #42122

Merged 6 commits into JuliaLang:master · Nov 10, 2021

Conversation

@Moelf (Contributor) commented on Sep 5, 2021

Before:

julia> @benchmark hypot(x,y) setup=(begin x,y = rand(Float32, 2) end) evals=10000
BenchmarkTools.Trial: 10000 samples with 10000 evaluations.
 Range (min … max):  3.051 ns … 8.185 ns  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     5.700 ns             ┊ GC (median):    0.00%
 Time  (mean ± σ):   5.616 ns ± 0.208 ns  ┊ GC (mean ± σ):  0.00% ± 0.00%

                                          ▁▇▇  ▂▇█▅▁▂▂▂     ▂
  ▄▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁████▇█████████████ █
  3.05 ns     Histogram: log(frequency) by time     6.27 ns <

 Memory estimate: 0 bytes, allocs estimate: 0.

julia> @benchmark hypot(x,y) setup=(begin x,y = rand(Float16, 2) end) evals=10000
BenchmarkTools.Trial: 10000 samples with 10000 evaluations.
 Range (min … max):  3.133 ns … 10.518 ns  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     6.265 ns              ┊ GC (median):    0.00%
 Time  (mean ± σ):   6.229 ns ±  0.280 ns  ┊ GC (mean ± σ):  0.00% ± 0.00%

                                          ▇▆▃▆▁▁█▄▇▂▄▂▁▁▁    ▂
  ▆▅▁▁▁▁▁▄▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▄█████████████████▇ █
  3.13 ns      Histogram: log(frequency) by time     7.06 ns <

 Memory estimate: 0 bytes, allocs estimate: 0.

After:

julia> @benchmark _hypot(x,y) setup=(begin x,y = rand(Float32, 2) end) evals=10000
BenchmarkTools.Trial: 10000 samples with 10000 evaluations.
 Range (min … max):  2.161 ns … 5.114 ns  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     2.188 ns             ┊ GC (median):    0.00%
 Time  (mean ± σ):   2.197 ns ± 0.082 ns  ┊ GC (mean ± σ):  0.00% ± 0.00%

  ▂ ▂ ▇█ ▂                                                  ▁
  █▁█▃██▁█▄▃█▃▁▄▃▁▁▁▁▅▃▁▃▁▃▃▄▄▁▁▁▃▁▃▁▄▄▄▃▅▃▄▄▄▄▄▅▃▄▄▆▅▄▄▄▁▅ █
  2.16 ns     Histogram: log(frequency) by time     2.47 ns <

 Memory estimate: 0 bytes, allocs estimate: 0.

julia> @benchmark _hypot(x,y) setup=(begin x,y = rand(Float16, 2) end) evals=10000
BenchmarkTools.Trial: 10000 samples with 10000 evaluations.
 Range (min … max):  2.547 ns … 3.631 ns  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     2.620 ns             ┊ GC (median):    0.00%
 Time  (mean ± σ):   2.630 ns ± 0.070 ns  ┊ GC (mean ± σ):  0.00% ± 0.00%

     ▁ ▆ ▄ █ ▆   ▃                                          ▂
  ▇▇▃█▁█▆█▃█▄█▁▇▄██▅▄▅▄▄▃▄▅▅▅▅▄▄▅▅▅▅▆▃▆▅▆▇▇▇▆▆▆▃▅▃▅▅▅▅▄▅▅▄▆ █
  2.55 ns     Histogram: log(frequency) by time        3 ns <

 Memory estimate: 0 bytes, allocs estimate: 0.
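
For context, a minimal sketch of the widening idea behind these numbers (my illustration, using the made-up name _hypot_sketch; the code in the actual commits may differ): square in the next-wider type, where each product is exact and the sum cannot overflow, then convert the square root back down. Special-case handling for Inf and NaN, raised later in the review, is omitted here.

# Sketch only: widen, square, take the sqrt, narrow back.
function _hypot_sketch(x::T, y::T) where {T<:Union{Float16,Float32}}
    wx, wy = widen(x), widen(y)    # Float16 -> Float32, Float32 -> Float64
    return T(sqrt(wx * wx + wy * wy))
end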

@Moelf (Contributor, Author) commented on Sep 5, 2021

Closes #36353.

Now, I'm not sure the old _hypot is needed for real numbers, since 0, NaN, and Inf were already handled correctly, but I guess we will see from the test results.
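
For reference, the special cases in question follow the usual IEEE-754 conventions for hypot (illustrative expressions, not taken from the PR):

hypot(0.0f0, 0.0f0)   # == 0.0f0
hypot(Inf32, NaN32)   # == Inf32  (an infinite argument dominates NaN)
hypot(NaN32, 1.0f0)   # NaN32    (otherwise NaN propagates)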

@Moelf changed the title · Sep 5, 2021
@oscardssmith (Member) commented:

I think if you write @fastmath sqrt(x*x+y*y), it will let x^2+y^2 turn into fma(x, x, y*y) which should be faster.

@brett-mahar commented:

@oscardssmith there was talk of removing fastmath from Julia, apparently it's a bit shonky:

#36246

@Moelf (Contributor, Author) commented on Sep 5, 2021

I think FMA makes it much slower on machines without FMA? I don't remember the conclusion.

@oscardssmith (Member) commented:

@brett-mahar while we might eventually get rid of it, we won't until there is a good replacement. The reason we still have @fastmath is that you need it to get fast code in a bunch of circumstances.
@Moelf That's why I recommended using @fastmath rather than explicitly writing an fma in. This way, LLVM will be able to use fma on the CPUs where there is hardware support, and just do x*x+y*y on CPUs that don't have fma in hardware.

@Moelf (Contributor, Author) commented on Sep 5, 2021

@fastmath doesn't exist at this stage; during Base's bootstrap, math.jl is included before fastmath.jl:

include("fastmath.jl")

@dkarrasch (Member) left a review comment

Seems like you need explicit Inf handling. Since the Float16 case didn't fail in the tests, maybe we should add tests for it? Is Float16 tested at all?

@dkarrasch added the labels domain:maths (Mathematical functions) and performance (Must go faster) on Sep 6, 2021
@KristofferC (Member) commented on Sep 6, 2021

This way, LLVM will be able to use fma on the CPUs where there is hardware support, and just do x*x+y*y on CPUs that don't have fma in hardware.

Isn't that just muladd?

I don't think we should add a bunch of @fastmath in these types of functions because it is underspecified what transformations @fastmath actually allows. As it is right now, @fastmath is not used anywhere in Base.

@oscardssmith (Member) commented:

Oh yeah, muladd is probably what we want
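
To spell out the distinction (a sketch with the made-up helper name sumsq, not code from this PR): fma always requests a fused multiply-add, which is slow in software on CPUs without FMA hardware, while muladd leaves the choice to the compiler, so it can lower to a hardware fma where available and to a plain multiply-and-add elsewhere.

# Hypothetical helper: may compile to fma(x, x, y*y) or to x*x + y*y, whichever is faster.
sumsq(x, y) = muladd(x, x, y * y)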

@dkarrasch (Member) commented:

Shall we have tests for the Float16 case, including Infs and NaNs? Otherwise this is good to go, if there are no accuracy concerns left.

@KristofferC (Member) commented:

Saw this on Wikipedia on FMA (https://en.wikipedia.org/wiki/Multiply–accumulate_operation#Fused_multiply–add):

Fused multiply–add can usually be relied on to give more accurate results. However, William Kahan has pointed out that it can give problems if used unthinkingly. If x^2 − y^2 is evaluated as ((x × x) − y × y) (following Kahan's suggested notation in which redundant parentheses direct the compiler to round the (x × x) term first) using fused multiply–add, then the result may be negative even when x = y due to the first multiplication discarding low significance bits. This could then lead to an error if, for instance, the square root of the result is then evaluated.

Here we do that, except with x^2 + y^2. I guess that's fine then?
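
For what it's worth, Kahan's caveat is easy to reproduce for the subtraction case (an illustrative REPL check, not from the thread; the input is chosen so that x*x rounds upward):

julia> x = 1 + 3*2.0^-28;

julia> fma(x, x, -(x*x)) < 0   # exact x*x minus the pre-rounded x*x: negative even though x == y
true

With x^2 + y^2 this sign flip cannot happen, since both terms are non-negative.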

@oscardssmith (Member) commented:

The reason this works is that we are doing a higher-precision fma: the product of two Float32 values needs at most 48 bits of precision, so a Float64 can store y*y exactly.
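
As a quick check of that exactness claim (illustrative): a Float32 significand has 24 bits, so the product of two Float32 values needs at most 48 significant bits, well within Float64's 53.

julia> x = prevfloat(2.0f0);   # a Float32 with all 24 significand bits set

julia> Float64(x) * Float64(x) == big(x)^2   # the Float64 product is exact
true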

@ViralBShah (Member) commented:

Bump. What do we need to do to get this in?

@oscardssmith merged commit d279aed into JuliaLang:master on Nov 10, 2021
@oscardssmith (Member) commented:

Not very much.

LilithHafner pushed a commit to LilithHafner/julia that referenced this pull request Feb 22, 2022
* faster hypot for Float32 and Float16
LilithHafner pushed a commit to LilithHafner/julia that referenced this pull request Mar 8, 2022
* faster hypot for Float32 and Float16
@Moelf deleted the faster_hypot branch on March 13, 2022