hypot performance questionable #36353

Closed
KlausC opened this issue Jun 18, 2020 · 19 comments · Fixed by #36365
Labels: domain:maths (Mathematical functions), performance (Must go faster)

Comments

@KlausC (Contributor) commented Jun 18, 2020

I have been making heavy use of hypot(a, b) with real arguments and discovered an interesting performance bottleneck.
I am aware of #31922 and #33224, which both focus on accuracy.
I want to present the following measurements:

julia> @benchmark hypot(1.0, 2.0)
BenchmarkTools.Trial: 
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     18.065 ns (0.00% GC)
  median time:      18.232 ns (0.00% GC)
  mean time:        18.726 ns (0.00% GC)
  maximum time:     60.926 ns (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     997

julia> @benchmark MatrixAlgebra._hypot(1.0, 2.0)
BenchmarkTools.Trial: 
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     1.830 ns (0.00% GC)
  median time:      1.843 ns (0.00% GC)
  mean time:        1.890 ns (0.00% GC)
  maximum time:     10.976 ns (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     1000

That is a runtime factor of 10 compared to this (admittedly naive) implementation:

function _hypot(x, y)
    s = sqrt(abs2(x) + abs2(y))
    # Fall back to the careful Base implementation only when the naive
    # result is suspect (Inf, NaN, or near the underflow threshold).
    _isclean(s) ? s : hypot(x, y)
end

function _isclean(a::T) where T<:AbstractFloat
    isfinite(a) && a >= sqrt(4 * floatmin(T))
end

Comparing the accuracy of both implementations gives:

julia> foo(10^6, Float64, hypot)
(4.005836504709864e-17, 4.7168514470489265e-17, 1.168161969438445e-16)

julia> foo(10^6, Float64, MatrixAlgebra._hypot)
(4.6543949325418543e-17, 5.704669978693873e-17, 2.108974170018499e-16)

julia> function foo(n, T, f)
           s3 = s2 = s1 = zero(T)
           for i = 1:n
               a, b = T.(randn(2))
               h = hypot(big(a), big(b))
               d = T(abs(h - f(a, b)) / h)
               s3 = max(s3, d)
               s2 = hypot(s2, d)
               s1 += d
           end
           s1 / n, s2 / sqrt(n), s3
       end
foo (generic function with 3 methods)

The return values of foo are the relative deviations from the accurate (BigFloat) results, aggregated in the 1-norm (mean), 2-norm (RMS), and Inf-norm (maximum).
I would like to discuss whether those accuracy deviations justify the performance burden.

@longemen3000 (Contributor)

I like this fast path; the only thing that comes to my mind is whether the ulp error is acceptable.
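
A minimal sketch of how that could be quantified in ulps against a BigFloat reference (the helper name ulp_error is hypothetical, not part of the thread's code):

# Error of f(a, b) in ulps of the correctly rounded Float64 result.
function ulp_error(f, a, b)
    ref = Float64(hypot(big(a), big(b)))   # correctly rounded reference
    abs(f(a, b) - ref) / eps(ref)
end

For example, ulp_error(_hypot, 1.0, 2.0) measures a single pair; sampling many random pairs gives a feel for the worst case.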

@kimikage (Contributor) commented Jun 19, 2020

I am concerned about the degradation of accuracy, too.

However, for Float32 I think the following specialization (which widens the intermediate value s to Float64) may be usable in most cases.

hypot_f32(a::Float32, b::Float32) = Float32(sqrt(Float64(a)^2 + Float64(b)^2))

(I have not yet identified the cases that would lead to inaccuracy.)
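
One rough way to search for such cases is to compare hypot_f32 against a BigFloat reference over randomly scaled inputs (a sketch only; the helper name search_inaccurate is hypothetical, and the exponent range is chosen to keep the Float32 reference finite):

# Brute-force search for Float32 inputs where hypot_f32 deviates from the
# correctly rounded result, measured in ulps.
function search_inaccurate(n)
    worst = 0.0f0
    for _ in 1:n
        a = rand(Float32) * exp2(Float32(rand(-126:126)))
        b = rand(Float32) * exp2(Float32(rand(-126:126)))
        ref = Float32(hypot(big(a), big(b)))
        worst = max(worst, abs(hypot_f32(a, b) - ref) / eps(ref))
    end
    worst   # maximum observed error in ulps
end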

@kimikage (Contributor)

BTW, the current implementation of hypot uses Base.Math.FMA_NATIVE. (cf. #33011)

@KlausC (Contributor, Author) commented Jun 19, 2020

Can anybody run the benchmarks on a machine with FMA_NATIVE == true? For example, an Intel Core i5 of the 4th generation or later.
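
For reference, a quick way to check the flag and compare both implementations (FMA_NATIVE is an internal constant referenced above, so its name and location may change between Julia versions; _hypot is the definition from the first post):

using BenchmarkTools

Base.Math.FMA_NATIVE          # whether this build detected fused multiply-add support

@benchmark hypot(1.0, 2.0)    # Base implementation
@benchmark _hypot(1.0, 2.0)   # naive fast path from the first post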

(A comment from @kimikage has been minimized.)

@KlausC (Contributor, Author) commented Jun 19, 2020

I would also be interested in the timings on that machine using _hypot from my post.

@kimikage (Contributor) commented Jun 19, 2020

Please wait a minute. 🤔

Does this eps call fail to constant-fold?

scale = eps(sqrt(floatmin(T))) #Rescaling constant

cf. #31922 (comment)
cc: @cfborges, @simonbyrne

@simonbyrne (Contributor)

It looks like it does...

julia> f(T) = eps(sqrt(floatmin(T)))
f (generic function with 1 method)

julia> @code_llvm f(Float64)

;  @ REPL[9]:1 within `f'
define double @julia_f_17254(%jl_value_t addrspace(10)*) {
top:
; ┌ @ float.jl:745 within `eps'
   %1 = call double @julia_ldexp_918(double 0x3CB0000000000000, i64 -511)
; └
  ret double %1
}

@simonbyrne (Contributor)

This seems to work correctly:

julia> g(T) = eps(T)*sqrt(floatmin(T))
g (generic function with 1 method)

julia> @code_llvm g(Float64)

;  @ REPL[21]:1 within `g'
define double @julia_g_17323(%jl_value_t addrspace(10)*) {
top:
  ret double 0x1CC0000000000000
}
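
A quick sanity check shows the rewritten expression is bit-for-bit equal to the original, so only the constant folding changes, not the value of the scaling constant:

julia> eps(sqrt(floatmin(Float64))) == eps(Float64) * sqrt(floatmin(Float64))
true

julia> eps(sqrt(floatmin(Float32))) == eps(Float32) * sqrt(floatmin(Float32))
true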

simonbyrne added a commit that referenced this issue Jun 19, 2020
The scaling constant in `hypot` wasn't being constant folded. Fixes #36353.
@KlausC (Contributor, Author) commented Jun 19, 2020

PR #36365 reduces the slowdown from 10-fold to 7-fold on my machine (without FMA_NATIVE). That may go down to 5-fold on FMA machines.
The question is: do we want to pay that price for one bit of accuracy in hypot?

@cfborges (Contributor)

I had noticed that evaluating those constants seemed to slow things down, but I did not know why. Glad someone has figured that out.
It's clear that code which avoids unnecessary floating-point exceptions and attempts to round correctly will be slower. My goal with that algorithm was simply to achieve correct rounding, since that is what the IEEE standard suggests (but does not require) for hypot. That said, I think that for most applications where speed is critical it is better to just use h = sqrt(x*x + y*y), since this can never be off by more than one ulp (as opposed to the Julia 1.1 version of hypot, which could be off by two). The only problem with this approach is that the user needs to be sure there are no extreme arguments, since those could lead to unnecessary overflow or underflow.
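
To illustrate the overflow risk with the naive formula (a small example, assuming Float64 inputs):

x = 1e300
sqrt(x*x + x*x)   # Inf: x*x overflows before the square root is taken
hypot(x, x)       # finite (about 1.414e300): internal rescaling avoids the overflow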

@kimikage (Contributor) commented Jun 19, 2020

micro optimization...

@inline function hypot(x::T, y::T) where T<:AbstractFloat
    absx, absy = abs(x), abs(y)
    
    # Return Inf if either or both inputs is Inf (Compliance with IEEE754)
    isinf(absx) && return absx
    isinf(absy) && return absy
    
    # Order the operands (ax >= ay)
    ax = ifelse(absx < absy, absy, absx)
    ay = ifelse(absx < absy, absx, absy)
    ...

Edit:
I applied the suggestion by @cfborges (#36353 (comment)).
The two branches seem to have gotten flipped while I was struggling with NaN handling. The intention is to use the hardware min/max instructions (without @fastmath), so my bug should have no effect on speed.
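
A minimal way to check that the ordering lowers to conditional selects rather than branches (a sketch; the helper order is hypothetical):

using InteractiveUtils

# `ifelse` evaluates both operands and compiles to a select, unlike the
# ternary `a < b ? b : a`, which may emit a branch.
order(a, b) = (ifelse(a < b, b, a), ifelse(a < b, a, b))  # (larger, smaller)

@code_native order(1.0, 2.0)   # expect compare + select/min/max instructions, no jumps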

@simonbyrne (Contributor)

@kimikage can you open a pull request?

@kimikage (Contributor)

Perhaps I could, but I don't think it is worth a separate PR right now. We still need to discuss #33011, the approach in the OP, and the Float32 specialization.

@cfborges (Contributor)

One minor comment: I think

# Order the operands
    ax = ifelse(absx < absy, absx, absy)
    ay = ifelse(absx < absy, absy, absx)

would actually need to be

# Order the operands
    ax = ifelse(absx < absy, absy, absx)
    ay = ifelse(absx < absy, absx, absy)

since the goal is to have ax >= ay.

simonbyrne added a commit that referenced this issue Aug 1, 2020
The scaling constant in `hypot` wasn't being constant folded. Fixes #36353.
simonbyrne reopened this Aug 1, 2020
@simonbyrne (Contributor)

I'll leave this open since my fix was only partial.

simeonschaub pushed a commit to simeonschaub/julia that referenced this issue Aug 11, 2020
The scaling constant in `hypot` wasn't being constant folded. Fixes JuliaLang#36353.
@Moelf (Sponsor Contributor) commented Sep 5, 2021

@kimikage I like the hypot_f32 approach; do you know if this is supported by others?

julia> a = sqrt(typemax(Float32) |> prevfloat)
1.8446743f19

julia> hypot(a,a)
2.6087633f19

julia> hypot_f32(a,a)
2.6087633f19
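
The range argument behind this: every Float32 converts to Float64 exactly, the squares are exact in Float64, and neither overflow nor underflow is possible, so the remaining Float64 rounding error is far below one Float32 ulp; only the rare double-rounding corner cases remain, which is what was left unverified above. A quick check of the range claim:

julia> Float64(floatmax(Float32))^2 < floatmax(Float64)   # no overflow possible
true

julia> Float64(nextfloat(0.0f0))^2 > 0.0                  # no underflow to zero either
true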

@oscardssmith (Member)

Yeah, I would approve that PR. Trying to do fancy stuff with Float32 is almost always worse than doing simple stuff with Float64.

ViralBShah added the domain:maths and performance labels Mar 12, 2022
@ViralBShah (Member)

hypot performance is greatly improved (perhaps because of #42122):

julia> @benchmark hypot(1.0, 2.0)
BenchmarkTools.Trial: 10000 samples with 1000 evaluations.
 Range (min … max):  1.430 ns … 10.624 ns  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     1.437 ns              ┊ GC (median):    0.00%
 Time  (mean ± σ):   1.442 ns ±  0.097 ns  ┊ GC (mean ± σ):  0.00% ± 0.00%

    ▁▄█▆▆▆▃                                                   
  ▂▅████████▇▇▇▆▁▇▆▅▃▂▂▂▂▁▂▁▁▂▁▂▁▁▁▁▁▁▁▁▁▁▁▁▁▁▂▂▃▃▃▃▃▃▃▂▂▂▂▂ ▃
  1.43 ns        Histogram: frequency by time        1.48 ns <

 Memory estimate: 0 bytes, allocs estimate: 0.
