
Add evalpoly function #32753

Merged

merged 75 commits into JuliaLang:master on Nov 21, 2019

Conversation

MasonProtter
Contributor

MasonProtter commented Jul 31, 2019

Code taken from Steven G. Johnson's JuliaCon 2019 talk. Gives the same performance as @horner.

using BenchmarkTools  # needed for @btime below

horner(x) = zero(x)
horner(x, p1) = p1
horner(x, p1, ps...) = muladd(x, horner(x, ps...), p1)

x = 10

f(x) = horner(x, 1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0)
g(x) = Base.Math.@horner(x, 1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0)

@btime f($(Ref(x))[]) #  8.921 ns (0 allocations: 0 bytes)
@btime g($(Ref(x))[]) #  8.921 ns (0 allocations: 0 bytes)

A lot has changed since I originally wrote this PR, and the above code is not the current implementation.

See #32753 (comment) for the current state of this PR.

@KristofferC
Sponsor Member

What's the point of having both?

@MasonProtter
Contributor Author

I'd say there's no reason to have the macro if the function can do the same job just as fast, but the macro can't be removed until 2.0.

@KristofferC
Sponsor Member

KristofferC commented Jul 31, 2019

Hoping that the compiler will inline through 12 recursive calls (which might hold for floats, but not for other types) seems worse. Although I can see the argument that it might facilitate writing more generic code.

If the compiler fails to inline through the whole thing, it is likely slower than the loop-based version.

@MasonProtter
Contributor Author

MasonProtter commented Jul 31, 2019

Good points! I guess my thinking was just that macros tend to be a bit more brittle than functions. One thing I like about this is that I can do things like

horner(x, somevector...)

I know that's not going to produce optimal code unless somevector is a compile-time constant, but it's at least more flexible than @horner.

However, this should probably be clarified in the docstring; my note that it should be just as fast as @horner is a bit glib and needs caveats about constant propagation and inlining.

@macd
Contributor

macd commented Jul 31, 2019

Unless I've done this wrong (which is always possible), it looks like the ability to splat into myhorner is a mis-feature?

julia v1.3> using BenchmarkTools

julia v1.3> function fhorner(x, coefs)
                b = coefs[end]
                for i in length(coefs)-1:-1:1
                    b = coefs[i] + x * b
                end
                return b
            end
fhorner (generic function with 1 method)

julia v1.3> myhorner(x) = zero(x)
myhorner (generic function with 1 method)

julia v1.3> myhorner(x, p1) = p1
myhorner (generic function with 2 methods)

julia v1.3> myhorner(x, p1, ps...) = muladd(x, myhorner(x, ps...), p1)
myhorner (generic function with 3 methods)

julia v1.3> x = 10
10

julia v1.3> y = [1.0, 2.0, 3.0]
3-element Array{Float64,1}:
 1.0
 2.0
 3.0

julia v1.3> @btime fhorner($x, $y)
  9.183 ns (0 allocations: 0 bytes)
321.0

julia v1.3> @btime myhorner($x, $(y)...)
  608.847 ns (14 allocations: 240 bytes)
321.0


@MasonProtter
Contributor Author

MasonProtter commented Jul 31, 2019

Splatting into horner will only be fast if the length of the container is known at compile time. For instance with tuples (or static arrays!):

using BenchmarkTools

x = 10  # as in the session above

horner(x) = zero(x)
horner(x, p1) = p1
horner(x, p1, ps...) = muladd(x, horner(x, ps...), p1)

f(x, v) = horner(x, v...)
g(x, p1, p2, p3) = horner(x, p1, p2, p3)

v = (1.0, 2.0, 3.0)

@btime f($(Ref(x))[], $v)            # 1.696 ns (0 allocations: 0 bytes)
@btime g($(Ref(x))[], 1.0, 2.0, 3.0) # 1.696 ns (0 allocations: 0 bytes)

@MasonProtter
Contributor Author

MasonProtter commented Jul 31, 2019

Another advantage of doing this as a function is that we could have something like horner(x, v::Vector) use the loop definition and horner(x, tup::Tuple) use the recursive inlining definition. Then there's a pretty simple API to switch between the definitions as needed.

Plus, people can add methods to horner as they need, which they can't do with @horner. A minimal sketch follows.
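
Here is a minimal sketch of that dispatch-based API (names hypothetical; horner_unrolled stands in for the recursive definition from the top of this thread):

# Hypothetical sketch: dispatch picks the evaluation strategy from the container type.
horner_unrolled(x) = zero(x)
horner_unrolled(x, p1) = p1
horner_unrolled(x, p1, ps...) = muladd(x, horner_unrolled(x, ps...), p1)

# Tuple: length is a compile-time constant, so the recursion can fully unroll.
horner(x, p::Tuple) = horner_unrolled(x, p...)

# Vector: length only known at runtime, so use a plain loop (assumes p is nonempty).
function horner(x, p::AbstractVector)
    b = p[end]
    for i in lastindex(p)-1:-1:firstindex(p)
        b = muladd(x, b, p[i])
    end
    return b
end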

@macd
Contributor

macd commented Jul 31, 2019

Now that makes good sense.

@stevengj
Member

stevengj commented Aug 1, 2019

If we add a function here, I would call it evalpoly rather than horner, and employ the Knuth algorithm for evalpoly(x::Complex, ...) similar to our @evalpoly macro.

See also #18410.
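
For reference, the Knuth-style recurrence for a complex argument exploits the fact that z is a root of z^2 - 2*real(z)*z + abs2(z), so most of the work stays in real arithmetic. A sketch of what such a method could look like (hypothetical name evalpoly_complex; not the exact Base code):

function evalpoly_complex(z::Complex, p::Tuple)
    length(p) == 1 && return p[1]
    N = length(p)
    a = p[end]
    b = p[end-1]
    x, y = real(z), imag(z)
    r = 2x                    # z satisfies z^2 - r*z + s = 0
    s = muladd(x, x, y*y)     # abs2(z)
    for i in N-2:-1:1         # synthetic division by z^2 - r*z + s
        ai = a
        a = muladd(r, ai, b)
        b = muladd(-s, ai, p[i])
    end
    muladd(a, z, b)           # remainder a*z + b, evaluated at z
end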

stevengj added the domain:maths (Mathematical functions) label Aug 1, 2019
@StefanKarpinski
Sponsor Member

I'm not sure I see the advantage of having functions for horner and evalpoly. The macros guarantee that the optimizations you want are done, since they directly generate that code. While it's cool that the compiler is smart enough to do the same optimization, why force it to? Why not just use the macro? Are there cases where the function is preferable?

@MasonProtter
Contributor Author

MasonProtter commented Aug 2, 2019

One case I showed above where the functions actually do something that the macros can’t do is stuff like

f(x, v::SVector) = evalpoly(x, v...)

which is just as efficient as the macro. The crucial point is that for horner to generate efficient code, I don't actually need to know the polynomial coefficients ahead of time; I only need to know how many there are.

If we're worried about having to rely on inlining to generate efficient code, we could also just do this as a generated function. Then we get to guarantee that the loop is unrolled, but keep (most of) the advantages of functions, like using type-level information to dispatch to custom implementations.

@MasonProtter
Contributor Author

MasonProtter commented Sep 6, 2019

Hi all, sorry for the hiatus on this, I was away on vacation. I have made some overhauls:

  • Renamed the function horner -> evalpoly.
  • Switched from relying on constant propagation / inlining to using a generated function.
  • Implemented the Goertzel-like algorithm as a method: evalpoly(z::Complex, p...).

I also thought a bit more about why this makes sense as a function more than a macro. To clarify what I said before, the loop unrolling being done in @evalpoly really only relies on one piece of information: the number of polynomial coefficients. One could say that it is doing a syntax transformation, but I think it's actually more accurate to say that it is using type-level information: the length of the tuple of coeffs. In that sense, this functionality is much more the domain of generated functions.

In the @evalpoly source code, an if-else decides whether z is real (in which case it uses Horner's method) or complex (in which case it uses other code), so it already has a sort of fake internal dispatch going on. It's much nicer to use Julia's built-in dispatch, and that way packages can extend it; for instance, if someone found a nice algorithm for evaluating quaternionic polynomials, it could easily be a method of evalpoly, but not of @evalpoly.


Looking forward, my evalpoly(x, p...) is currently implemented using if @generated so that the compiler may choose to use a slower, looping definition if it needs to, but I have not yet done the same for evalpoly(z::Complex, p...). Would people prefer if evalpoly(z::Complex, p...) used if @generated or is a regular @generated function fine?

Perhaps I should instead get rid of the option to not use the @generated version of evalpoly(x, p...)?
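
For concreteness, the if @generated pattern gives the compiler a choice between an unrolled expression and an ordinary loop. A sketch (the merged code may differ in details; _evalpoly names the loop fallback, as later in this thread):

function evalpoly(x, p::Tuple)
    if @generated
        N = length(p.parameters)          # tuple length, read off the type
        ex = :(p[end])
        for i in N-1:-1:1
            ex = :(muladd(x, $ex, p[$i])) # build a fully unrolled Horner expression
        end
        ex
    else
        _evalpoly(x, p)                   # ordinary runtime loop
    end
end

function _evalpoly(x, p)
    b = p[end]
    for i in lastindex(p)-1:-1:firstindex(p)
        b = muladd(x, b, p[i])
    end
    return b
end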

MasonProtter changed the title from "Add horner function" to "Add evalpoly function" Sep 6, 2019
@JeffreySarnoff
Contributor

What remains to be done, determined, or decided before a consensus version can be merged into v1.4?

@musm
Contributor

musm commented Nov 18, 2019

@MasonProtter your benchmark timings are meaningless since they're sub-ns; to overcome the BenchmarkTools issue you need to do

@btime evalpoly1($(Ref(1.0))[], $(ntuple(n -> (-1)^n / n, 50)))
  48.683 ns (0 allocations: 0 bytes)
@btime evalpoly2($(Ref(1.0))[], $(ntuple(n -> (-1)^n / n, 50)))
  36.253 ns (0 allocations: 0 bytes)

In addition, code generation shows that f and g are identical.

@musm
Contributor

musm commented Nov 18, 2019

@yuyichao I'm not arguing for or against evalpoly1 or evalpoly2, but my tests show that identical code is generated in both cases using your test example in #18410 (comment)

@musm
Contributor

musm commented Nov 18, 2019

@MasonProtter on closer thought, I'm in support of the PR as it currently stands. From my perspective, it's better to guarantee unrolling in the static tuple case.

@MasonProtter
Contributor Author

MasonProtter commented Nov 18, 2019

@MasonProtter your benchmark timings are meaningless since they're sub-ns; to overcome the BenchmarkTools issue you need to do

@btime evalpoly1($(Ref(1.0))[], $(ntuple(n -> (-1)^n / n, 50)))
  48.683 ns (0 allocations: 0 bytes)
@btime evalpoly2($(Ref(1.0))[], $(ntuple(n -> (-1)^n / n, 50)))
  36.253 ns (0 allocations: 0 bytes)

In addition, code generation shows that f and g are identical.

No, I don't believe they are meaningless. You're measuring the case where the compiler doesn't know x ahead of time. On the flip side, I was measuring the case where the compiler knows everything ahead of time, and showed that the generated method could be hoisted completely out of the benchmark loop whereas the non-generated method couldn't. This can be advantageous when someone writes a completely static polynomial in their code, like evalpoly(2.0, (1.0, 2.0, 3.0)), because it'll just be replaced with the value.

However, further testing I've done has shown that this process is fairly finicky and the difference in behaviours between the two methods is not always reliable. For instance,

julia> f() = evalpoly1(1.0, ntuple(n -> (-1)^n / n, 50))
f (generic function with 1 method)

julia> g() = evalpoly2(1.0, ntuple(n -> (-1)^n / n, 50))
g (generic function with 1 method)

julia> @btime f()
  2.305 μs (54 allocations: 1.77 KiB)
-0.6832471605759182

julia> @btime g()
  2.291 μs (55 allocations: 1.78 KiB)
-0.6832471605759182

shows basically no difference, though I'd have naïvely hoped that Julia would be able to lower both of these functions to simply return -0.6832471605759182.

@musm
Contributor

musm commented Nov 18, 2019

I guess the only question remaining here is whether we should employ the trick of dropping to the looping definition after a threshold, something like the sketch below.
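
In code, the trick would be a one-liner along these lines (threshold and helper names hypothetical; evalpoly_generated and _evalpoly stand for the unrolled and looping versions discussed above). Since length(p) is a compile-time constant for tuples, the branch folds away:

# Hypothetical: unroll short polynomials, fall back to the loop for long ones.
evalpoly(x, p::Tuple) = length(p) <= 32 ? evalpoly_generated(x, p) : _evalpoly(x, p)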

@MasonProtter
Contributor Author

MasonProtter commented Nov 18, 2019

@musm I don't see a point where the looping method ever outperforms the generated method. Keep in mind that evalpoly1 was the non-generated method.

julia> function evalpoly1(x, p::Tuple)
           N = length(p)
           ex = p[end]
           for i in N-1:-1:1
               ex = muladd(x, ex, p[i])
           end
           ex
       end
evalpoly1 (generic function with 1 method)

julia> @generated function evalpoly2(x, p::Tuple)
           N = length(p.parameters)
           ex = :(p[end])
           for i in N-1:-1:1
               ex = :(muladd(x, $ex, p[$i]))
           end
           ex
       end
evalpoly2 (generic function with 1 method)

julia> for N in [1, 10, 30, 100, 200, 500, 1000]
           @show N
           print("  Non-generated: "); @btime evalpoly1($(Ref(1.0))[], $(ntuple(n -> (-1)^n / n, N)))
           print("  Generated:     "); @btime evalpoly2($(Ref(1.0))[], $(ntuple(n -> (-1)^n / n, N)))
       end
N = 1
  Non-generated:   0.029 ns (0 allocations: 0 bytes)
  Generated:       0.029 ns (0 allocations: 0 bytes)
N = 10
  Non-generated:   2.310 ns (0 allocations: 0 bytes)
  Generated:       2.305 ns (0 allocations: 0 bytes)
N = 30
  Non-generated:   18.259 ns (0 allocations: 0 bytes)
  Generated:       12.537 ns (0 allocations: 0 bytes)
N = 100
  Non-generated:   85.047 ns (0 allocations: 0 bytes)
  Generated:       61.601 ns (0 allocations: 0 bytes)
N = 200
  Non-generated:   196.571 ns (0 allocations: 0 bytes)
  Generated:       162.356 ns (0 allocations: 0 bytes)
N = 500
  Non-generated:   530.747 ns (0 allocations: 0 bytes)
  Generated:       495.928 ns (0 allocations: 0 bytes)
N = 1000
  Non-generated:   1.092 μs (0 allocations: 0 bytes)
  Generated:       1.061 μs (0 allocations: 0 bytes)

@musm
Contributor

musm commented Nov 18, 2019

Agreed. OK, I think we should go ahead and merge this PR as is!

@JeffBezanson
Sponsor Member

JeffBezanson commented Nov 18, 2019

Looks good to me. I don't know if it matters, but @horner used to always use Horner's rule, while now it will use the more efficient algorithm for complex. Should it do what it says on the tin?

@MasonProtter
Contributor Author

@JeffBezanson Good catch. One way I could fix that is by making it macroexpand to something like invoke(evalpoly, Tuple{Any, Tuple}, x, p). Is that considered to be too ugly a thing to do here? I could also just write an inner generated function that always does Horner's rule.
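
A sketch of such an expansion (details may differ from what was merged): invoke pins the call to the generic Tuple method, so complex arguments still go through plain Horner's rule.

macro horner(x, p...)
    xesc, pesc = esc(x), esc.(p)
    # force the generic (Horner) method even when x is complex
    :(invoke(evalpoly, Tuple{Any,Tuple}, $xesc, ($(pesc...),)))
end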

@JeffBezanson
Sponsor Member

invoke is fine with me. We went to all that trouble to optimize it after all :)

@musm
Contributor

musm commented Nov 19, 2019

Win64 test failure unrelated, otherwise LGTM

@musm
Contributor

musm commented Nov 19, 2019

@MasonProtter could you confirm that some of the elementary math functions, e.g. exp and exp10, don't see regressions with the updated @horner macro?

@MasonProtter
Contributor Author

MasonProtter commented Nov 19, 2019

I see only performance gains, at least relative to v1.3.0-rc3.

julia> using Printf, BenchmarkTools

julia> begin
           @show VERSION
           for x in 10 .* randn(10)
               @show x
               print("  exp("*@sprintf("%.3f",x)*"):   "); @btime exp(  $(Ref(x))[])
               print("  exp10("*@sprintf("%.3f",x)*"): "); @btime exp10($(Ref(x))[])
           end
       end
VERSION = v"1.3.0-rc3.0"
x = -7.15243458415763
  exp(-7.152):     7.744 ns (0 allocations: 0 bytes)
  exp10(-7.152):   9.172 ns (0 allocations: 0 bytes)
x = -8.719494400795519
  exp(-8.719):     7.741 ns (0 allocations: 0 bytes)
  exp10(-8.719):   9.201 ns (0 allocations: 0 bytes)
x = -29.801956722722643
  exp(-29.802):     7.741 ns (0 allocations: 0 bytes)
  exp10(-29.802):   9.201 ns (0 allocations: 0 bytes)
x = -8.877111035670996
  exp(-8.877):     7.741 ns (0 allocations: 0 bytes)
  exp10(-8.877):   8.392 ns (0 allocations: 0 bytes)
x = -0.6928593337570201
  exp(-0.693):     6.449 ns (0 allocations: 0 bytes)
  exp10(-0.693):   9.200 ns (0 allocations: 0 bytes)
x = 2.866124556333614
  exp(2.866):     7.737 ns (0 allocations: 0 bytes)
  exp10(2.866):   8.117 ns (0 allocations: 0 bytes)
x = 23.658277524062772
  exp(23.658):     7.743 ns (0 allocations: 0 bytes)
  exp10(23.658):   9.196 ns (0 allocations: 0 bytes)
x = -8.531107335629256
  exp(-8.531):     7.748 ns (0 allocations: 0 bytes)
  exp10(-8.531):   9.203 ns (0 allocations: 0 bytes)
x = -1.907378473171837
  exp(-1.907):     7.748 ns (0 allocations: 0 bytes)
  exp10(-1.907):   9.199 ns (0 allocations: 0 bytes)
x = 12.353234804733408
  exp(12.353):     7.740 ns (0 allocations: 0 bytes)
  exp10(12.353):   9.198 ns (0 allocations: 0 bytes)
julia> using Printf, BenchmarkTools

julia> begin
           @show VERSION
           for x in 10 .* randn(10)
               @show x
               print("  exp("*@sprintf("%.3f",x)*"):   "); @btime exp(  $(Ref(x))[])
               print("  exp10("*@sprintf("%.3f",x)*"): "); @btime exp10($(Ref(x))[])
           end
       end
VERSION = v"1.4.0-DEV.539"
x = 3.0797159237964333
  exp(3.080):     7.720 ns (0 allocations: 0 bytes)
  exp10(3.080):   8.107 ns (0 allocations: 0 bytes)
x = 11.524550980616235
  exp(11.525):     7.727 ns (0 allocations: 0 bytes)
  exp10(11.525):   8.647 ns (0 allocations: 0 bytes)
x = -13.828224826793532
  exp(-13.828):     7.726 ns (0 allocations: 0 bytes)
  exp10(-13.828):   8.652 ns (0 allocations: 0 bytes)
x = 10.171399187874462
  exp(10.171):     7.727 ns (0 allocations: 0 bytes)
  exp10(10.171):   8.648 ns (0 allocations: 0 bytes)
x = -14.692467227301847
  exp(-14.692):     7.719 ns (0 allocations: 0 bytes)
  exp10(-14.692):   8.650 ns (0 allocations: 0 bytes)
x = 12.429415466997678
  exp(12.429):     7.725 ns (0 allocations: 0 bytes)
  exp10(12.429):   8.646 ns (0 allocations: 0 bytes)
x = 3.5732507205285162
  exp(3.573):     7.727 ns (0 allocations: 0 bytes)
  exp10(3.573):   8.649 ns (0 allocations: 0 bytes)
x = -5.010189256550127
  exp(-5.010):     7.727 ns (0 allocations: 0 bytes)
  exp10(-5.010):   8.648 ns (0 allocations: 0 bytes)
x = -4.809001460910974
  exp(-4.809):     7.724 ns (0 allocations: 0 bytes)
  exp10(-4.809):   8.653 ns (0 allocations: 0 bytes)
x = 19.73389331115626
  exp(19.734):     7.724 ns (0 allocations: 0 bytes)
  exp10(19.734):   8.651 ns (0 allocations: 0 bytes)

JeffBezanson merged commit 8f7855a into JuliaLang:master Nov 21, 2019
musm mentioned this pull request Nov 21, 2019
@musm
Copy link
Contributor

musm commented Nov 21, 2019

Great! For 2.0 we should deprecate @evalpoly.

In the meantime, we can replace all instances of @horner with evalpoly (there are only a handful of calls) and deprecate this unexported macro, since all of its uses are subsumed by the new evalpoly function. For example:
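
With illustrative coefficients, the replacement is mechanical:

# before: internal macro
y = Base.Math.@horner(x, 1.0, -0.5, 0.25)

# after: exported function taking a coefficient tuple
y = evalpoly(x, (1.0, -0.5, 0.25))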

Keno removed the status:triage (This should be discussed on a triage call) label Feb 27, 2020
KristofferC pushed a commit that referenced this pull request Apr 11, 2020
@chriselrod
Contributor

chriselrod commented Aug 3, 2020

@oscardssmith

This caused a regression for some functions using SIMD intrinsics with llvmcall, because the evalpoly function doesn't inline due to overestimated costs.

julia> @inline function expb_kernel(::Val{ℯ}, x::Union{Float32, SVec{W, Float32} where W})
           return SLEEFPirates.@horner(x, 1.0f0, 1.0f0, 0.5f0,
                               0.16666415f0, 0.041666083f0,
                               0.008375129f0, 0.0013956056f0)
       end
 expb_kernel (generic function with 1 method)

julia> @btime myexp($(Ref(sx32))[]) # vector
  3.512 ns (0 allocations: 0 bytes)
 SVec{16,Float32}<16639.281f0, 1.8628123f0, 5.235295f0, 7.906962f-5, 1822.3082f0, 2.4061686f-9, 0.0030119985f0, 0.0002081529f0, 7.7609496f0, 23.372593f0, 8.717485f11, 3.6029082f6, 13742.452f0, 3.1532597f-9, 0.007988205f0, 0.00020794437f0>

julia> @inline function expb_kernel(::Val{ℯ}, x::Union{Float32, SVec{W, Float32} where W})
           return evalpoly(x, (1.0f0, 1.0f0, 0.5f0,
                               0.16666415f0, 0.041666083f0,
                               0.008375129f0, 0.0013956056f0))
       end
 expb_kernel (generic function with 1 method)

julia> @btime myexp($(Ref(sx32))[]) # vector
  4.975 ns (0 allocations: 0 bytes)
 SVec{16,Float32}<16639.281f0, 1.8628123f0, 5.235295f0, 7.906962f-5, 1822.3082f0, 2.4061686f-9, 0.0030119985f0, 0.0002081529f0, 7.7609496f0, 23.372593f0, 8.717485f11, 3.6029082f6, 13742.452f0, 3.1532597f-9, 0.007988205f0, 0.00020794437f0>

Costs:

julia> function cost(f, tt) # Thanks to Tim Holy; note this function needs updating for Julia 1.6
           params = Core.Compiler.Params(typemax(UInt))
           mi = Base.method_instances(f, tt)[1]
           ci = code_typed(f, tt)[1][1]
           opt = Core.Compiler.OptimizationState(mi, params)
           cost(stmt::Expr) = Core.Compiler.statement_cost(stmt, -1, ci, opt.sptypes, opt.slottypes, opt.params)
           cost(stmt) = 0
           sum(cost, ci.code)
       end
cost (generic function with 1 method)

julia> cost(+, Tuple{SVec{16,Float32},SVec{16,Float32}})
10

julia> cost(+, Tuple{Float32,Float32})
1

julia> cost(+, Tuple{SVec{8,Float32},SVec{8,Float32}})
10

julia> cost(muladd, Tuple{SVec{8,Float32},SVec{8,Float32},SVec{8,Float32}})
10

julia> cost(muladd, Tuple{Float32,Float32,Float32})
5

On my computer (with AVX512), all these costs should be 1. On Haswell, the costs normalized relative to muladd should be: 4, 2, 2, 1, 1.
Of course, Julia can't introspect into the llvmcalls.

@KristofferC
Sponsor Member

Ref #32753 (comment)

@MasonProtter
Contributor Author

MasonProtter commented Aug 3, 2020

Note that that comment was in regard to the original implementation of this PR, which relied on recursion rather than a generated function. There is no reliance on '12 steps of inlining' here unless the compiler decides not to use the if @generated branch.

@stevengj
Member

stevengj commented Dec 1, 2020

Do we still need a generated function for this?

I noticed recently that the compiler seems able to unroll/inline Base.Math._evalpoly(x, sometuple) (the loopy version). I haven't found any case yet where the @generated version produces better code.
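
One quick way to compare the two is to inspect the emitted code for both versions (an illustrative check, not from the thread):

julia> code_llvm(Base.Math._evalpoly, (Float64, NTuple{4, Float64}))

julia> code_llvm(evalpoly, (Float64, NTuple{4, Float64}))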

@LilithHafner
Member

For code generation, no, but for effect inference, unfortunately, yes. (In the output below, c/e/n/t/s stand for consistent, effect-free, nothrow, terminates, and notaskstate; + means the effect is proven, ! means it is not.)

julia> Base.infer_effects(evalpoly, (Float64, Tuple{Float64, Float64, Float64}))
(+c,+e,+n,+t,+s)

julia> Base.infer_effects(Base.Math._evalpoly, (Float64, Tuple{Float64, Float64, Float64}))
(!c,+e,!n,!t,+s)
