
Improving broadcasting performance by working around recursion limits of inlining #41090

Closed · mateuszbaran opened this issue Jun 4, 2021 · 7 comments
Labels: domain:broadcast, performance

@mateuszbaran (Contributor)

Hi!

I've discovered that many (runtime) performance issues with broadcasting are caused by inlining not working with the highly recursive broadcasting code. It turns out that defining more methods can actually help here. Here is a piece of code you can evaluate in the REPL to see the effect:

# Callable structs standing in for the two anonymous closures that
# Broadcast.make_makeargs builds, so that their types participate in dispatch.
struct RecursiveInliningEnforcerA{T}
    makeargs::T
end

struct RecursiveInliningEnforcerB{TMT,TMH,TT,TH,TF}
    makeargs_tail::TMT
    makeargs_head::TMH
    headargs::TH
    tailargs::TT
    f::TF
end
# The same call body is defined twice (in each of the two @eval loops below),
# once with a fully generic type-parameter bound and once with a more specific
# one, so the recursion no longer repeatedly hits a single method.
for UB in [Any, RecursiveInliningEnforcerA]
    @eval @inline function (bb::RecursiveInliningEnforcerB{TMT,TMH,TT,TH,TF})(args::Vararg{Any,N}) where {N,TMT,TMH<:$UB,TT,TH,TF}
        args1 = bb.makeargs_head(args...)
        a = bb.headargs(args1...)
        b = bb.makeargs_tail(bb.tailargs(args1...)...)
        return (bb.f(a...), b...)
    end
end

for UB in [Any, RecursiveInliningEnforcerB]
    @eval @inline (a::RecursiveInliningEnforcerA{TTA})(head::TH, tail::Vararg{Any,N}) where {TTA<:$UB,TH,N} = (head, a.makeargs(tail...)...)
end

@inline function Broadcast.make_makeargs(makeargs_tail::TT, t::Tuple) where TT
    return RecursiveInliningEnforcerA(Broadcast.make_makeargs(makeargs_tail, Base.tail(t)))
end

function Broadcast.make_makeargs(makeargs_tail, t::Tuple{<:Broadcast.Broadcasted, Vararg{Any}})
    bc = t[1]
    # c.f. the same expression in the function on leaf nodes above. Here
    # we recurse into siblings in the broadcast tree.
    let makeargs_tail = Broadcast.make_makeargs(makeargs_tail, Base.tail(t)),
            # Here we recurse into children. It would be valid to pass in makeargs_tail
            # here, and not use it below. However, in that case, our recursion is no
            # longer purely structural because we're building up one argument (the closure)
            # while destructuring another.
            makeargs_head = Broadcast.make_makeargs((args...)->args, bc.args),
            f = bc.f
        # Create two functions, one that splits off the first length(bc.args)
        # elements from the tuple and one that yields the remaining arguments.
        # N.B. We can't call headargs on `args...` directly because
        # args is flattened (i.e. our children have not been evaluated
        # yet).
        headargs, tailargs = Broadcast.make_headargs(bc.args), Broadcast.make_tailargs(bc.args)
        return RecursiveInliningEnforcerB(makeargs_tail, makeargs_head, headargs, tailargs, f)
    end
end

This effectively duplicates these two functions:

julia/base/broadcast.jl, lines 380 to 384 (at commit abbb220):

return @inline function(args::Vararg{Any,N}) where N
    args1 = makeargs_head(args...)
    a, b = headargs(args1...), makeargs_tail(tailargs(args1...)...)
    (f(a...), b...)
end
and

(head, tail...)->(head, makeargs(tail...)...)

for different argument types. Spreading the recursion across distinct methods in this way is what works around the inliner's recursion limit.

It turns out that this change is sufficient to fix the following issues:
JuliaArrays/StaticArrays.jl#560
JuliaArrays/StaticArrays.jl#682
JuliaArrays/StaticArrays.jl#609
JuliaArrays/StaticArrays.jl#797
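
For anyone who wants to see the effect without pulling in the full examples from those issues, here is a minimal, made-up check (it assumes StaticArrays.jl and BenchmarkTools.jl are installed; h and its inputs are illustrative only, not taken from the linked issues):

using StaticArrays, BenchmarkTools

u = @SVector rand(3)
v = @SVector rand(3)
w = @SVector rand(3)

# A nested broadcast over SVectors; when the whole broadcast kernel inlines,
# this compiles down to a few scalar floating-point operations and allocates nothing.
h(u, v, w) = u .+ 2 .* v .- u ./ w

@btime h($u, $v, $w)       # allocation count should drop to zero once inlining succeeds
@code_typed h(u, v, w)     # inspect whether the broadcast machinery collapsed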

What do you think about it?

@oscardssmith (Member)

What effect does this have on compilation time?

@mateuszbaran (Contributor, Author)

Compilation time of what?

@mateuszbaran (Contributor, Author) commented Jun 4, 2021

I just tried measuring compilation time for one of these issues (JuliaArrays/StaticArrays.jl#560), before:

julia> @time f(ũ, u₀, u₁, ρ)
  0.377140 seconds (831.70 k allocations: 51.208 MiB, 2.14% gc time, 99.99% compilation time)
3-element SVector{3, Float64} with indices SOneTo(3):
 0.2167203902566349
 0.8064882455742344
 4.013494934265736

after:

julia> @time f(ũ, u₀, u₁, ρ)
  0.266012 seconds (720.18 k allocations: 44.344 MiB, 2.57% gc time, 99.99% compilation time)
3-element SVector{3, Float64} with indices SOneTo(3):
 0.49720047062352407
 1.074241639114959
 4.084778907675536

So it seems to even improve compilation time in some cases.

@mbauman added the domain:broadcast and performance labels on Jun 4, 2021
@mateuszbaran (Contributor, Author)

A few more benchmarks (Julia master, Version 1.7.0-DEV.1255 (2021-06-05)). Note that compile times after the change were measured after a fresh start of the Julia REPL, so that nothing from the "before" runs was cached.

Overall I haven't encountered a single case where either compile time or runtime was worse after this change, and there are some fairly decent improvements. The only drawback is slightly more complicated broadcasting code.

Issue JuliaArrays/StaticArrays.jl#560

runtime before:

julia> @btime f($ũ, $u₀, $u₁, $ρ) # 2.354 μs (44 allocations: 3.59 KiB)
  65.892 ns (12 allocations: 192 bytes)

compile time before:

julia> @time f(ũ, u₀, u₁, ρ) # 2.354 μs (44 allocations: 3.59 KiB)
  0.379381 seconds (871.05 k allocations: 55.635 MiB, 4.07% gc time, 100.00% compilation time)

runtime after:

julia> @btime f($ũ, $u₀, $u₁, $ρ) # 2.354 μs (44 allocations: 3.59 KiB)
  0.020 ns (0 allocations: 0 bytes)

compile time after (I'm not sure why I can't reproduce the compile-time reduction today, but at least it's not worse):

julia> @time f(ũ, u₀, u₁, ρ) # 2.354 μs (44 allocations: 3.59 KiB)
  0.379410 seconds (923.10 k allocations: 58.787 MiB, 4.11% gc time, 100.00% compilation time)
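
A note on readings like 0.020 ns (0 allocations: 0 bytes): a sub-nanosecond time means the whole computation was inlined and then constant-folded inside the benchmark loop, so it mainly shows that inlining succeeded. To time the fully inlined kernel doing real work, the usual BenchmarkTools trick is to route inputs through a Ref; a minimal sketch (the function and inputs below are made up, not taken from the linked issue):

using BenchmarkTools, StaticArrays

x = @SVector rand(3)
y = @SVector rand(3)
work(x, y) = x .+ 2 .* y

# Interpolating Ref-wrapped values and unwrapping them inside the benchmarked
# expression defeats constant folding, so the reported time reflects the actual
# broadcast kernel rather than a precomputed constant.
@btime work($(Ref(x))[], $(Ref(y))[])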

Issue JuliaArrays/StaticArrays.jl#682

runtime before:

julia> @btime f($A, $b, $A, $(_dual(b)), $b, $A)
  12.153 μs (310 allocations: 7.34 KiB)

compile time before:

julia> @time f(A, b, A, _dual(b), b, A)
  0.666930 seconds (1.53 M allocations: 97.502 MiB, 4.58% gc time, 99.91% compilation time)

runtime after:

julia> @btime f($A, $b, $A, $(_dual(b)), $b, $A)
  0.020 ns (0 allocations: 0 bytes)

compile time after (small improvement):

julia> @time f(A, b, A, _dual(b), b, A)
  0.558779 seconds (1.47 M allocations: 94.353 MiB, 7.04% gc time, 100.00% compilation time)

Issue JuliaArrays/StaticArrays.jl#609

runtime before:

julia> @btime doit($s,$c)
  1.193 μs (24 allocations: 448 bytes)

compile time before:

julia> @time doit(s,c)
  0.394210 seconds (805.74 k allocations: 52.276 MiB, 99.97% compilation time)
1-element SVector{1, Float64} with indices SOneTo(1):

runtime after:

julia> @btime doit($s,$c)
  0.020 ns (0 allocations: 0 bytes)

compile time after (small improvement):

julia> @time doit(s,c)
  0.351852 seconds (761.63 k allocations: 49.821 MiB, 3.27% gc time, 100.00% compilation time)

Issue JuliaArrays/StaticArrays.jl#797

runtime before:

julia> @btime g!($rs, $as, $bs, $cs, $ds)
  17.343 μs (750 allocations: 13.28 KiB)

compile time before:

julia> @time g!(rs, as, bs, cs, ds)
  0.716501 seconds (2.49 M allocations: 128.211 MiB, 11.63% gc time)

runtime after:

julia> @btime g!($rs, $as, $bs, $cs, $ds)
  20.882 ns (0 allocations: 0 bytes)

compile time after (decent improvement):

julia> @time g!(rs, as, bs, cs, ds)
  0.534649 seconds (2.71 M allocations: 139.501 MiB, 4.61% gc time)

Unrelated broadcasting example

julia> a = [1.0 2; 3 4];

julia> b = [1.0, 2.0];

julia> f(a, b) = a .+ 2 .* b;

compile time before:

julia> @time f(a, b)
  0.108545 seconds (619.70 k allocations: 36.861 MiB, 99.98% compilation time)
2×2 Matrix{Float64}:

compile time after:

julia> @time f(a, b)
  0.108584 seconds (624.50 k allocations: 37.268 MiB, 99.99% compilation time)
2×2 Matrix{Float64}:

@mateuszbaran (Contributor, Author)

Should I open a PR with this change or is it not suitable for Julia Base?

@KristofferC (Member)

I think a PR would show more clearly what your proposed changes are and thereby lead to a better discussion.

@mateuszbaran (Contributor, Author)

OK, I've opened a PR: #41139.
