
Improving broadcasting performance by working around recursion limits of inlining #41090

Closed · mateuszbaran opened this issue Jun 4, 2021 · 7 comments
Labels: domain:broadcast, performance

@mateuszbaran (Contributor)

Hi!

I've discovered that many (runtime) performance issues with broadcasting are caused by inlining not working with the highly recursive broadcasting code. It turns out that defining more methods can actually help here. Here is a piece of code you can evaluate in the REPL to see the effect:

# Callable structs standing in for the two anonymous closures that
# Broadcast.make_makeargs builds, so that their types participate in dispatch.
struct RecursiveInliningEnforcerA{T}
    makeargs::T
end

struct RecursiveInliningEnforcerB{TMT,TMH,TT,TH,TF}
    makeargs_tail::TMT
    makeargs_head::TMH
    headargs::TH
    tailargs::TT
    f::TF
end
# The same call body is defined twice (in each of the two @eval loops below),
# once with a fully generic type-parameter bound and once with a more specific
# one, so the recursion no longer repeatedly hits a single method.
for UB in [Any, RecursiveInliningEnforcerA]
    @eval @inline function (bb::RecursiveInliningEnforcerB{TMT,TMH,TT,TH,TF})(args::Vararg{Any,N}) where {N,TMT,TMH<:$UB,TT,TH,TF}
        args1 = bb.makeargs_head(args...)
        a = bb.headargs(args1...)
        b = bb.makeargs_tail(bb.tailargs(args1...)...)
        return (bb.f(a...), b...)
    end
end

for UB in [Any, RecursiveInliningEnforcerB]
    @eval @inline (a::RecursiveInliningEnforcerA{TTA})(head::TH, tail::Vararg{Any,N}) where {TTA<:$UB,TH,N} = (head, a.makeargs(tail...)...)
end

@inline function Broadcast.make_makeargs(makeargs_tail::TT, t::Tuple) where TT
    return RecursiveInliningEnforcerA(Broadcast.make_makeargs(makeargs_tail, Base.tail(t)))
end

function Broadcast.make_makeargs(makeargs_tail, t::Tuple{<:Broadcast.Broadcasted, Vararg{Any}})
    bc = t[1]
    # c.f. the same expression in the function on leaf nodes above. Here
    # we recurse into siblings in the broadcast tree.
    let makeargs_tail = Broadcast.make_makeargs(makeargs_tail, Base.tail(t)),
            # Here we recurse into children. It would be valid to pass in makeargs_tail
            # here, and not use it below. However, in that case, our recursion is no
            # longer purely structural because we're building up one argument (the closure)
            # while destructuring another.
            makeargs_head = Broadcast.make_makeargs((args...)->args, bc.args),
            f = bc.f
        # Create two functions, one that splits off the first length(bc.args)
        # elements from the tuple and one that yields the remaining arguments.
        # N.B. We can't call headargs on `args...` directly because
        # args is flattened (i.e. our children have not been evaluated
        # yet).
        headargs, tailargs = Broadcast.make_headargs(bc.args), Broadcast.make_tailargs(bc.args)
        return RecursiveInliningEnforcerB(makeargs_tail, makeargs_head, headargs, tailargs, f)
    end
end

This effectively duplicates these two functions:

julia/base/broadcast.jl, lines 380 to 384 (at commit abbb220):

return @inline function(args::Vararg{Any,N}) where N
    args1 = makeargs_head(args...)
    a, b = headargs(args1...), makeargs_tail(tailargs(args1...)...)
    (f(a...), b...)
end
and

(head, tail...)->(head, makeargs(tail...)...)

for different argument types. Spreading the recursion across distinct methods in this way is what works around the inliner's recursion limit.

It turns out that this change is sufficient to fix the following issues:
JuliaArrays/StaticArrays.jl#560
JuliaArrays/StaticArrays.jl#682
JuliaArrays/StaticArrays.jl#609
JuliaArrays/StaticArrays.jl#797
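
For anyone who wants to see the effect without pulling in the full examples from those issues, here is a minimal, made-up check (it assumes StaticArrays.jl and BenchmarkTools.jl are installed; h and its inputs are illustrative only, not taken from the linked issues):

using StaticArrays, BenchmarkTools

u = @SVector rand(3)
v = @SVector rand(3)
w = @SVector rand(3)

# A nested broadcast over SVectors; when the whole broadcast kernel inlines,
# this compiles down to a few scalar floating-point operations and allocates nothing.
h(u, v, w) = u .+ 2 .* v .- u ./ w

@btime h($u, $v, $w)       # allocation count should drop to zero once inlining succeeds
@code_typed h(u, v, w)     # inspect whether the broadcast machinery collapsed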

What do you think about it?

@oscardssmith (Member)

What effect does this have on compilation time?

@mateuszbaran (Contributor, Author)

Compilation time of what?

@mateuszbaran (Contributor, Author) commented Jun 4, 2021

I just tried measuring compilation time for one of these issues (JuliaArrays/StaticArrays.jl#560), before:

julia> @time f(ũ, u₀, u₁, ρ)
  0.377140 seconds (831.70 k allocations: 51.208 MiB, 2.14% gc time, 99.99% compilation time)
3-element SVector{3, Float64} with indices SOneTo(3):
 0.2167203902566349
 0.8064882455742344
 4.013494934265736

after:

julia> @time f(ũ, u₀, u₁, ρ)
  0.266012 seconds (720.18 k allocations: 44.344 MiB, 2.57% gc time, 99.99% compilation time)
3-element SVector{3, Float64} with indices SOneTo(3):
 0.49720047062352407
 1.074241639114959
 4.084778907675536

So it seems to even improve compilation time in some cases.

@mbauman added the domain:broadcast and performance labels on Jun 4, 2021
@mateuszbaran (Contributor, Author)

A few more benchmarks (Julia master, Version 1.7.0-DEV.1255 (2021-06-05)). Note that compile times after the change were measured after a fresh start of the Julia REPL, so that nothing from the "before" runs was cached.

Overall I haven't encountered a single case where either compile time or runtime was worse after this change, and there are some fairly decent improvements. The only drawback is slightly more complicated broadcasting code.

Issue JuliaArrays/StaticArrays.jl#560

runtime before:

julia> @btime f($ũ, $u₀, $u₁, $ρ) # 2.354 μs (44 allocations: 3.59 KiB)
  65.892 ns (12 allocations: 192 bytes)

compile time before:

julia> @time f(ũ, u₀, u₁, ρ) # 2.354 μs (44 allocations: 3.59 KiB)
  0.379381 seconds (871.05 k allocations: 55.635 MiB, 4.07% gc time, 100.00% compilation time)

runtime after:

julia> @btime f($ũ, $u₀, $u₁, $ρ) # 2.354 μs (44 allocations: 3.59 KiB)
  0.020 ns (0 allocations: 0 bytes)

compile time after (I'm not sure why I can't reproduce the compile-time reduction today, but at least it's not worse):

julia> @time f(ũ, u₀, u₁, ρ) # 2.354 μs (44 allocations: 3.59 KiB)
  0.379410 seconds (923.10 k allocations: 58.787 MiB, 4.11% gc time, 100.00% compilation time)
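
A note on readings like 0.020 ns (0 allocations: 0 bytes): a sub-nanosecond time means the whole computation was inlined and then constant-folded inside the benchmark loop, so it mainly shows that inlining succeeded. To time the fully inlined kernel doing real work, the usual BenchmarkTools trick is to route inputs through a Ref; a minimal sketch (the function and inputs below are made up, not taken from the linked issue):

using BenchmarkTools, StaticArrays

x = @SVector rand(3)
y = @SVector rand(3)
work(x, y) = x .+ 2 .* y

# Interpolating Ref-wrapped values and unwrapping them inside the benchmarked
# expression defeats constant folding, so the reported time reflects the actual
# broadcast kernel rather than a precomputed constant.
@btime work($(Ref(x))[], $(Ref(y))[])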

Issue JuliaArrays/StaticArrays.jl#682

runtime before:

julia> @btime f($A, $b, $A, $(_dual(b)), $b, $A)
  12.153 μs (310 allocations: 7.34 KiB)

compile time before:

julia> @time f(A, b, A, _dual(b), b, A)
  0.666930 seconds (1.53 M allocations: 97.502 MiB, 4.58% gc time, 99.91% compilation time)

runtime after:

julia> @btime f($A, $b, $A, $(_dual(b)), $b, $A)
  0.020 ns (0 allocations: 0 bytes)

compile time after (small improvement):

julia> @time f(A, b, A, _dual(b), b, A)
  0.558779 seconds (1.47 M allocations: 94.353 MiB, 7.04% gc time, 100.00% compilation time)

Issue JuliaArrays/StaticArrays.jl#609

runtime before:

julia> @btime doit($s,$c)
  1.193 μs (24 allocations: 448 bytes)

compile time before:

julia> @time doit(s,c)
  0.394210 seconds (805.74 k allocations: 52.276 MiB, 99.97% compilation time)
1-element SVector{1, Float64} with indices SOneTo(1):

runtime after:

julia> @btime doit($s,$c)
  0.020 ns (0 allocations: 0 bytes)

compile time after (small improvement):

julia> @time doit(s,c)
  0.351852 seconds (761.63 k allocations: 49.821 MiB, 3.27% gc time, 100.00% compilation time)

Issue JuliaArrays/StaticArrays.jl#797

runtime before:

julia> @btime g!($rs, $as, $bs, $cs, $ds)
  17.343 μs (750 allocations: 13.28 KiB)

compile time before:

julia> @time g!(rs, as, bs, cs, ds)
  0.716501 seconds (2.49 M allocations: 128.211 MiB, 11.63% gc time)

runtime after:

julia> @btime g!($rs, $as, $bs, $cs, $ds)
  20.882 ns (0 allocations: 0 bytes)

compile time after (decent improvement):

julia> @time g!(rs, as, bs, cs, ds)
  0.534649 seconds (2.71 M allocations: 139.501 MiB, 4.61% gc time)

Unrelated broadcasting example

julia> a = [1.0 2; 3 4];

julia> b = [1.0, 2.0];

julia> f(a, b) = a .+ 2 .* b;

compile time before:

julia> @time f(a, b)
  0.108545 seconds (619.70 k allocations: 36.861 MiB, 99.98% compilation time)
2×2 Matrix{Float64}:

compile time after:

julia> @time f(a, b)
  0.108584 seconds (624.50 k allocations: 37.268 MiB, 99.99% compilation time)
2×2 Matrix{Float64}:

@mateuszbaran (Contributor, Author)

Should I open a PR with this change or is it not suitable for Julia Base?

@KristofferC (Member)

I think a PR would show more clearly what your proposed changes are and thereby lead to a better discussion.

@mateuszbaran (Contributor, Author)

OK, I've opened a PR: #41139.
