Indexing performance #11595

Closed
mbauman opened this issue Jun 5, 2015 · 32 comments
Labels
performance (Must go faster) · regression (Regression in behavior compared to a previous version)

Comments

@mbauman
Member

mbauman commented Jun 5, 2015

Somewhere within a rebase during the month of May, I lost some of the performance I had been expecting to see in #10525. I had been tracking the output of test/perf/array/perf.jl over the life of the issue, and periodically creating new branches locally as I rebased to keep up with master. So I was able to go back in time to a few snapshots and look at some key metrics from the perf test. Here's a quick dump of what I found. I won't be able to spend more time on this today, but I will try to resolve this over the weekend.

julia,sumeltIs ArrayLS,6.971857,12.938212,8.154441,1.144733
julia,sumeachIs ArrayLS,3.964500,8.134745,4.595888,0.732331
julia,sumlinearIs ArrayLS,28.147964,37.045052,29.551360,2.039010
  • mb/abstractsmarts10-updated is a straightforward rebase of that branch to master-just-before-10525-was-merged. It's a little different from master (it won't pass the tests, for one), but the performance is very similar. I wanted to eliminate any possible difference.
julia,sumeltIs ArrayLS,30.437162,40.237688,32.703086,2.072026
julia,sumeachIs ArrayLS,25.994786,43.018392,29.987360,3.992816
julia,sumlinearIs ArrayLS,3.178545,7.702495,4.153768,0.829394 # This is wild

Sorry for the false alarm @timholy, the commit I flagged in #5117 was totally unrelated and simply due to some dirty files I had. It's something else.

@mbauman
Member Author

mbauman commented Jun 5, 2015

Anyone know how to do an automated git bisect of a rebase? Perhaps using git rerere to automate the conflict resolution?

@pao
Member

pao commented Jun 5, 2015

Create a test branch with your commit reordered early into the history, back when you originally proposed it? Effectively, this would rebase everything since then on top of #10525; you can then bisect that branch.

@mbauman
Member Author

mbauman commented Jun 6, 2015

Ah, yes, that was the trick. Thanks for the idea, @pao. This answer makes a lot more sense: 677f537 is the first bad commit

switch inlining costs to use scaled integer math. this removes the need for floating point numbers in inference.

I'll take a look at it more in depth tomorrow.

@vtjnash
Member

vtjnash commented Jun 6, 2015

Unless I made a math error, that commit strictly switches from floating-point math to scaled integer math (with a scale factor of 1000) and should not have significantly changed the thresholds.

@yuyichao
Contributor

yuyichao commented Jun 6, 2015

(a link to the diff line below, which GitHub did not expand)

@yuyichao
Contributor

yuyichao commented Jun 6, 2015

Ahh, GitHub doesn't automatically expand the diff....

Let's try again.

if ((occ==0 && is(aeitype,Bottom)) || islocal || (occ > 1 && !inline_worthy(aei, occ*2)) ||

@vtjnash

@vtjnash
Member

vtjnash commented Jun 6, 2015

@yuyichao I'm not sure what you're pointing at on that line, since it doesn't deal with inlining costs, nor did it change in the bisected commit mentioned above.

@yuyichao
Contributor

yuyichao commented Jun 6, 2015

That is the only call to inline_worthy that you haven't touched. The parameter passed in is not scaled by 1000. In case something goes wrong again, here's the line:

 if ((occ==0 && is(aeitype,Bottom)) || islocal || (occ > 1 && !inline_worthy(aei, occ*2)) ||

Edit: in other words:

diff --git a/base/inference.jl b/base/inference.jl
index f03d3aa..d6f3a81 100644
--- a/base/inference.jl
+++ b/base/inference.jl
@@ -2570,7 +2570,7 @@ function inlineable(f::ANY, e::Expr, atype::ANY, sv::StaticVarInfo, enclosing_as
             end
         end
         free = effect_free(aei,sv,true)
-        if ((occ==0 && is(aeitype,Bottom)) || islocal || (occ > 1 && !inline_worthy(aei, occ*2)) ||
+        if ((occ==0 && is(aeitype,Bottom)) || islocal || (occ > 1 && !inline_worthy(aei, occ*2000)) ||
                 (affect_free && !free) || (!affect_free && !effect_free(aei,sv,false)))
             if occ != 0 # islocal=true is implied by occ!=0
                 if !islocal
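
For readers following along, here's a minimal, self-contained sketch of the bug (hypothetical names and numbers, not the actual inference.jl code): once inline_worthy takes costs as integers scaled by 1000, any call site that still passes a threshold in the old float units is effectively a thousand-fold too strict.

const SCALE = 1000   # scale factor from the bisected commit

# Old convention: cost and threshold are floats on the same scale.
inline_worthy_old(cost::Float64, threshold::Float64) = cost <= threshold

# New convention: cost and threshold are integers scaled by 1000.
inline_worthy_new(cost::Int, threshold::Int) = cost <= threshold

occ = 2
c = 2.5   # pretend inlining cost in the old float units

inline_worthy_old(c, occ * 2.0)                      # true:  2.5 <= 4.0
inline_worthy_new(round(Int, c * SCALE), occ * 2)    # false: 2500 <= 4 (unscaled threshold)
inline_worthy_new(round(Int, c * SCALE), occ * 2000) # true:  2500 <= 4000 (the patched call)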

@vtjnash
Member

vtjnash commented Jun 6, 2015

Oops, good catch -- yes, I didn't even see that one.

@mbauman
Member Author

mbauman commented Jun 6, 2015

Ah, wonderful. Thanks guys. That helps a lot. It's still not 100%, but definitely getting closer. On abstractsmarts10-updated + the above patch:

julia,sumeltIs ArrayLS,12.441170,18.845323,14.019387,1.576552
julia,sumeachIs ArrayLS,9.144515,15.143982,10.485381,1.488682
julia,sumlinearIs ArrayLS,28.358809,37.302391,33.664462,2.070900

Inlining is crazy difficult to get right.

@timholy
Member

timholy commented Jun 6, 2015

The memory consumption reported in #5117 (comment) is also consistent with a splatting penalty.

@yuyichao
Contributor

yuyichao commented Jun 6, 2015

One way to avoid the splatting penalty after #10380 is to pass the tuple as an argument directly instead of splatting it as varargs (see the script and output below).

I'm not sure I would recommend this as a long-term solution, though, because IMHO the vararg case will eventually be as fast as (#11248), or even faster than (#5402), the tuple case.

#!/usr/bin/julia -f

using Base: unsafe_getindex, checkbounds

macro time_func(f, args...)
    args = eval(current_module(), Expr(:tuple, args...))::Tuple
    argnames = Symbol[gensym() for i in 1:length(args)]
    types = map(typeof, args)
    quote
        function wrapper($(argnames...))
            $(Expr(:meta, :noinline))
            $f($(argnames...))
        end
        function timing_wrapper()
            println($f, $types)
            wrapper($(args...))
            gc()
            @time for i in 1:10000000
                wrapper($(args...))
            end
            # Profile.print()
            gc()
        end
        timing_wrapper()
    end
end

function checkbounds2{T<:Union(Real,AbstractArray,Colon)}(A::AbstractArray,
                                                          I::Tuple{Vararg{T}})
    Base.@_inline_meta
    n = length(I)
    if n > 0
        for dim = 1:(n-1)
            (Base._checkbounds(size(A,dim), I[dim]) ||
             Base.throw_boundserror(A, I))
        end
        (Base._checkbounds(Base.trailingsize(A,n), I[n]) ||
         Base.throw_boundserror(A, I))
    end
end

@inline function getindex2(V::SubArray, I::Int...)
    checkbounds2(V, I)
    unsafe_getindex(V, I...)
end

function checkbounds3{T<:Union(Real,AbstractArray,Colon)}(A::AbstractArray,
                                                          I::T...)
    Base.@_inline_meta
    n = length(I)
    if n > 0
        for dim = 1:(n-1)
            (Base._checkbounds(size(A,dim), I[dim]) ||
             Base.throw_boundserror(A, I))
        end
        (Base._checkbounds(Base.trailingsize(A,n), I[n]) ||
         Base.throw_boundserror(A, I))
    end
end

@inline function getindex3(V::SubArray, I::Int...)
    checkbounds3(V, I...)
    unsafe_getindex(V, I...)
end

s1 = sub(Float64[1, 2], 1:2)

# @code_llvm getindex2(s1, 1)
# @code_llvm getindex(s1, 1)

@time_func(getindex3, s1, 1)
@time_func(getindex2, s1, 1)
@time_func(getindex, s1, 1)
getindex3(SubArray{Float64,1,Array{Float64,1},Tuple{UnitRange{Int64}},1},Int64)
   1.560 seconds      (10000 k allocations: 153 MB, 1.13% gc time)
getindex2(SubArray{Float64,1,Array{Float64,1},Tuple{UnitRange{Int64}},1},Int64)
  88.030 milliseconds
getindex(SubArray{Float64,1,Array{Float64,1},Tuple{UnitRange{Int64}},1},Int64)
   1.473 seconds      (10000 k allocations: 153 MB, 0.22% gc time)

@timholy
Member

timholy commented Jun 6, 2015

I've dug into this more, applying @yuyichao's patch and also editing checkbounds to always return nothing.
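
The checkbounds edit amounts to something like this (a hedged sketch for the linear-indexing case only; the real method does the per-dimension checks shown in the scripts above):

# Always return `nothing` on success so the return type is uniformly Void,
# removing a source of type instability at call sites.
function checkbounds_void(A::AbstractArray, i::Int)
    1 <= i <= length(A) || throw(BoundsError(A, i))
    return nothing
end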

julia> s1 = sub(Float64[1, 2], 1:2)
2-element SubArray{Float64,1,Array{Float64,1},Tuple{UnitRange{Int64}},1}:
 1.0
 2.0

julia> f(s) = s[1]
f (generic function with 1 method)

julia> @code_warntype f(s1)
Variables:
  s::SubArray{Float64,1,Array{Float64,1},Tuple{UnitRange{Int64}},1}

Body:
  begin  # none, line 1:
      checkbounds(s::SubArray{Float64,1,Array{Float64,1},Tuple{UnitRange{Int64}},1},1)::Void
      return (top(arrayref))((top(getfield))(s::SubArray{Float64,1,Array{Float64,1},Tuple{UnitRange{Int64}},1},:parent)::Array{Float64,1},(top(box))(Int64,(top(sub_int))((top(box))(Int64,(top(add_int))((top(getfield))(s::SubArray{Float64,1,Array{Float64,1},Tuple{UnitRange{Int64}},1},:first_index)::Int64,1)),1)))::Float64
  end::Float64

No more type-instability, of course. But notice the call to checkbounds is not inlined.

Using HackThatBase, I discovered the reason: something, somewhere, is inserting a NewvarNode before the meta expression (output of linfo and atypes from within typeinf, indentation of the variable indicates backtrace depth):

                                                      linfo = AST(:($(Expr(:lambda, Any[:A,:(I::Union(Real,AbstractArray,Colon)...)], Any[Any[:n,symbol("#s17"),symbol("#s19"),:dim,symbol("#s18")],Any[Any[:A,:Any,0],Any[:I,:Any,0],Any[:n,:Any,18],Any[symbol("#s17"),:Any,18],Any[symbol("#s19"),:Any,2],Any[:dim,:Any,18],Any[symbol("#s18"),:Any,18]],Any[],2], :(begin  # abstractarray.jl, line 162:
        NewvarNode(symbol("#s17"))
        $(Expr(:meta, :inline)) # line 163:
        n = length(I) # line 164:
        unless n > 0 goto 7 # line 165:
        GenSym(0) = colon(1,n - 1)
        #s19 = (top(start))(GenSym(0))
        unless (top(!))((top(done))(GenSym(0),#s19)) goto 1
        2: 
        GenSym(1) = (top(next))(GenSym(0),#s19)
        dim = (top(getfield))(GenSym(1),1)
        #s19 = (top(getfield))(GenSym(1),2) # line 166:
        #s18 = _checkbounds(size(A,dim),getindex(I,dim))
        unless #s18 goto 4
        goto 5
        4: 
        throw_boundserror(A,I)
        5: 
        3: 
        unless (top(!))((top(!))((top(done))(GenSym(0),#s19))) goto 2
        1: 
        0:  # line 168:
        #s17 = _checkbounds(trailingsize(A,n),getindex(I,n))
        unless #s17 goto 6
        goto 7
        6: 
        throw_boundserror(A,I)
        7:  # line 170:
        return nothing
    end)))))
                                                      atypes = Tuple{MyArray{Float64,1},Int64}
                                                      (in typeinf at none, line 1294)

Any ideas where this is happening, @vtjnash?

@yuyichao
Contributor

yuyichao commented Jun 6, 2015

Sounds like the same issue as in #10980 (comment)....

@yuyichao
Contributor

yuyichao commented Jun 6, 2015

I've also tried manually inlining the call, and it appears to be slower than the tuple-argument version here.

@yuyichao
Contributor

yuyichao commented Jun 6, 2015

With the following method added (full script):

@inline function getindex4(V::SubArray, I::Int...)
    n = length(I)
    if n > 0
        for dim = 1:(n - 1)
            (Base._checkbounds(size(V, dim), I[dim]) ||
             Base.throw_boundserror(V, I))
        end
        (Base._checkbounds(Base.trailingsize(V, n), I[n]) ||
         Base.throw_boundserror(V, I))
    end
    unsafe_getindex(V, I...)
end

Output:

getindex4(SubArray{Float64,1,Array{Float64,1},Tuple{UnitRange{Int64}},1},Int64)
   2.030 seconds      (30000 k allocations: 458 MB, 1.31% gc time)
getindex3(SubArray{Float64,1,Array{Float64,1},Tuple{UnitRange{Int64}},1},Int64)
   1.414 seconds      (10000 k allocations: 153 MB, 0.22% gc time)
getindex2(SubArray{Float64,1,Array{Float64,1},Tuple{UnitRange{Int64}},1},Int64)
  80.872 milliseconds
getindex(SubArray{Float64,1,Array{Float64,1},Tuple{UnitRange{Int64}},1},Int64)
   1.116 seconds      (10000 k allocations: 153 MB, 0.28% gc time)

@timholy
Member

timholy commented Jun 6, 2015

Once we solve the inlining problem, all will be golden. It worked fine just a short while ago.

@mbauman
Member Author

mbauman commented Jun 6, 2015

something, somewhere, is inserting a NewvarNode before the meta expression

Ah, very nice. I'll run another bisect for that.

@timholy
Member

timholy commented Jun 6, 2015

Suspected ebfebfd, CC @JeffBezanson

@mbauman
Member Author

mbauman commented Jun 6, 2015

Yup, confirmed. (Bisects are so fast when there are only two commits to test!)

@mbauman
Member Author

mbauman commented Jun 6, 2015

Well, I tried playing around in the frontend to ensure meta always stays on top, but I am not a lisper. It'll have to wait for someone with better lisp chops than me.

@JeffBezanson added the performance (Must go faster) and regression (Regression in behavior compared to a previous version) labels Jun 8, 2015
@JeffBezanson
Member

I think we'll have to be more robust about looking for metadata. Otherwise multiple features will compete for that top spot in the function body.
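
A hedged sketch of the failure mode (function names hypothetical, not the actual inference.jl logic): a pass that only inspects the first statement of the body for an Expr(:meta) node is blinded by a compiler-inserted NewvarNode, while a scan over the body is not.

is_meta(ex) = isa(ex, Expr) && ex.head === :meta

has_meta_first_stmt(body) = is_meta(body[1])   # naive: only looks at slot 1
has_meta_scan(body)       = any(is_meta, body) # robust: scan the whole body

# Stand-in for the lowered code above: a NewvarNode-like statement
# precedes the :inline annotation.
body = Any[:(local x), Expr(:meta, :inline), :(x = 1)]

has_meta_first_stmt(body)  # false -- the prepended node shadows the meta
has_meta_scan(body)        # true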

timholy added a commit that referenced this issue Jun 8, 2015
This fixes the primary problem in #11595, but is not enough on its
own to recapture full performance. Interestingly, this does not appear to
cause a measurable slowing of the compilation of base.
timholy added a commit that referenced this issue Jun 8, 2015
This addresses most of the remaining problem in #11595, but it's still about
three-fold short of the performance we had before getindex(::SubArray, ...)
checked bounds.
@mbauman
Member Author

mbauman commented Jun 8, 2015

Thank you everyone for helping find the issues here. While we still need to find better performance for SubArrays, I think this issue has served its purpose. We're now back to the timings I was expecting to see for #10525. 👍

@mbauman mbauman closed this as completed Jun 8, 2015
@timholy
Member

timholy commented Jun 8, 2015

I have to say, it's still a little unsatisfying that a week ago I had

julia> include("/tmp/mykern.jl")
follow up: mykern( 2:(siz - 3), s, a, x, y )
  11.795 milliseconds (6 allocations: 192 bytes)
follow up: mykern( 2:(siz - 3), s1, a, x1, y1 )
 108.014 milliseconds (38263 allocations: 1119 KB)  # note: this includes compilation, ignore
follow up: mykern( 3:(siz - 2), s2, a, x2, y2 )
  16.299 milliseconds (6 allocations: 192 bytes)
  16.692 milliseconds (6 allocations: 192 bytes)
  22.398 milliseconds (6 allocations: 192 bytes)

(using the test in #5117 (comment)) and now I have

julia> include("/tmp/mykern.jl")
follow up: mykern( 2:(siz - 3), s, a, x, y )
  10.265 milliseconds (6 allocations: 192 bytes)
follow up: mykern( 2:(siz - 3), s1, a, x1, y1 )
 103.093 milliseconds (31840 allocations: 1266 KB)
follow up: mykern( 3:(siz - 2), s2, a, x2, y2 )
  66.623 milliseconds (6 allocations: 192 bytes)
  67.183 milliseconds (6 allocations: 192 bytes)
  69.709 milliseconds (6 allocations: 192 bytes)

A 6-fold gap is big enough that I'd be reluctant to use SubArrays in real applications (and I have a lot of code that uses them now)---I'd prefer to just copy the data.

I think we should turn bounds-checking off on SubArrays until we can recover something close to the original performance.

@yuyichao
Contributor

yuyichao commented Jun 8, 2015

(I think I've lost track of some of the discussion halfway through, so I'm sorry if this has been answered before.)

What has changed between the good (fast) commit and the bad (slow) commit in your (@timholy) comment above? Do #11617 and 56342be not fix the problems introduced by ebfebfd and 677f537, respectively? Or is there another bad commit to be bisected, or has the performance just gradually decreased over the past week(s)?

@timholy
Member

timholy commented Jun 8, 2015

It's the fact that #10525 turned on bounds-checking for SubArrays, which we've never had before. Bounds-checking is a Good Thing, but the performance hit is not. For some reason our bounds-checking on Array is not causing anything like the same performance hit.

@yuyichao
Contributor

yuyichao commented Jun 8, 2015

Ahh, I see. Makes sense. Thanks.

@mbauman
Member Author

mbauman commented Jun 8, 2015

I think we should turn bounds-checking off on SubArrays until we can recover something close to the original performance.

I definitely agree. I'll work on improving the situation this week, but my time is somewhat limited.

@timholy
Member

timholy commented Jun 8, 2015

No worries. It could be a hard problem; frankly, I expect we'll have to leave it off for 0.4.

@mbauman
Member Author

mbauman commented Jun 8, 2015

On Julia just-before-10525, I'm seeing ~15ms for the last three tests. If I remove the @inbounds annotations, I get ~60-65ms.

On master, I get ~55ms, and @inbounds isn't actually disabling any bounds checks! And there's still room to improve beyond that: even if I rewrite the algorithm to use unsafe_(get|set)index, it still lags behind quite a bit at ~40ms.
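
For context, the shape of the kernel being timed is roughly this (a sketch, not the actual mykern from #5117):

# With a working @inbounds, the checkbounds call inside s[i] is elided;
# on master at this point it was not for SubArrays.
function sumview(s)
    t = 0.0
    @inbounds for i in 1:length(s)
        t += s[i]
    end
    return t
end

s = sub(rand(1000), 2:999)   # 0.4-era syntax for constructing a SubArray
@time sumview(s)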

@mbauman
Member Author

mbauman commented Jun 9, 2015

Hm, I don't understand why unsafe_get/setindex isn't giving us the performance we want here. If I simply remove checkbounds from SubArray's getindex, this gives the pre-10525 performance.

Well, at a minimum this is a very easy change to restore the old performance.
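
Roughly, that easy change looks like the following (a sketch of the idea only, not the actual patch):

using Base: unsafe_getindex

# SubArray indexing without the checkbounds call, i.e. the pre-#10525
# (unchecked) behavior.
@inline function getindex_unchecked(V::SubArray, I::Int...)
    unsafe_getindex(V, I...)
end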

@mbauman
Member Author

mbauman commented Jun 9, 2015

It's pretty ugly to rewrite mykern in terms of unsafe_* instead of @inbounds, but it is possible to restore the original performance here with #11625.
