Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

manually deconstruct and reconstruct subarray when throwing boundserror #29867

Merged
merged 1 commit into from
Nov 1, 2018

Conversation

KristofferC
Copy link
Sponsor Member

@KristofferC KristofferC commented Oct 31, 2018

This is perhaps stupid but anyway.

julia> f(x) = view(x, :)[3]

Before

julia> @btime f($x);
  8.618 ns (1 allocation: 48 bytes)

After

julia> @btime f($x);
  5.526 ns (0 allocations: 0 bytes)

IR Before

julia> @code_typed f(x)
CodeInfo(
1 1 ── %1  = (Base.arraysize)(x, 1)::Int64                                                               │╻╷╷╷╷╷    view
  │    %2  = (Base.slt_int)(%1, 0)::Bool                                                                 ││╻╷╷╷      to_indices
  │    %3  = (Base.ifelse)(%2, 0, %1)::Int64                                                             │││┃││││││   eachindex
  │    %4  = %new(OneTo{Int64}, %3)::OneTo{Int64}                                                        ││││┃││││     axes1
  │    %5  = %new(Slice{OneTo{Int64}}, %4)::Slice{OneTo{Int64}}                                          ││││╻╷╷       uncolon
  └───       goto #3 if not true                                                                         │╻         view
  2 ──       (Base.arraysize)(x, 1)::Int64                                                               ││╻╷╷╷╷     checkbounds
  3 ── %8  = (Core.tuple)(%5)::Tuple{Slice{OneTo{Int64}}}                                                ││╻╷╷       unsafe_view
  │          (Base.arraysize)(x, 1)::Int64                                                               │││╻╷╷       Type
  │    %10 = %new(SubArray{Float64,1,Array{Float64,1},Tuple{Slice{OneTo{Int64}}},true}, x, %8, 0, 1)::SubArray{Float64,1,Array{Float64,1},Tuple{Slice{OneTo{Int64}}},true}
  └───       goto #4                                                                                     ││        
  4 ──       goto #9 if not true                                                                         │╻         getindex
  5 ── %13 = (Core.tuple)(3)::Tuple{Int64}                                                               ││        
  │    %14 = (Base.sle_int)(1, 3)::Bool                                                                  ││╻╷╷       checkbounds
  │    %15 = (Base.sle_int)(3, %3)::Bool                                                                 │││┃││       checkbounds
  │    %16 = (Base.and_int)(%14, %15)::Bool                                                              ││││╻         checkindex
  └───       goto #7 if not %16                                                                          │││       
  6 ──       goto #8                                                                                     │││       
  7 ──       invoke Base.throw_boundserror(%10::SubArray{Float64,1,Array{Float64,1},Tuple{Base.Slice{Base.OneTo{Int64}}},true}, %13::Tuple{Int64})::Union{}
  └───       $(Expr(:unreachable))::Union{}                                                              │││       
  8 ┄─       nothing::Nothing9 ── %22 = (Base.add_int)(0, 3)::Int64                                                                 ││╻         +%23 = (Base.arrayref)(false, x, %22)::Float64                                                     ││╻         getindex
  └───       goto #10                                                                                    │╻         getindex
  10return %23                                                                                  │         
) => Float64

IR After

julia> @code_typed f(x)
CodeInfo(
1 1 ── %1  = (Base.arraysize)(x, 1)::Int64                                                         │╻╷╷╷╷╷    view
  │    %2  = (Base.slt_int)(%1, 0)::Bool                                                           ││╻╷╷╷      to_indices
  │    %3  = (Base.ifelse)(%2, 0, %1)::Int64                                                       │││┃││││││   eachindex
  │    %4  = %new(OneTo{Int64}, %3)::OneTo{Int64}                                                  ││││┃││││     axes1
  │    %5  = %new(Slice{OneTo{Int64}}, %4)::Slice{OneTo{Int64}}                                    ││││╻╷╷       uncolon
  └───       goto #3 if not true                                                                   │╻         view
  2 ──       (Base.arraysize)(x, 1)::Int64                                                         ││╻╷╷╷╷     checkbounds
  3 ── %8  = (Core.tuple)(%5)::Tuple{Slice{OneTo{Int64}}}                                          ││╻╷╷       unsafe_view
  │          (Base.arraysize)(x, 1)::Int64                                                         │││╻╷╷       Type
  └───       goto #4                                                                               ││        
  4 ──       goto #9 if not true                                                                   │╻         getindex
  5 ── %12 = (Core.tuple)(3)::Tuple{Int64}                                                         ││        
  │    %13 = (Base.sle_int)(1, 3)::Bool                                                            ││╻╷╷       checkbounds
  │    %14 = (Base.sle_int)(3, %3)::Bool                                                           │││┃││       checkbounds
  │    %15 = (Base.and_int)(%13, %14)::Bool                                                        ││││╻         checkindex
  └───       goto #7 if not %15                                                                    │││       
  6 ──       goto #8                                                                               │││       
  7 ──       invoke Main.__subarray_throw_boundserror(SubArray{Float64,1,Array{Float64,1},Tuple{Base.Slice{Base.OneTo{Int64}}},true}::Type, _2::Array{Float64,1}, %8::Tuple{Base.Slice{Base.OneTo{Int64}}}, 0::Int64, 1::Int64, %12::Tuple{Int64})::Union{}$(Expr(:unreachable))::Union{}                                                        ││││      
  │          φ ()::Union{}                                                                         │││       
  └───       $(Expr(:unreachable))::Union{}                                                        │││       
  8 ┄─       nothing::Nothing9 ── %23 = (Base.add_int)(0, 3)::Int64                                                           ││╻         +%24 = (Base.arrayref)(false, x, %23)::Float64                                               ││╻         getindex
  └───       goto #10                                                                              │╻         getindex
  10return %24                                                                            │         
) => Float64

So long dear %10.

Fixes #29860

@KristofferC KristofferC added performance Must go faster domain:arrays [a, r, r, a, y, s] labels Oct 31, 2018
@KristofferC
Copy link
Sponsor Member Author

@nanosoldier runbenchmarks(ALL, vs = ":master")

@KristofferC KristofferC changed the title manually destruct and reconstruct subarray when throwing boundserror manually deconstruct and reconstruct subarray when throwing boundserror Oct 31, 2018
@nanosoldier
Copy link
Collaborator

Your benchmark job has completed - possible performance regressions were detected. A full report can be found here. cc @ararslan

@KristofferC
Copy link
Sponsor Member Author

Copy link
Sponsor Member

@timholy timholy left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Kind of stupid, yes, but also admirably clever 😈. Performancewise it's a nice improvement in a well-benchmarked area.

And we'll always remember %10.

@KristofferC KristofferC merged commit 13694f4 into master Nov 1, 2018
@ararslan ararslan deleted the kc/wutwut branch November 1, 2018 00:14
@maleadt
Copy link
Member

maleadt commented Jan 9, 2019

This change introduces a invoke(__subarray_throw_boundserror) into the calling function, making the pretty common operation of indexing a SubArray GPU-incompatible (see JuliaGPU/CUDAnative.jl#313):

julia> function f(dest)
           dest[1] = 1
           nothing
       end
kernel (generic function with 1 method)

julia> tt = Tuple{SubArray{Float64,2,Array{Float64,2},Tuple{UnitRange{Int64},UnitRange{Int64}},false}}
Tuple{SubArray{Float64,2,Array{Float64,2},Tuple{UnitRange{Int64},UnitRange{Int64}},false}}

julia> code_llvm(f, tt)

...
      %68 = call nonnull %jl_value_t addrspace(10)* @jl_invoke(%jl_value_t addrspace(10)* addrspacecast (%jl_value_t* inttoptr (i64 139861354488080 to %jl_value_t*) to %jl_value_t addrspace(10)*), %jl_value_t addrspace(10)** %3, i32 7)
      call void @llvm.trap()
      unreachable
...

I'm not sure this is wanted, both from a GPU-compatibility POV as well as not to introduce an indirect call for SubArray indexing. Revert to have 1.1 work with CuArrays? Or any suggested improvements?
I'd rather not ship a Base.throw_boundserror(A::SubArray{<:CuArray} method.

@KristofferC
Copy link
Sponsor Member Author

KristofferC commented Jan 9, 2019

Or any suggested improvements?

It is difficult to program against an unknown, non CI-enabled, subset of the language.

While the performance improvements in this PR are on their own probably not very important, this PR + #29868 makes quite a bit of difference for broadcasted updates to small arrays. I guess we can revert here but the current situation is very much non-ideal.

as well as not to introduce an indirect call for SubArray indexing

The indirect call is only when we are throwing, so do we care (modulo GPU codegen)?

@maleadt
Copy link
Member

maleadt commented Jan 9, 2019

It is difficult to program against an unknown, non CI-enabled, subset of the language.

Sure, but that's a whole different problem (where to host, which package versions to test against, etc) and not relevant here. I'm doing my best to ensure compatibility, hence my comment. Since the regular code_llvm, ie. without depending on CUDAnative, also shows a jl_invoke there's no reason why this should be difficult to try or suggest improvements for...

@KristofferC
Copy link
Sponsor Member Author

KristofferC commented Jan 9, 2019

Sure, but that's a whole different problem and not relevant here.

Isn't it relevant here since the whole reason there is a problem here is that the code generated here does not adhere to the GPU subset?

there's no reason why this should be difficult to try or suggest improvements for...

The difficult part is that (AFAIU) the reason the SubArray manages to get optimized away is due to the indirect call.

This change introduces a invoke(__subarray_throw_boundserror) into the calling function

Does the code you linked really have anything to do with the change here?

; │││││ @ abstractarray.jl:1741 within `_ind2sub' @ abstractarray.jl:1779
; │││││┌ @ abstractarray.jl:1792 within `_ind2sub_recurse'
; ││││││┌ @ abstractarray.jl:1799 within `_div'
; │││││││┌ @ int.jl:232 within `div'
          br i1 %72, label %pass, label %fail

fail:                                             ; preds = %L37
          call void @jl_throw(%jl_value_t addrspace(12)* addrspacecast (%jl_value_t* inttoptr (i64 4564098480 to %jl_value_t*) to %jl_value_t addrspace(12)*))
          unreachable

That code is there on 1.0.3 as well.

I guess it is the

      %69 = call nonnull %jl_value_t addrspace(10)* @jl_invoke(%jl_value_t addrspace(10)* addrspacecast (%jl_value_t* inttoptr (i64 4587473168 to %jl_value_t*) to %jl_value_t addrspace(10)*), %jl_value_t addrspace(10)** %3, i32 7)

that is the problem?

maleadt added a commit to JuliaGPU/CUDAnative.jl that referenced this pull request Jan 9, 2019
@maleadt
Copy link
Member

maleadt commented Jan 9, 2019

Isn't it relevant here since the whole reason there is a problem here is that the code generated here does not adhere to the GPU subset?

Sure, I meant the difficulty to set-up CI. I also wouldn't expect Base to adhere to that subset right away since it often motivates me to implement some features 🙂 (case in point, JuliaGPU/CUDAnative.jl#314 was motivated by this issue).

And yes I copied the wrong snippet, it's the invoke indeed.

The difficult part is that (AFAIU) the whole reason why the SubArray manages to get optimized away is due to the indirect call.

Does that need a invoke-style indirection? Wouldn't a regular, non-inline function that can be called directly be sufficient as well to not have the gcframe in the hot path? TBH I don't fully understand when the compiler decides to invoke instead of call.

@KristofferC
Copy link
Sponsor Member Author

KristofferC commented Jan 9, 2019

Does that need a invoke-style indirection? Wouldn't a regular, non-inline function that can be called directly be sufficient as well to not have the gcframe in the hot path?

I think that is correct. I am not sure either why an invoke is emitted.

@KristofferC
Copy link
Sponsor Member Author

KristofferC commented Jan 9, 2019

I think this works around it:

diff --git a/base/subarray.jl b/base/subarray.jl
index 2d6f8a709a..8d133b7e1e 100644
--- a/base/subarray.jl
+++ b/base/subarray.jl
@@ -42,7 +42,7 @@ check_parent_index_match(parent, ::NTuple{N, Bool}) where {N} =
 # are inlined
 @inline Base.throw_boundserror(A::SubArray, I) =
     __subarray_throw_boundserror(typeof(A), A.parent, A.indices, A.offset1, A.stride1, I)
-@noinline __subarray_throw_boundserror(T, parent, indices, offset1, stride1, I) =
+@noinline __subarray_throw_boundserror(::Type{T}, parent, indices, offset1, stride1, I) where {T} =
     throw(BoundsError(T(parent, indices, offset1, stride1), I))
 
 # This computes the linear indexing compatibility for a given tuple of indices

Could you check. For me a

  call void @julia___subarray_throw_boundserror_12207(%jl_value_t addrspace(10)* addrspacecast (%jl_value_t* inttoptr (i64 4765226816 to %jl_value_t*) to %jl_value_t addrspace(10)*), %jl_value_t addrspace(10)* nonnull %32, [2 x { i64, i64 }] addrspace(11)* nocapture readonly %35, i64 %37, i64 %39, [1 x i64] addrspace(11)* nocapture readonly %40)

is now emitted.

@maleadt
Copy link
Member

maleadt commented Jan 9, 2019

Yeah that seems great. I'll create a PR.

maleadt added a commit that referenced this pull request Jan 9, 2019
KristofferC pushed a commit that referenced this pull request Jan 9, 2019
Improves #29867 by avoiding an invoke.

(cherry picked from commit f2d2d6f)
KristofferC pushed a commit that referenced this pull request Jan 10, 2019
KristofferC pushed a commit that referenced this pull request Jan 11, 2019
Improves #29867 by avoiding an invoke.

(cherry picked from commit f2d2d6f)
KristofferC added a commit that referenced this pull request May 5, 2020
KristofferC added a commit that referenced this pull request May 6, 2020
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
domain:arrays [a, r, r, a, y, s] performance Must go faster
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants