
improve optimization passes to produce more compact IR #20853

Merged

6 commits merged into master from jb/IRcleanup on Mar 10, 2017

Conversation

JeffBezanson
Sponsor Member

I've noticed a few functions with pretty bloated IR (redundant variables, useless statements, etc.). This improves the optimization passes to address some of these cases. An extreme example is vector+vector broadcast, which with these changes goes from 248 statements to 127.
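For a rough sense of how such statement counts can be measured, here is a sketch (the exact return shape of code_typed has changed across Julia versions, so treat this as illustrative rather than exact):

f(a, b) = a .+ b

# On recent Julia versions code_typed returns CodeInfo => return-type pairs;
# the number of statements in the optimized IR is the length of the code field.
ci, rettype = first(code_typed(f, Tuple{Vector{Float64}, Vector{Float64}}))
println("statements in optimized IR: ", length(ci.code))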

@@ -305,7 +305,8 @@ const workq = Vector{InferenceState}() # set of InferenceState objects that can

#### helper functions ####

@inline slot_id(s) = isa(s, SlotNumber) ? (s::SlotNumber).id : (s::TypedSlot).id # using a function to ensure we can infer this
@inline slot_id(s::Slot) =
Sponsor Member

type-inference should be good enough now that this won't cause a regression, but I'm not sure it's ideal to require it

Sponsor Member Author

Why? How could adding this declaration cause a regression?

Sponsor Member

it blocks inlining, turning the isa tests into a dynamic dispatch, unless we've correctly inferred that the value is <: Slot
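A hedged illustration of the trade-off in this exchange, using simplified stand-in types rather than the real inference.jl definitions:

abstract type Slot end
struct SlotNumber <: Slot
    id::Int
end
struct TypedSlot <: Slot
    id::Int
    typ
end

# Untyped signature: the isa checks sit inside the (inlineable) body, so a
# caller that only knows the argument as Any can still have them inlined.
@inline slot_id_untyped(s) = isa(s, SlotNumber) ? (s::SlotNumber).id : (s::TypedSlot).id

# ::Slot signature: if the caller has not inferred the argument as <: Slot,
# the call itself becomes a dynamic dispatch on the annotated method instead.
@inline slot_id_typed(s::Slot) = s.id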

@@ -3359,7 +3360,8 @@ function is_pure_builtin(f::ANY)
f === Intrinsics.checked_srem_int ||
f === Intrinsics.checked_urem_int ||
f === Intrinsics.check_top_bit ||
f === Intrinsics.sqrt_llvm)
f === Intrinsics.sqrt_llvm ||
f === Intrinsics.cglobal) # cglobal throws an error for symbol-not-found
Sponsor Member

all of them can throw errors

Sponsor Member Author

True... it's just that the BLAS code relies on a cglobal call in void context throwing an error, which is a very atypical use of an intrinsic.
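The usage pattern being referred to, roughly (illustrative only; this is not the actual Base BLAS code, and the symbol/library names are placeholders):

function blas_symbol_present()
    try
        # cglobal throws if the symbol (or the library) cannot be resolved,
        # and that error is what the detection logic relies on.
        cglobal((:dgemm_64_, "libblas"))
        return true
    catch
        return false
    end
end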

Sponsor Member

I think we should just fix that code. I've never been super happy with it throwing an exception during every normal startup anyways.

end

function var_matches(a::Union{Slot,SSAValue}, b::Union{Slot,SSAValue})
return ((isa(a,SSAValue) && isa(b,SSAValue)) || (isa(a,Slot) && isa(b,Slot))) && a.id == b.id
Sponsor Member

might be more compact and readable just to write this using dispatch
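For reference, a sketch of the dispatch-based form being suggested here (a hypothetical rewrite, not necessarily what was committed):

var_matches(a::SSAValue, b::SSAValue) = a.id == b.id
var_matches(a::Slot, b::Slot) = a.id == b.id
var_matches(a::Union{Slot,SSAValue}, b::Union{Slot,SSAValue}) = false   # mixed kinds never match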

@@ -3657,7 +3657,7 @@ function inlineable(f::ANY, ft::ANY, e::Expr, atypes::Vector{Any}, sv::Inference
if sv.params.inlining
if isa(e.typ, Const) # || isconstType(e.typ)
if (f === apply_type || f === fieldtype || f === typeof ||
istopfunction(topmod, f, :typejoin) ||
istopfunction(topmod, f, :typejoin) || f === (===) ||
@vtjnash (Sponsor Member) Mar 1, 2017

should have its own line to be consistent with the other items here

Sponsor Member

note that === should often show up here as a Conditional object, not a Const
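A rough illustration of the case the added f === (===) branch targets: a === call whose result inference has already proven as a constant, so the comparison can be folded out of the IR instead of being left as a call (sketch only):

g() = Int === Int ? 1 : 2   # inference proves the === result, so the branch folds away

# Inspect the optimized IR with, e.g.:
#   code_typed(g, Tuple{})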

@ararslan added the compiler:inference and performance labels Mar 1, 2017
@JeffBezanson
Sponsor Member Author

@nanosoldier runbenchmarks(ALL, vs = ":master")

@nanosoldier
Collaborator

Your benchmark job has completed - possible performance regressions were detected. A full report can be found here. cc @jrevels

@JeffBezanson
Sponsor Member Author

A very frustrating inlining-related regression here: the code for a convert method got smaller, allowing it to be inlined into promote, which prevented the promote method from getting inlined, which introduced an extra tuple allocation. Will have to do something about that.

@andyferris
Member

andyferris commented Mar 2, 2017

I find @inline is a bit of a cancer... even 1-liners need an @inline because what they call might want to be inlined.

It seems to me that for one-liners (or two-liners, etc.) this should be relatively safe: only code which would have been partially inlined anyway will get inlined just one more level (per @inline method).

Thus, I think we could relatively safely add @inline to a few more places in Base (even/especially where they seem unnecessary at first glance), including for promote. Interestingly, in StaticArrays I provide method specializations which are verbatim copies of Base methods but with an @inline decoration, which makes measurable performance improvements.

(Maybe small functions need a @propagate_inline meta (just joking! :P))
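A hypothetical sketch of the StaticArrays pattern mentioned above: a package re-declaring a method for its own type as a near-verbatim copy of the generic definition, differing only in the added @inline. (The type and methods below are invented for illustration, not quoted from StaticArrays or Base.)

struct MyNumber <: Number
    x::Float64
end

Base.promote_rule(::Type{MyNumber}, ::Type{Float64}) = Float64
Base.convert(::Type{Float64}, a::MyNumber) = a.x

# The "copy of the generic definition, plus @inline" specialization:
@inline Base.promote(a::MyNumber, b::Float64) = (convert(Float64, a), b)

# promote(MyNumber(1.5), 2.0) returns (1.5, 2.0); the @inline annotation
# encourages the whole call to collapse at the use site.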

@vtjnash
Sponsor Member

vtjnash commented Mar 5, 2017

"Will have to do something about that"

It seems like maybe the inlining threshold should be (partially) based on the pre-inlined method body, or some ratio? But this is probably a discussion for a different place.
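A purely hypothetical sketch of that idea (not what the PR or Base implements): let a method's inlining budget scale with the size of its pre-inlined body, so it is not disqualified from inlining merely because its own callees were inlined into it.

function inlining_budget(pre_inline_len::Int; base::Int = 100, ratio::Float64 = 2.0)
    # allow the post-inlining body to be up to `ratio` times the original size
    return max(base, round(Int, ratio * pre_inline_len))
end

# e.g. a method that was 80 statements before inlining could grow to 160
# statements and still be considered inlineable under this rule.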

@JeffBezanson
Sponsor Member Author

@nanosoldier runbenchmarks(ALL, vs = ":master")

@nanosoldier
Collaborator

Your benchmark job has completed - possible performance regressions were detected. A full report can be found here. cc @jrevels

this was affected by using `Const` in more cases instead of `Type{}`
don't inline into a function `f` if doing so would put it over the
inlining threshold, and if inlining `f` itself would help
avoid tuple allocations.

so far this is only used on `promote`, to limit the effects as
much as possible.
@JeffBezanson
Sponsor Member Author

@nanosoldier runbenchmarks(ALL, vs = ":master")

@JeffBezanson
Sponsor Member Author

Ok, what I did here was adjust the inlining pass to accumulate added statements into a single buffer, which I think makes the code a bit simpler, and allows us to easily observe how big the enclosing function is getting. Then I use this to avoid inlining into promote if it would make promote itself non-inlineable. This is helpful for bigints and bigfloats. There were some regressions when I applied this heuristic more widely; testing again now that it is applied only to promote.
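In rough pseudocode, the heuristic described above looks something like this (the names and the threshold value are invented for the sketch, not the actual inference.jl internals):

const INLINE_BODY_LIMIT = 100   # assumed statement budget for a method to stay inlineable

# Decide whether splicing a callee's statements into the caller is worth it,
# given that we would rather keep the caller (e.g. promote) small enough to be
# inlined itself and thereby avoid a tuple allocation at its call sites.
function ok_to_inline_into(caller_stmts::Vector{Any}, callee_stmts::Vector{Any},
                           keep_caller_inlineable::Bool)
    projected_size = length(caller_stmts) + length(callee_stmts)
    if keep_caller_inlineable && projected_size > INLINE_BODY_LIMIT
        return false   # skip: inlining here would push the caller over the threshold
    end
    return true
end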

@nanosoldier
Collaborator

Your benchmark job has completed - possible performance regressions were detected. A full report can be found here. cc @jrevels

@tkelman
Contributor

tkelman commented Mar 8, 2017

interesting, factorizations got worse but almost everything else got better

@KristofferC
Sponsor Member

@nanosoldier runbenchmarks(ALL, vs = ":master")

Double check.

@nanosoldier
Collaborator

Your benchmark job has completed - possible performance regressions were detected. A full report can be found here. cc @jrevels

@tkelman
Contributor

tkelman commented Mar 9, 2017

svd and ["sparse","index",("spmat","logical",100)] slowdowns look real then

@JeffBezanson
Sponsor Member Author

@nanosoldier runbenchmarks(ALL, vs = ":master")

@JeffBezanson
Sponsor Member Author

I think I figured this one out. Replacing certain slots with equivalent ssavalues was causing LLVM to emit excessive numbers of memcpys to move tuples around. This can probably be considered a quasi-bug in LLVM (SROA pass I believe), since it should have been able to figure out that a tuple should be stack allocated to begin with and left in place. Here's a sample of the IR:

julia> G=Base.Generator{Base.Iterators.Prod2{UnitRange{Int64},UnitRange{Int64}},getfield(Base,Symbol("##54#55")){Float64,NTuple{5,Array{Float64,1}}}}

julia> code_llvm(Base.collect_to!, (Matrix{Float64}, G, Int, Tuple{Int,Int,Nullable{Int},Bool}))

define i8** @"julia_collect_to!_68175"(i8**, i8**, i64, { i64, i64, %Nullable.64, i8 }*) #0 !dbg !5 {
top:
  %.sroa.281 = alloca [7 x i8], align 1
  %.sroa.979 = alloca [7 x i8], align 1
  %4 = alloca { [2 x i64], { i64, i64, %Nullable.64, i8 } }, align 8
  %.sroa.264.sroa.0 = alloca [7 x i8], align 1
  %.sroa.5 = alloca [7 x i8], align 1
  %"#temp#6.sroa.4.sroa.4.sroa.3.sroa.0" = alloca [7 x i8], align 1
  %"#temp#6.sroa.4.sroa.6" = alloca [7 x i8], align 1
  %st.sroa.7.0..sroa_idx = getelementptr inbounds { i64, i64, %Nullable.64, i8 }, { i64, i64, %Nullable.64, i8 }* %3, i64 0, i32 3
  %st.sroa.7.0.copyload = load i8, i8* %st.sroa.7.0..sroa_idx, align 1
  %5 = and i8 %st.sroa.7.0.copyload, 1
  %6 = icmp eq i8 %5, 0
  br i1 %6, label %if.lr.ph, label %L72

if.lr.ph:                                         ; preds = %top
  %7 = bitcast i8** %0 to double**
  %8 = load double*, double** %7, align 8
  %st.sroa.6.0..sroa_idx26 = getelementptr inbounds { i64, i64, %Nullable.64, i8 }, { i64, i64, %Nullable.64, i8 }* %3, i64 0, i32 2, i32 1
  %st.sroa.6.0.copyload = load i64, i64* %st.sroa.6.0..sroa_idx26, align 1
  %st.sroa.4.0..sroa_idx = getelementptr inbounds { i64, i64, %Nullable.64, i8 }, { i64, i64, %Nullable.64, i8 }* %3, i64 0, i32 2, i32 0
  %st.sroa.4.0.copyload = load i8, i8* %st.sroa.4.0..sroa_idx, align 1
  %st.sroa.3.0..sroa_idx18 = getelementptr inbounds { i64, i64, %Nullable.64, i8 }, { i64, i64, %Nullable.64, i8 }* %3, i64 0, i32 1
  %st.sroa.3.0.copyload = load i64, i64* %st.sroa.3.0..sroa_idx18, align 1
  %st.sroa.0.0..sroa_idx = getelementptr inbounds { i64, i64, %Nullable.64, i8 }, { i64, i64, %Nullable.64, i8 }* %3, i64 0, i32 0
  %st.sroa.0.0.copyload = load i64, i64* %st.sroa.0.0..sroa_idx, align 1
  %9 = getelementptr i8*, i8** %1, i64 1
  %10 = getelementptr i8*, i8** %1, i64 2
  %11 = bitcast i8** %10 to i64*
  %12 = bitcast i8** %9 to i64*
  %.sroa.264.sroa.0.0..sroa_idx = getelementptr inbounds [7 x i8], [7 x i8]* %.sroa.264.sroa.0, i64 0, i64 0
  %13 = getelementptr i8*, i8** %1, i64 4
  %14 = bitcast i8** %13 to i64*
  %.sroa.3.sroa.5.33..sroa.5.0..sroa_idx.sroa_idx = getelementptr inbounds [7 x i8], [7 x i8]* %.sroa.5, i64 0, i64 0
  %".sroa.3.sroa.3.sroa.2.sroa.0.0.#temp#6.sroa.4.sroa.4.sroa.3.sroa.0.0..sroa_idx85.sroa_idx" = getelementptr inbounds [7 x i8], [7 x i8]* %"#temp#6.sroa.4.sroa.4.sroa.3.sroa.0", i64 0, i64 0
  %"#temp#6.sroa.4.sroa.6.33..sroa_idx" = getelementptr inbounds [7 x i8], [7 x i8]* %"#temp#6.sroa.4.sroa.6", i64 0, i64 0
  %"#temp#6.sroa.0.0..sroa_idx" = getelementptr inbounds { [2 x i64], { i64, i64, %Nullable.64, i8 } }, { [2 x i64], { i64, i64, %Nullable.64, i8 } }* %4, i64 0, i32 0, i64 0
  %"#temp#6.sroa.3.0..sroa_idx43" = getelementptr inbounds { [2 x i64], { i64, i64, %Nullable.64, i8 } }, { [2 x i64], { i64, i64, %Nullable.64, i8 } }* %4, i64 0, i32 0, i64 1
  %"#temp#6.sroa.4.sroa.0.0.#temp#6.sroa.4.0..sroa_cast.sroa_idx" = getelementptr inbounds { [2 x i64], { i64, i64, %Nullable.64, i8 } }, { [2 x i64], { i64, i64, %Nullable.64, i8 } }* %4, i64 0, i32 1, i32 0
  %"#temp#6.sroa.4.sroa.3.0.#temp#6.sroa.4.0..sroa_cast.sroa_idx57" = getelementptr inbounds { [2 x i64], { i64, i64, %Nullable.64, i8 } }, { [2 x i64], { i64, i64, %Nullable.64, i8 } }* %4, i64 0, i32 1, i32 1
  %"#temp#6.sroa.4.sroa.4.sroa.0.0.#temp#6.sroa.4.sroa.4.0.#temp#6.sroa.4.0..sroa_cast.sroa_idx.sroa_idx" = getelementptr inbounds { [2 x i64], { i64, i64, %Nullable.64, i8 } }, { [2 x i64], { i64, i64, %Nullable.64, i8 } }* %4, i64 0, i32 1, i32 2, i32 0
  %"#temp#6.sroa.4.sroa.4.sroa.3.sroa.0.0.#temp#6.sroa.4.sroa.4.sroa.3.0.#temp#6.sroa.4.sroa.4.0.#temp#6.sroa.4.0..sroa_cast.sroa_idx.sroa_raw_idx.sroa_raw_cast" = bitcast { [2 x i64], { i64, i64, %Nullable.64, i8 } }* %4 to i8*
  %"#temp#6.sroa.4.sroa.4.sroa.3.sroa.0.0.#temp#6.sroa.4.sroa.4.sroa.3.0.#temp#6.sroa.4.sroa.4.0.#temp#6.sroa.4.0..sroa_cast.sroa_idx.sroa_raw_idx.sroa_raw_idx" = getelementptr inbounds i8, i8* %"#temp#6.sroa.4.sroa.4.sroa.3.sroa.0.0.#temp#6.sroa.4.sroa.4.sroa.3.0.#temp#6.sroa.4.sroa.4.0.#temp#6.sroa.4.0..sroa_cast.sroa_idx.sroa_raw_idx.sroa_raw_cast", i64 33
  %"#temp#6.sroa.4.sroa.4.sroa.3.sroa.3.0.#temp#6.sroa.4.sroa.4.sroa.3.0.#temp#6.sroa.4.sroa.4.0.#temp#6.sroa.4.0..sroa_cast.sroa_idx.sroa_raw_idx.sroa_idx87" = getelementptr inbounds { [2 x i64], { i64, i64, %Nullable.64, i8 } }, { [2 x i64], { i64, i64, %Nullable.64, i8 } }* %4, i64 0, i32 1, i32 2, i32 1
  %"#temp#6.sroa.4.sroa.5.0.#temp#6.sroa.4.0..sroa_cast.sroa_idx" = getelementptr inbounds { [2 x i64], { i64, i64, %Nullable.64, i8 } }, { [2 x i64], { i64, i64, %Nullable.64, i8 } }* %4, i64 0, i32 1, i32 3
  %"#temp#6.sroa.4.sroa.6.0.#temp#6.sroa.4.0..sroa_cast.sroa_raw_idx" = getelementptr inbounds i8, i8* %"#temp#6.sroa.4.sroa.4.sroa.3.sroa.0.0.#temp#6.sroa.4.sroa.4.sroa.3.0.#temp#6.sroa.4.sroa.4.0.#temp#6.sroa.4.0..sroa_cast.sroa_idx.sroa_raw_idx.sroa_raw_cast", i64 49
  %15 = bitcast i8** %1 to i8***
  %16 = getelementptr inbounds { [2 x i64], { i64, i64, %Nullable.64, i8 } }, { [2 x i64], { i64, i64, %Nullable.64, i8 } }* %4, i64 0, i32 0
  %.sroa.677.sroa.0.0..sroa.281.1..sroa_idx.sroa_idx = getelementptr inbounds [7 x i8], [7 x i8]* %.sroa.281, i64 0, i64 0
  %.sroa.979.33..sroa_idx = getelementptr inbounds [7 x i8], [7 x i8]* %.sroa.979, i64 0, i64 0
  br label %if

if:                                               ; preds = %if.lr.ph, %L44
  %i.095 = phi i64 [ %2, %if.lr.ph ], [ %32, %L44 ]
  %st.sroa.0.094 = phi i64 [ %st.sroa.0.0.copyload, %if.lr.ph ], [ %st.sroa.0.0.copyload17, %L44 ]
  %st.sroa.3.093 = phi i64 [ %st.sroa.3.0.copyload, %if.lr.ph ], [ %st.sroa.3.0.copyload20, %L44 ]
  %st.sroa.4.092 = phi i8 [ %st.sroa.4.0.copyload, %if.lr.ph ], [ %st.sroa.4.0.copyload22, %L44 ]
  %st.sroa.6.091 = phi i64 [ %st.sroa.6.0.copyload, %if.lr.ph ], [ %st.sroa.6.0.copyload28, %L44 ]
  %17 = and i8 %st.sroa.4.092, 1
  %18 = icmp eq i8 %17, 0
  br i1 %18, label %if7, label %L35

L72.loopexit:                                     ; preds = %L44
  br label %L72

L72:                                              ; preds = %L72.loopexit, %top
  ret i8** %0

if7:                                              ; preds = %if
  %19 = add i64 %st.sroa.3.093, 1
  br label %L35

L35:                                              ; preds = %if, %if7
  %s24.0 = phi i64 [ %19, %if7 ], [ %st.sroa.3.093, %if ]
  %v2.0 = phi i64 [ %st.sroa.3.093, %if7 ], [ %st.sroa.6.091, %if ]
  %20 = load i64, i64* %11, align 8
  %21 = icmp eq i64 %st.sroa.0.094, %20
  br i1 %21, label %if8, label %L41

if8:                                              ; preds = %L35
  %22 = load i64, i64* %12, align 1
  %"#temp#6.sroa.4.sroa.4.sroa.3.sroa.097" = getelementptr inbounds [7 x i8], [7 x i8]* %"#temp#6.sroa.4.sroa.4.sroa.3.sroa.0", i64 0, i64 0
  call void @llvm.memcpy.p0i8.p0i8.i64(i8* %"#temp#6.sroa.4.sroa.4.sroa.3.sroa.097", i8* %.sroa.264.sroa.0.0..sroa_idx, i64 7, i32 1, i1 false)
  %23 = load i64, i64* %14, align 8
  %24 = add i64 %23, 1
  %25 = icmp eq i64 %s24.0, %24
  %26 = zext i1 %25 to i8
  %"#temp#6.sroa.4.sroa.698" = getelementptr inbounds [7 x i8], [7 x i8]* %"#temp#6.sroa.4.sroa.6", i64 0, i64 0
  call void @llvm.memcpy.p0i8.p0i8.i64(i8* %"#temp#6.sroa.4.sroa.698", i8* %.sroa.3.sroa.5.33..sroa.5.0..sroa_idx.sroa_idx, i64 7, i32 1, i1 false)
  br label %L44

L41:                                              ; preds = %L35
  %27 = add i64 %st.sroa.0.094, 1
  %"#temp#6.sroa.4.sroa.4.sroa.3.sroa.0100" = getelementptr inbounds [7 x i8], [7 x i8]* %"#temp#6.sroa.4.sroa.4.sroa.3.sroa.0", i64 0, i64 0
  call void @llvm.memcpy.p0i8.p0i8.i64(i8* %"#temp#6.sroa.4.sroa.4.sroa.3.sroa.0100", i8* %.sroa.677.sroa.0.0..sroa.281.1..sroa_idx.sroa_idx, i64 7, i32 1, i1 false)
  %"#temp#6.sroa.4.sroa.6101" = getelementptr inbounds [7 x i8], [7 x i8]* %"#temp#6.sroa.4.sroa.6", i64 0, i64 0
  call void @llvm.memcpy.p0i8.p0i8.i64(i8* %"#temp#6.sroa.4.sroa.6101", i8* %.sroa.979.33..sroa_idx, i64 7, i32 1, i1 false)
  br label %L44

L44:                                              ; preds = %L41, %if8
  %"#temp#6.sroa.4.sroa.0.0" = phi i64 [ %22, %if8 ], [ %27, %L41 ]
  %"#temp#6.sroa.4.sroa.5.0" = phi i8 [ %26, %if8 ], [ 0, %L41 ]
  %"#temp#6.sroa.4.sroa.4.sroa.0.0" = phi i8 [ 0, %if8 ], [ 1, %L41 ]
  store i64 %st.sroa.0.094, i64* %"#temp#6.sroa.0.0..sroa_idx", align 8
  store i64 %v2.0, i64* %"#temp#6.sroa.3.0..sroa_idx43", align 8
  store i64 %"#temp#6.sroa.4.sroa.0.0", i64* %"#temp#6.sroa.4.sroa.0.0.#temp#6.sroa.4.0..sroa_cast.sroa_idx", align 8
  store i64 %s24.0, i64* %"#temp#6.sroa.4.sroa.3.0.#temp#6.sroa.4.0..sroa_cast.sroa_idx57", align 8
  store i8 %"#temp#6.sroa.4.sroa.4.sroa.0.0", i8* %"#temp#6.sroa.4.sroa.4.sroa.0.0.#temp#6.sroa.4.sroa.4.0.#temp#6.sroa.4.0..sroa_cast.sroa_idx.sroa_idx", align 8
  call void @llvm.memcpy.p0i8.p0i8.i64(i8* %"#temp#6.sroa.4.sroa.4.sroa.3.sroa.0.0.#temp#6.sroa.4.sroa.4.sroa.3.0.#temp#6.sroa.4.sroa.4.0.#temp#6.sroa.4.0..sroa_cast.sroa_idx.sroa_raw_idx.sroa_raw_idx", i8* %".sroa.3.sroa.3.sroa.2.sroa.0.0.#temp#6.sroa.4.sroa.4.sroa.3.sroa.0.0..sroa_idx85.sroa_idx", i64 7, i32 1, i1 false)
  store i64 %v2.0, i64* %"#temp#6.sroa.4.sroa.4.sroa.3.sroa.3.0.#temp#6.sroa.4.sroa.4.sroa.3.0.#temp#6.sroa.4.sroa.4.0.#temp#6.sroa.4.0..sroa_cast.sroa_idx.sroa_raw_idx.sroa_idx87", align 8
  store i8 %"#temp#6.sroa.4.sroa.5.0", i8* %"#temp#6.sroa.4.sroa.5.0.#temp#6.sroa.4.0..sroa_cast.sroa_idx", align 8
  call void @llvm.memcpy.p0i8.p0i8.i64(i8* %"#temp#6.sroa.4.sroa.6.0.#temp#6.sroa.4.0..sroa_cast.sroa_raw_idx", i8* %"#temp#6.sroa.4.sroa.6.33..sroa_idx", i64 7, i32 1, i1 false)
  %28 = load i8**, i8*** %15, align 8
  %29 = call double @"julia_#54_68176"(i8** %28, [2 x i64]* %16)
  %st.sroa.0.0.copyload17 = load i64, i64* %"#temp#6.sroa.4.sroa.0.0.#temp#6.sroa.4.0..sroa_cast.sroa_idx", align 8
  %st.sroa.3.0.copyload20 = load i64, i64* %"#temp#6.sroa.4.sroa.3.0.#temp#6.sroa.4.0..sroa_cast.sroa_idx57", align 8
  %st.sroa.4.0.copyload22 = load i8, i8* %"#temp#6.sroa.4.sroa.4.sroa.0.0.#temp#6.sroa.4.sroa.4.0.#temp#6.sroa.4.0..sroa_cast.sroa_idx.sroa_idx", align 8
  %st.sroa.6.0.copyload28 = load i64, i64* %"#temp#6.sroa.4.sroa.4.sroa.3.sroa.3.0.#temp#6.sroa.4.sroa.4.sroa.3.0.#temp#6.sroa.4.sroa.4.0.#temp#6.sroa.4.0..sroa_cast.sroa_idx.sroa_raw_idx.sroa_idx87", align 8
  %st.sroa.7.0.copyload30 = load i8, i8* %"#temp#6.sroa.4.sroa.5.0.#temp#6.sroa.4.0..sroa_cast.sroa_idx", align 8
  %30 = add i64 %i.095, -1
  %31 = getelementptr double, double* %8, i64 %30
  store double %29, double* %31, align 8
  %32 = add i64 %i.095, 1
  %33 = and i8 %st.sroa.7.0.copyload30, 1
  %34 = icmp eq i8 %33, 0
  br i1 %34, label %if, label %L72.loopexit
}

@nanosoldier
Collaborator

Your benchmark job has completed - possible performance regressions were detected. A full report can be found here. cc @jrevels

@JeffBezanson
Sponsor Member Author

Much better, but still a couple regressions. I'll try tweaking the new rule here.

@JeffBezanson
Sponsor Member Author

Ok, the first 2 regressions seem to be noise; I don't see any code differences. The sparse indexing regression is real, but a bit perverse: the simpler code this PR generates allows LLVM to vectorize over some of the integer vectors, which seems to cause a slight slowdown. A vectorization cost model issue I suppose. I'm not sure if we can or should do anything about it; that would amount to copying values from arrays into unnecessary temporary variables in hope of blocking vectorization, but we don't know when that would be profitable any better than LLVM does.

To reproduce the LLVM IR:

julia> m = sprand(100,100,0.1);

julia> j = find(rand(Bool,100));

julia> @code_llvm Base.SparseArrays.getindex_I_sorted_linear(m, j, j)

@Sacha0
Member

Sacha0 commented Mar 10, 2017

Sounds best to ignore the single, minor sparse indexing performance regression then? The widespread improvements this change provides are fantastic.

@KristofferC
Sponsor Member

I strongly agree

@JeffBezanson merged commit 0e970f0 into master Mar 10, 2017
@tkelman deleted the jb/IRcleanup branch March 10, 2017 20:14
@StefanKarpinski
Sponsor Member

Seems like the best course of action is to merge and then open an issue about the regression.

@tkelman
Contributor

tkelman commented Mar 10, 2017

Probably report the IR example upstream, if vectorization is causing a slowdown.

vtjnash added a commit that referenced this pull request Jul 9, 2020
Added in #22210 (and earlier begun in #20853), this is no longer
necessary to avoid heap allocations, and thus serves little purpose now.
vtjnash added a commit that referenced this pull request Jul 13, 2020
Added in #22210 (and earlier begun in #20853), this is no longer
necessary to avoid heap allocations, and thus serves little purpose now.
vtjnash added a commit that referenced this pull request Jul 20, 2020
Added in #22210 (and earlier begun in #20853), this is no longer
necessary to avoid heap allocations, and thus serves little purpose now.
vtjnash added a commit that referenced this pull request Sep 1, 2020
Added in #22210 (and earlier begun in #20853), this is no longer
necessary to avoid heap allocations, and thus serves little purpose now.
vtjnash added a commit that referenced this pull request Sep 3, 2020
Added in #22210 (and earlier begun in #20853), this is no longer
necessary to avoid heap allocations, and thus serves little purpose now.
c42f pushed a commit that referenced this pull request Sep 23, 2020
Added in #22210 (and earlier begun in #20853), this is no longer
necessary to avoid heap allocations, and thus serves little purpose now.