
avoid jl_arrayunset in dicts with bitstypes; add some more @inbounds #30113

Merged: 3 commits, Nov 30, 2018

Conversation

chethega (Contributor)

No description provided.

@StefanKarpinski (Sponsor, Member) left a comment

Seems like a good optimization. Also, doesn't jl_arrayunset just write a NULL pointer value in the non-bits type case? We don't need to call a C function to do that.

@chethega (Contributor, Author) commented Nov 21, 2018

Also, doesn't jl_arrayunset just write a NULL pointer value in the non-bits type case? We don't need to call a C function to do that.

AFAIK yes. In my own code, I would just unsafe_store! a zero.

But then, with the planned TBAA on arrays, I am not sure whether this would still be allowed in the future. Also, I'm not entirely sure about surprising assumptions about the object being rooted. So I left it in, for the sake of maintenance. @Keno ?

Also, in cases where we have a union, we might need to deal with selector bytes? Say, Union{Missing,Nothing}. So this looks more complicated.

@Keno (Member) commented Nov 21, 2018

unsafe_store! is always legal since it doesn't have TBAA information, just potentially suboptimal because it prevents the compiler from knowing whether it interferes with another store. No worse than a ccall of course.

@chethega (Contributor, Author)

Cool. Then we could write a fast pure-Julia array_unset that replaces the ccall with an unsafe_store!, is inlined away for bits types, and deals correctly with small bits unions (probably a no-op there too?).

That would be a separate PR, though (it would need to go through all call sites of jl_arrayunset and document the new function as the preferred way of unsetting array items).

@yuyichao (Contributor) commented Nov 21, 2018

You can also just do this in codegen. That's a much less breaking change, and you can even get better aliasing info that way.

@chethega (Contributor, Author)

You can also just do this in codegen. That's a much less breaking change, and you can even get better aliasing info that way.

By this you mean introducing a new intrinsic? I don't think I'm sufficiently proficient with codegen to do that quickly.

It appears that the only JuliaLang call site of jl_arrayunset is in dict.jl. The only other call site Google could find is in DataStructures.jl. Obviously jl_arrayunset needs to stay in the C interface, because the functionality is needed there.

We could put a faster second implementation into array.jl. This one would be a new feature (maybe documented and exported), e.g.:

@propagate_inbounds function array_unset(A::Array{T}, idx::Int) where {T}
    @boundscheck checkbounds(A, idx)
    # bits types and bits unions have nothing to unset
    (isbitstype(T) || isbitsunion(T)) && return nothing
    GC.@preserve A begin
        ptr = convert(Ptr{Ptr{Cvoid}}, pointer(A, idx))
        unsafe_store!(ptr, C_NULL)
    end
    return nothing
end

That one could be upgraded into an intrinsic or a better version later (one that correctly propagates aliasing info instead of being potentially suboptimal).

But seeing that jl_arrayunset is not used in a lot of user code, I'm not sure whether an intrinsic is worth it.

@yuyichao (Contributor)

By this you mean introducing a new intrinsic?

No. I mean https://github.com/JuliaLang/julia/pull/20890/files

You can just follow the template.

@yuyichao (Contributor) commented Nov 21, 2018

More specifically, an untested patch:

diff --git a/src/ccall.cpp b/src/ccall.cpp
index 0a9db2ca1d..4a6568e501 100644
--- a/src/ccall.cpp
+++ b/src/ccall.cpp
@@ -1764,6 +1764,29 @@ static jl_cgval_t emit_ccall(jl_codectx_t &ctx, jl_value_t **args, size_t nargs)
             }
         }
     }
+    else if (is_libjulia_func(jl_arrayunset) &&
+             argv[1].typ == (jl_value_t*)jl_ulong_type) {
+        assert(!isVa && !llvmcall && nargt == 2 && !addressOf.at(0) && !addressOf.at(1));
+        jl_value_t *aryex = ccallarg(0);
+        const jl_cgval_t &aryv = argv[0];
+        const jl_cgval_t &idxv = argv[1];
+        jl_datatype_t *arydt = (jl_datatype_t*)jl_unwrap_unionall(aryv.typ);
+        if (jl_is_array_type(arydt)) {
+            jl_value_t *ety = jl_tparam0(arydt);
+            if (jl_array_store_unboxed(ety)) {
+                JL_GC_POP();
+                return ghostValue(jl_void_type);
+            }
+            else if (!jl_has_free_typevars(ety)) {
+                Value *idx = emit_unbox(ctx, T_size, idxv, (jl_value_t*)jl_ulong_type);
+                Value *arrayptr = emit_bitcast(ctx, emit_arrayptr(ctx, aryv, aryex), T_ppjlvalue);
+                Value *slot_addr = ctx.builder.CreateGEP(arrayptr, idx);
+                tbaa_decorate(tbaa_arraybuf, ctx.builder.CreateStore(V_null, slot_addr));
+                JL_GC_POP();
+                return ghostValue(jl_void_type);
+            }
+        }
+    }
     else if (is_libjulia_func(jl_string_ptr)) {
         assert(lrt == T_size);
         assert(!isVa && !llvmcall && nargt == 1 && !addressOf.at(0));

edit: the previous patch changed the wrong version of the copy.

@chethega (Contributor, Author)

So if I look at this, we would lose bounds checking with the codegen variant. That would be a minor change, similar to how the current jl_array_isassigned checks bounds in the C API but does not check bounds when called from inside Julia? (Since Google knows of only two call sites, the amount of affected code is probably limited.)

As a second question, this mechanism could probably be used to inline the fast path of jl_array_grow_end, right? That would significantly speed up push! and hence filter and collect of unknown-length iterators. That would be a major performance improvement for a lot of code.

For that, I would need to generate code that pessimistically checks whether the very fast path is applicable, and otherwise calls into array.c. Maybe it would be simpler to have a jl_array_grow_end_fastpath that either grows the array and returns true, or returns false if the capacity is exceeded. It would then be the job of array.jl to first attempt the fast growth, and otherwise call into array.c. This would duplicate the branches, but that shouldn't matter (we will at least pay a realloc, and the branch should be well predicted because we only call into the slow path after the fast path has failed).

@yuyichao (Contributor)

So if I look at this, we would lose bounds checking with the codegen variant. That would be a minor change, similar to how the current jl_array_isassigned checks bounds in the C API but does not check bounds when called from inside Julia? (Since Google knows of only two call sites, the amount of affected code is probably limited.)

Yes. You can include the check if you want; see emit_array_nd_index and its caller if you want to do that. Pulling the bounds check out of the C API makes bounds-check elimination easier, though.

As a second question, this mechanism could probably be used to inline the fast path of jl_array_grow_end, right? That would significantly speed up push! and hence filter and collect of unknown-length iterators. That would be a major performance improvement for a lot of code.

Yes for implementing; the speedup is unclear, and it'll be slightly harder to implement since you have more branches to check even in the fast case. Unfortunately, the branches in the fast path aren't inferable (e.g. whether the data is shared). You can always give up when you don't like the input types (bits unions, for example), so you can start simple if needed.

For that, I would need to generate code that pessimistically checks whether the very fast path is applicable, and otherwise calls into array.c. Maybe it would be simpler to have a jl_array_grow_end_fastpath that either grows the array and returns true, or returns false if the capacity is exceeded. It would then be the job of array.jl to first attempt the fast growth, and otherwise call into array.c. This would duplicate the branches, but that shouldn't matter (we will at least pay a realloc, and the branch should be well predicted because we only call into the slow path after the fast path has failed).

Last time I checked the assembly for jl_array_grow_*, the code was well optimized for the fast path. That was before the bits-union support was added, though. If the bits-union code is what's slowing down the fast path, then yes, that'll help. If you are only hit by function-call overhead, it won't help, since you still always have a function call. Also, you only need codegen support if you can't implement it in Julia (no access to GC-free code, no way to provide TBAA info, etc.); if it's just about replacing one ccall with two, you can just change the Julia code.

@yuyichao (Contributor)

If the bits-union code is what's slowing down the fast path, then yes, that'll help.

Looking at the current version of the code, I wouldn't be surprised if the bits-union code adds a lot of overhead. There are clearly a lot of inferable branches in the fast path. In this case, I think you can just try something like this:

diff --git a/src/array.c b/src/array.c
index e058d185e4..228d5294f1 100644
--- a/src/array.c
+++ b/src/array.c
@@ -702,7 +702,7 @@ static size_t limit_overallocation(jl_array_t *a, size_t alen, size_t newlen, si
 }

 STATIC_INLINE void jl_array_grow_at_beg(jl_array_t *a, size_t idx, size_t inc,
-                                        size_t n)
+                                        size_t n, int maybe_bitunion)
 {
     // designed to handle the case of growing and shrinking at both ends
     if (__unlikely(a->flags.isshared)) {
@@ -722,7 +722,7 @@ STATIC_INLINE void jl_array_grow_at_beg(jl_array_t *a, size_t idx, size_t inc,
     char *newdata;
     char *typetagdata;
     char *newtypetagdata;
-    int isbitsunion = jl_array_isbitsunion(a);
+    int isbitsunion = maybe_bitunion && jl_array_isbitsunion(a);
     if (isbitsunion) typetagdata = jl_array_typetagdata(a);
     if (a->offset >= inc) {
         // already have enough space in a->offset

And pass maybe_bitunion=0 in something like a jl_array_grow_*_nobitsunion. You can swap the call either in the Julia code or in codegen if you feel like it. The compiler's constant propagation usually does a really good job here, and this is the method with the least code duplication (you don't need to reimplement a perfectly fine C function in either LLVM IR or Julia).

@chethega (Contributor, Author)

Last time I checked the assembly code for jl_array_grow_* the code is well optimized for the fast path.

Sure, but the function-call overhead is killing us. In principle, we should need two well-predicted branches (flag and capacity) and two increments. I measure ~10 cycles per fast-path push!, which is pretty bad compared to what we should get when we inline this (probably limited by one branch retired per cycle in tight loops; maybe loops can even be rewritten so that the flag branch can be hoisted):

julia> using BenchmarkTools
julia> N = 10_000; arr = collect(1:N);
julia> fp(arr, N) = (empty!(arr); for i = 1:N push!(arr, i) end;)
julia> (@belapsed fp($arr, $N)) * 2 * 10^9 / N  # ≈ cycles per push! at 2 GHz
12.58

@chethega (Contributor, Author)

From my perspective this is ready to merge.

I think the travis-ci failure is spurious; is there a way to trigger a new attempt?

Further improvements, i.e. inlining ccall(:jl_arrayunset, ...) in codegen, can be a separate PR, with different trade-offs re maintainability and C-API stability.

@JeffBezanson JeffBezanson merged commit 6bfe947 into JuliaLang:master Nov 30, 2018
@KristofferC mentioned this pull request Dec 6, 2018
@KristofferC (Sponsor, Member)

Would have been good to squash.

Labels: performance (Must go faster)
7 participants