Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

GC safepoint and transition support #14190

Merged
merged 10 commits into from
Jan 5, 2016
Merged

GC safepoint and transition support #14190

merged 10 commits into from
Jan 5, 2016

Conversation

yuyichao
Copy link
Contributor

This is a work in progress to make our GC a stop the world GC rather than a wait for the world (and maybe dead lock?) GC (ok that probably doesn't sound too exciting...). The first two commits are #14181 which I made as a general improvement to our finalizers handling while trying to figure out how to synchronize finalizers and the GC with threading for this PR. (merged)

This is still missing the very important GC transition and safepoint support in codegen but I'm opening this as RFC now to make sure that the GC synchronization schematics I'm following makes sense.

I described the schmatics in the comment in gc.c. To summarize, we need GC safe point and GC transitions so that the GC doesn't have to wait for other threads forever (especially if the thread doing the GC is holding a lock). The safepoint is implemented using a volatile load from a special page which we mprotect to PROT_NONE during a GC and catch the SegFault in the signal handler. (This is similar to how the Java HotSpot GC does it.) The GC also makes sure (with a combination of lock and safe point) that no other thread can run the GC at the same time or modifying GC related data structures when the GC is running.

TODO/request for help:

  • The segfault handler for OSX is missing. It seems that I can't just wait in the signal handler since it's running on a different thread processing the messages from the kernel in series and it is necessary to do some low level stuff to make sure the thread can return to it's previous state after waiting for the GC to finish. Doing this on a platform that I can't test locally is beyond what I can confidently do so I need help from someone who's more familiar with this to implement it.

  • Codegen support:

    I'm currently planning to emit a GC transition around each critical region (store to object or stack) and allow the GC to run everywhere else. We can do a optimization pass to merge unnecessary GC transitions after the whole function is generated. I'm still deciding what I can assume about GC awareness for a function call and what's the convension of changing the GC state in a function. Note that this shouldn't require the LLVM statepoint support so it should work on LLVM 3.3 as well.

c.c. @vtjnash @kpamnany @carnaval @Keno

Close #13380
Close #14343
Close #10317

TODO for this PR:
  • SegFault handler for OSX
  • Rename managed / unmanaged to unsafe / safe
  • Run finalizers after the GC is finished and allowing GC to run in finalizers (Still need to update the comment in gc.c) done.
  • Easier way to breakpoint on actual segfault (jl_critical_error should be good enough)
TODO that can be done in future PRs:

@yuyichao yuyichao added domain:multithreading Base.Threads and related functionality GC Garbage collector labels Nov 30, 2015
@yuyichao yuyichao force-pushed the yyc/gc/safepoint branch 4 times, most recently from 2bf2e16 to da7b6e0 Compare December 1, 2015 03:30
@yuyichao yuyichao mentioned this pull request Dec 1, 2015
JL_DLLEXPORT void jl_gc_collect(int full)
{
if (!is_gc_enabled || jl_in_gc)
return;
Copy link
Sponsor Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

should there be a statepoint here?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No. Multiple threads entering the GC is handled separately below to avoid running the GC multiple times.

Copy link
Sponsor Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i didn't realize jl_in_gc was thread-local now. it seems like that flag should never be true here now in that case?

to return to the original purpose of this flag, it should perhaps be renamed jl_in_finalizers and moved to only protecting the finalizer queue from task switches? the original intent of this flag was to disallow finalizers from descheduling the thread and switching to running other tasks at unexpected yield points (any allocation).

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It could be true in finalizer.

it should perhaps be renamed jl_in_finalizers

Yeah, that's effectively what it is for now.

@JeffBezanson
Copy link
Sponsor Member

This is great. I think we might be able to merge this fairly soon, since even without codegen support it appears to be strictly better than what we have now, where there is effectively a single GC statepoint (plus it doesn't work).

// Add two arguments, one for the default value of `old_state`, one to satisfy
// ISO C++11 standard.
#define jl_gc_state_set_and_save(...) \
jl_gc_state_set_and_save__(__VA_ARGS__, 0, 0)
Copy link
Sponsor Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

"Variadic macros are a new feature in C99. GNU CPP has supported them for a long time, but only with a named variable argument (‘args...’, not ‘...’ and __VA_ARGS__). If you are concerned with portability to previous versions of GCC, you should use only named variable arguments. On the other hand, if you are concerned with portability to other conforming implementations of C99, you should use only __VA_ARGS__."

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think the only question is what does MSVC support and IIRC it supports the __VA_ARGS__ version in c++ mode. @tkelman

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

And in the end this is just some trick to avoid having two names so can easily be removed if we can't support on all compilers.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

d:/code/msys64/home/Tony/julia/src/gc.c(2480) : error C2660: 'jl_gc_state_set_an
d_save_' : function does not take 3 arguments

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

... fine... I can just rename it.........

@yuyichao
Copy link
Contributor Author

yuyichao commented Dec 3, 2015

Yes, it should be mostly safe to merge in it's current state. The missing part for the state point is OSX support and I'm hoping someone more familiar with x64 calling convension and OSX signal handling can implement that ;-).

I'm currently working on settling on a GC state transition convention and make sure the the runtime follows that convension (which is time consuming since I need to go through all the functions in src/).

As I mentioned in other threads, there's a few other GC fixes that are kind of independent with this PR including locking for write barrier and scanning thread stack correctly. Those should be relatively easy to fix, I'll probably submit a PR for those this weekend but anyone else should feel free to fix those.

@JeffBezanson
Copy link
Sponsor Member

IIUC, the purpose of the GC state transitions is to allow GC inside unmanaged code? I don't really like the idea of needing to insert lots of explicit state transitions all over. I would rather interrupt threads asynchronously and do conservative register scanning.

// TODO: before we support getting return value from
// the work, we don't need to enter GC managed region
// when starting the work.
int8_t state = jl_gc_managed_enter();
Copy link
Sponsor Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

should this state be added to the exception state (jl_enter / jl_leave / jl_rethrow)?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, it is on my uncommitted local changes.

@vtjnash
Copy link
Sponsor Member

vtjnash commented Dec 3, 2015

The segfault handler for OSX is missing. It seems that I can't just wait in the signal handler since it's running on a different thread processing the messages from the kernel in series and it is necessary to do some low level stuff to make sure the thread can return to it's previous state after waiting for the GC to finish. Doing this on a platform that I can't test locally is beyond what I can confidently do so I need help from someone who's more familiar with this to implement it.

i believe the right approach is to return MIG_NO_REPLY from catch_exception_raise (which will leave the thread unscheduled), then call thread_resume to continue running it later after gc is finished.

@yuyichao
Copy link
Contributor Author

yuyichao commented Dec 3, 2015

I agree that could be simpler if we have a conservative scan already. (The current approach isn't too bad though, and it's nice to see that some functions can actually be run simultaneously with the GC).

More importantly, I think it is still possible to pause a thread at a point that can cause corruption. (in the GC allocation functions, for example). I'm also not totally convinced that the GC can have a consistent view of the object if we pause the thread at any time. (e.g. what if we pause a thread in the middle of a buffer realloc). I'm probably a little pessimistic about this but I think it might be less code to add indeed but takes equal amount of effort to make sure we've covered all the cases (given how many GC roots we've missed in the past...)

@yuyichao
Copy link
Contributor Author

yuyichao commented Dec 3, 2015

Another thing I'm worrying is that what would happen if the thread haven't got a change to rerun the segfault'd load before the next GC is triggered?

@vtjnash
Copy link
Sponsor Member

vtjnash commented Dec 3, 2015

it should re-trigger the statepoint, which would be fine. there might be a race condition with testing and setting jl_gc_running / ptls->state though

@JeffBezanson
Copy link
Sponsor Member

Will this, for example, require switching gc states around every ccall? I would not be ok with that.

I think we could stage this by first just merging the new gc waiting mechanism, with minimal state points: only in jl_gc_collect and when entering or exiting threaded regions. Then the statepoints can be slow, we can leave the trickier parts for later, and in the meantime threading will be much more stable.

@yuyichao
Copy link
Contributor Author

yuyichao commented Dec 4, 2015

Will this, for example, require switching gc states around every ccall?

No

I would not be ok with that.

Me neither.

The plan is to run in unmanaged mode by default and only add transitions around critical regions (write barrier and gc frame store) and then merge the critical regions if there's no code in between that has to be running in unmanaged mode (edit: ccall of function that doesn't have jl_value_t* argument or return value). This way we only do the expensive transition when there's more expensive operations (note that the source of the store are usually also expensive).

The worst case is of course if you have alternating code that need to be managed and unmanaged and if the source of the store is actually cheap (in which case the conservative scan should perform well). I did a small benchmark to check what the overhead would be with the code below (note that in this case we can hoist the transition out of the loop and only leave the safepoint check in the loop but it should be the worst case we could ever have).

The timing on my laptop shows that f1 is ~10% faster then f2 (no difference if I hoist the transition) when I'm only setting the field a. There's no measurable difference if I set both field.

type Obj{T}
    a::T
    b::T
end

function f1(n, p1, p2, obj, a, b)
    for i in 1:n
        obj.a = a
        # obj.b = b
    end
end

function f2(n, p1, p2, obj, a, b)
    for i in 1:n
        Base.llvmcall("""
                      store volatile i8 3, i8* %0
                      %3 = load volatile i64, i64* %1
                      ret void
                      """, Void, Tuple{Ptr{Int8},Ptr{Int64}},
                      Ptr{Int8}(p1), Ptr{Int64}(p2))
        obj.a = a
        # obj.b = b
        Base.llvmcall("""
                      store volatile i8 0, i8* %0
                      ret void
                      """, Void, Tuple{Ptr{Int8}}, Ptr{Int8}(p1))
    end
end

using Benchmarks

a = Ref(1)
b = Ref(1)
obj = Obj(a, b)
b1 = Ref(1)
b2 = Ref(1)
p1 = pointer_from_objref(b1)
p2 = pointer_from_objref(b2)

@show @benchmark f1(10, p1, p2, obj, a, b)
@show @benchmark f2(10, p1, p2, obj, a, b)

edit:
The value of the volatile store is actually wrong but that doesn't affect the performance.

@JeffBezanson
Copy link
Sponsor Member

Thanks, I understand it better now.

@yuyichao yuyichao force-pushed the yyc/gc/safepoint branch 6 times, most recently from 486f196 to 635b995 Compare December 6, 2015 04:36
@yuyichao
Copy link
Contributor Author

Rebased on top of #14501 . Also fixes other TLS references in the signal handling thread on OSX.

@@ -194,7 +194,9 @@ class JuliaJITEventListener: public JITEventListener
virtual void NotifyFunctionEmitted(const Function &F, void *Code,
size_t Size, const EmittedFunctionDetails &Details)
{
int8_t gc_state = jl_gc_state_save_and_set(3);
Copy link
Sponsor Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This 3 should have a name.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done. Also for the 1.

I also change it to 2 instead since I don't think we need it to be a bit flag (which was why I use 3 instead of 2 to begin with).

@yuyichao yuyichao force-pushed the yyc/gc/safepoint branch 2 times, most recently from 428d883 to ac55af5 Compare December 30, 2015 15:33
@yuyichao
Copy link
Contributor Author

yuyichao commented Jan 5, 2016

Rebased and CI all pass. Given that there's only very minor comments and updates in the past two weeks and people have probably back to work now, I'll merge this sometime later this week. It is very hard to make further progress on threading without merging the current PR's.

@JeffBezanson
Copy link
Sponsor Member

Yes, let's merge this. Amazing work @yuyichao.

I have a couple more questions that are not urgent and don't block merging:

  • Where is the initial gc state for the master thread set? I expected to see something in ti_initthread.
  • It looks like waiting for gc uses busy waiting? Can that be fixed at some point?

JeffBezanson added a commit that referenced this pull request Jan 5, 2016
GC safepoint and transition support
@JeffBezanson JeffBezanson merged commit 6cc48dc into master Jan 5, 2016
@yuyichao
Copy link
Contributor Author

yuyichao commented Jan 5, 2016

Where is the initial gc state for the master thread set? I expected to see something in ti_initthread.

I assumed that the tls buffer is 0 initialized since it's a static variable in C. Should we explicitly initialize that?

It looks like waiting for gc uses busy waiting? Can that be fixed at some point?

Right, that follows from what we use for our runtime lock. I don't like this busy loop either and I think both should be fixed (especially for the cases we are doing expensive stuffs in the runtime). FWIW, the pthread lock performs better than our spin lock on my laptop when I was benchmarking #14451.

@yuyichao yuyichao deleted the yyc/gc/safepoint branch January 5, 2016 18:42
@JeffBezanson
Copy link
Sponsor Member

Should we explicitly initialize that?

Yes I think it's clearer, just for the sake of reading the init code and wanting to know the initial state.

@yuyichao
Copy link
Contributor Author

yuyichao commented Jan 5, 2016

Done #14565

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
domain:multithreading Base.Threads and related functionality GC Garbage collector
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

4 participants