Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Use safepoint to deliver SIGINT #16174

Merged
merged 9 commits into from
May 6, 2016
Merged

Conversation

yuyichao
Copy link
Contributor

@yuyichao yuyichao commented May 3, 2016

This implements @vtjnash 's idea of delivering sigint in a safer way, without having to add sigatomic to everywhere. This also solves a few performance issues along the way. A brief summary of the differences,

  • Make GC safepoint thread local

    And create 3 signal pages. See the comment in safepoint.c for detail.
    This have a very minor performance hit for accessing safepoint. However, I think we can avoid the safepoint and GC transition around gcframe setup and gcframe pop on x86 so this shouldn't be a big issue. (On ARM, the atomic release store is more expensive than two GC safepoint so we may use a normal store with two safepoint instead, the performance hit of making the safepoint thread local is still very minor in this case though (~5-10% for empty GC frame push/pop, cheaper than the tls getter call....)).

  • Use the GC safepoint mechanism (i.e. SegFault) to deliver InterruptException at known points

    This automatically makes ccall of external library sigatomic. (Fix Segfault when aborting matrix multiplication #1468, Fix make ccall sigatomic (defer SIGINT handling) #2622). In order to avoid waiting too long before the exception is delivered when running unmanaged code, this also implement waking up libuv, abort syscall and abort libsupport io (this use a hack but can be generalized if needed) when sigint arrives. This also implement force throwing of exception if SIGINT arrives too frequently so that one can use Ctrl-C to force abort some dead loop. A warning will be printed in such case since it bypass the safe path and even sigatomic.

    The most important consequence IMHO is that pressing Ctrl-C during sysimg compilation should no longer segfault =)

    This should also make it easier to implement signal handling #14675.

  • Make parser and type inference calls sigatomic

    sigatomic is much cheaper (see below) now so it is now used to protect important runtime code to not be interrupted by Ctrl-C.

  • Clean up safepoint and sigatomic, optimize try-catch and sigatomic

    • Split out safepoint.c, make defer_signal thread local and avoid the expensive atomic ops

    • Make defer_signal task local and automatically restore on expection. This eliminate the try-catch needed before in disable_sigint.

      This is technically breaking but sigatomic_begin and sigatomic_end aren't exported.

    • Inline jl_sigatomic_begin and jl_sigatomic_end in codegen.

    Benchmarking try-catch and disable_sigint

    julia> @inline f1() = try end
    f1 (generic function with 1 method)
    
    julia> @inline f2() = disable_sigint(()->nothing)
    f2 (generic function with 1 method)
    
    julia> g1(n) = for i in 1:n
               f1()
           end
    g1 (generic function with 1 method)
    
    julia> g2(n) = for i in 1:n
               f2()
           end
    g2 (generic function with 1 method)
    
    julia> function k(n)
               @time g1(n)
               @time g2(n)
           end
    k (generic function with 1 method)

    Timing before

    julia> k(10_000_000)
      0.264041 seconds
      0.401437 seconds

    Timing after

    julia> k(10_000_000)
      0.122592 seconds                                                                
      0.012506 seconds

Close #12333
Close #12309

// that throws a SIGINT.
jl_gc_state_save_and_set(eh->gc_state);
if (eh->defer_signal == 0) {
jl_sigint_safepoint();
Copy link
Sponsor Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i'm fairly certain we don't actually want catch to be a safepoint (now that it's avoidable)

Copy link
Contributor Author

@yuyichao yuyichao May 3, 2016

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The sigint safepoint is kind of avoidable (by reenable the gc safepoint, edit: however, this would be expensive...) I don't think it's valid to avoid the GC safepoint in the gc state restore though.

Copy link
Sponsor Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

that's fine. i just don't think we want to throw an InterruptException before the catch block has a chance to install a try/catch of its own.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We can't really provide the grantee that an async signal will not be thrown between a catch and a try though, unless we garentee it can only be compiled and not interpreted. The GC safepoint can also throw the exception even when there's not a sigint safepoint.

@stevengj
Copy link
Member

stevengj commented May 3, 2016

Suppose I have a ccall with a Julia callback function, e.g. ccall(:my_numeric_integration, Cdouble, (Ptr{Void},), cfunction(my_julia_integrand, ...)). If I send a SIGINT while this is executing, does this mean that it will not interrupt the C code, but an InterruptException will still occur during the my_julia_integrand execution?

Also, if you have multiple Tasks running, is there any way to indicate which Task you want to catch the exception? (This is an issue in IJulia.)

@yuyichao
Copy link
Contributor Author

yuyichao commented May 3, 2016

A callback can still throw exception so sigatomic_begin/end are still useful. I made it fast because of this...

The sigint is always delivered to current task on the master thread if it doesn't block sigint (I realized that I might have missed a pending sigint check there, similar to the one in en_restore_state). However, you can block the sigint on the task you don't want it to be delivered by calling sigatomic_begin, is this good enough?

@stevengj
Copy link
Member

stevengj commented May 3, 2016

I'm still confused by the callback situation in this PR. If I want InterruptExceptions to be thrown during my callback (e.g. I have a try-catch block that handles them correctly), do I need to call sigatomic_end/begin in my callback?

@yuyichao
Copy link
Contributor Author

yuyichao commented May 3, 2016

For julia callback from C, I think a much faster sigatomic is as far as what this approach can provide.

For the task issue, I'm wondering if we can make non-root tasks sigatomic by default? This way any frontend will need to explicitly reenable sigint on tasks that want to handle sigint.

@yuyichao
Copy link
Contributor Author

yuyichao commented May 3, 2016

(e.g. I have a try-catch block that handles them correctly), do I need to call sigatomic_end/begin in my callback?

A sigint can be delivered in the callback if the C caller wasn't called in sigatomic region. It is a little unreliably though since in principle a sigint can be delivered before the try block starts so I think to do this reliably you need to call the C function with sigatomic and reenable sigatomic in the callback after you started the try block.

@stevengj
Copy link
Member

stevengj commented May 3, 2016

I'm not sure if non-root tasks should be sigatomic by default. But it would be nice to have an option to make the task sigint-unaware when it is created. e.g. @async_nosigint

I'm not sure how calling sigatomic_begin/end will help in Tasks, right now. The problem is that sigatomic_end is still called within the Task, and as soon as it is called any deferred SIGINT will be thrown as an exception in that Task. What if I don't want that Task to get InterruptExceptions at all (but I still want them received in other Tasks)?

And what if several tasks are blocking on I/O, and I want a SIGINT to wake up a certain task to catch the n InterruptException?

@stevengj
Copy link
Member

stevengj commented May 3, 2016

you need to call the C function with sigatomic and reenable sigatomic in the callback after you started the try block.

Yes, that's what I'm doing now.

@yuyichao
Copy link
Contributor Author

yuyichao commented May 3, 2016

You can call sigatomic_begin without calling end to permanently block sigint on a task. Doing that at initialization time is also possible since I accidentally made defer_signal a property of the task (instead of just save it on the stack on task switch....)

@yuyichao yuyichao force-pushed the yyc/threads/safepoint-signal branch 2 times, most recently from 1579cfe to 033cfc1 Compare May 4, 2016 16:10
@vtjnash
Copy link
Sponsor Member

vtjnash commented May 4, 2016

What is the sigatomic state of a new Task? e.g.

sigatomic_begin()
@async println("am I sigatomic?")
sigatomic_end()
yield()

@stevengj
Copy link
Member

stevengj commented May 4, 2016

@yuyichao, sigatomic_begin() will only block SIGINT on the current task? I thought it increments jl_defer_signal, which is global to all tasks?

@yuyichao
Copy link
Contributor Author

yuyichao commented May 4, 2016

sigatomic is now thread and task local. New task currently (on this branch) inherit the state from when it's started... since I haven't set fix start_task for this yet...

@yuyichao yuyichao force-pushed the yyc/threads/safepoint-signal branch 2 times, most recently from 23088cc to 1b9c1a2 Compare May 4, 2016 23:19
@yuyichao
Copy link
Contributor Author

yuyichao commented May 5, 2016

Travis linux32 failure is #15920

@yuyichao yuyichao force-pushed the yyc/threads/safepoint-signal branch 2 times, most recently from 246654c to d6035ef Compare May 5, 2016 04:36
while (!ptls->gc_state) {
// FIXME: The acquire load pairs with the release store
// in the signal handler of safepoint so we are sure that
// all the memory write on those threads are visible. However,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

memory writes? memory written ... is?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe store is the right word?

@yuyichao yuyichao force-pushed the yyc/threads/safepoint-signal branch from d6035ef to 88f7a52 Compare May 5, 2016 14:39
@yuyichao
Copy link
Contributor Author

yuyichao commented May 5, 2016

Spelling mistakes fixed. Threading CI passed AppVeyor, travis. This should be ready for final review now.

Depending on the caller these functions can be called from unmanaged threads
that cannot touch GC state or having GC safepoint.
Add note and function prototypes about using safepoint to deliver SIGINT.
Preparing for using the safepoint to deliver SIGINT, which requires distinguish
between master and worker threads.
* Remove unnecessary sigatomic
* Make flisp calls sigatomic
* Make type inference calls sigatomic
* Refactor interthread communication through signal
* Make sure `sleep` is aborted on `SIGINT` on Linux to deliver the exception faster
* Implement force signal throwing when `SIGINT` arrives too frequently
* Hack to abort io syscall on `SIGINT`

Fix #1468; Fix #2622; Towards #14675
@yuyichao yuyichao force-pushed the yyc/threads/safepoint-signal branch from 88f7a52 to c589e45 Compare May 6, 2016 02:17
@vtjnash vtjnash merged commit b67fa90 into master May 6, 2016
@vtjnash vtjnash deleted the yyc/threads/safepoint-signal branch May 6, 2016 15:05
@stevengj
Copy link
Member

stevengj commented May 6, 2016

It seems like the documentation of the sigatomic functions should be revisited. (I'd also like to see documentation of how to catch ctrl-c in callback functions, disable ctrl-c in certain tasks, etcetera.)

accum_weight = 0;
return 0;
}
double new_weight = accum_weight * exp(-(dt / 1e9)) + 0.3;
Copy link
Sponsor Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Clearly written by a physicist :)

@yuyichao
Copy link
Contributor Author

yuyichao commented May 6, 2016

I've also noticed one case where Ctrl-C can still segfault. It is when we throw the error without any handler, and the execution of on_exit overflows the signal handler stack..... (at least on linux where we throw directly from the signal handler frame). A easy fix would be setting up a top level try catch frame at initialization time. The alternative would be to return to a call to jl_throw from the signal handler instead. However, it seems that setcontext or makecontext using a signal context might be undefined....

@@ -267,6 +268,7 @@ static jl_ast_context_t *jl_ast_ctx_enter(void)

static void jl_ast_ctx_leave(jl_ast_context_t *ctx)
{
JL_SIGATOMIC_END();
Copy link
Sponsor Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This breaks --lisp; segfaults.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah... ok.... will fix later today....

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

And in case you need it now, I think making jl_sigint_safepoint noop should work around it for now

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed in 417e49a

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

should be tested if breaking it matters

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

run(pipeline(DevNull, $(Base.julia_cmd()) --lisp, DevNull)) ?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

something like that. might need to be unix only, would need to run it through CI first

@tkelman
Copy link
Contributor

tkelman commented May 6, 2016

This PR broke the Windows buildbots: https://build.julialang.org/builders/package_win6.2-x64/builds/247/steps/make/logs/stdio

Similar to #11727, but now bootstrap freezes when output is redirected instead of just being unable to run tests. @vtjnash @yuyichao please advise, we'll have to revert this if no solution can be found.

@yuyichao
Copy link
Contributor Author

yuyichao commented May 6, 2016

The only commit that touches libuv is the last one and reverting it seems to get rid of the problem for me locally (tested by piping it through tee) I'm checking what is the minimum change that triggers this now...

@yuyichao
Copy link
Contributor Author

yuyichao commented May 6, 2016

Something is fishy here. The buildbot seems to have been printing jl_uv_writecb warnings forever during compilation http:https://buildbot.e.ip.saba.us:8010/builders/package_win6.2-x64/builds/245/steps/make/logs/stdio The freezing happens on current master after printing the first warning. However, either with or without the commit reverted, the build passes without printing any warnings if inference.ji exists before I run make. (i.e. if I run make again after it freezes, the build passes without any warning).

@vtjnash
Copy link
Sponsor Member

vtjnash commented May 6, 2016

If you follow that issue, you'll discover it's an upstream bug where libuv forgot to initialize the synchronization object. I think it then corrupts its internal state and tries to report an error that didn't happen by forgetting about a write req that did happen.

@yuyichao
Copy link
Contributor Author

yuyichao commented May 7, 2016

If you follow that issue

Which issue? I didn't see any mentioning of synchronization object in #11727?

@vtjnash
Copy link
Sponsor Member

vtjnash commented May 7, 2016

you have to follow it somewhat far. there's a more concise explanation at libuv/libuv#629, in which I demonstrate how it would make nodejs to hang (for the same reason) in a one-liner.

@yuyichao
Copy link
Contributor Author

yuyichao commented May 7, 2016

Ah, that one, I somehow think you are talking about uv_async_t since apparently simply initializing (not even using it) one is causing the trouble on windows......

@tkelman
Copy link
Contributor

tkelman commented May 10, 2016

@yuyichao
Copy link
Contributor Author

I believe it's caused by a llvm bug (probably similar to https://llvm.org/bugs/show_bug.cgi?id=27545) easy to workaround by removing an optimization, need to first check if it's fixed on llvm svn...

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

5 participants