Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

RFC: fix race condition on task failure #12736

Closed
wants to merge 1 commit into from
Closed

Conversation

JeffBezanson
Copy link
Sponsor Member

We tried adding this message. It was a worthwhile experiment and seemed like a good idea, even to me, but at this point I just can't take it any more.

The message is a race condition, because the order of task T running and another task calling wait(T) now matters, where it didn't before. This has led to the need for a SUPPRESS_EXCEPTION_PRINTING flag, which is just silly.

@vtjnash
Copy link
Sponsor Member

vtjnash commented Aug 21, 2015

ref #12403

@StefanKarpinski
Copy link
Sponsor Member

So this means that if a Task has an error and no one waits for it, it vanishes into the void?

@JeffBezanson
Copy link
Sponsor Member Author

Basically yes, but wait isn't the only way to detect the error. You can print the task itself (which will now even show a backtrace) or examine its status field.

@StefanKarpinski
Copy link
Sponsor Member

That seems fair. What about printing the error if the Task is GC:ed without being waited on and possibly upon process shut down if not?

@JeffBezanson
Copy link
Sponsor Member Author

With those other behaviors, we would still need the SUPPRESS_EXCEPTION_PRINTING flag.

Tying stuff to GC is no good, since it is wildly unpredictable.

FWIW, I believe erlang does the same thing, keeping process failures silent unless another process is watching.

@StefanKarpinski
Copy link
Sponsor Member

If Erlang does it that way, it's probably the right thing to do, but it does worry my letting things fail and fall into the void.

@ScottPJones
Copy link
Contributor

That's part of Erlang's whole model, I not sure it would be truly applicable to Julia. Silent failures give me the creeps.

@JeffBezanson
Copy link
Sponsor Member Author

I think of it in terms of general asynchronous programming. For example, what should

remotecall(2, error, "oops")

do? Indeed, it just sits there seemingly doing nothing until you try to fetch the answer. I don't recall many complaints about this.

@Keno
Copy link
Member

Keno commented Aug 23, 2015

I'm a little worried about uses cases such as:

@async while true
listen(...)
...
end

There should probably be a good way to print the error message. Maybe just a macro that wraps everything in try catch and calls the appropriate show_backtrace magic.

@JeffBezanson
Copy link
Sponsor Member Author

I think the best approach is something like

function monitor(t::Task)
    try
        wait(t)
    catch e
        # show exception info
    end
end

You want to use wait, since that even handles the case where a task gets an exception before it starts.

@ScottPJones
Copy link
Contributor

Could the starting process asynchronously get the message (in a thread?), and keep it until the main thread wait()'s on it?
That way a worker process could go away immediately after delivering it's final message, freeing up resources.

@JeffBezanson
Copy link
Sponsor Member Author

Yes that sounds possible, but it seems to only affect resource use and not when errors get printed.

@ScottPJones
Copy link
Contributor

I've had cases of filling up process tables or hitting OS limits on number of processes on different platforms, which is why that can be important

@amitmurthy
Copy link
Contributor

Agreeing with Keno's concern. The printing of errors is useful when using @schedule calls - where we do not really want to keep a reference to the started task, but would like any errors to be printed to screen (from a error reporting / debugging point of view).

@amitmurthy
Copy link
Contributor

I am OK with printing unhandled errors based on an an environment flag - say, JULIA_DEBUG=0/1, with the default being off, i.e., no printing. This addresses development time issues, where sometimes misspelling a variable name, or other syntactical errors can be discovered quickly. If you expect runtime errors, then you should wrap your async block in a try-catch and handle it properly.

@JeffBezanson
Copy link
Sponsor Member Author

Some kind of environment var would be ok I guess.

One reason I find this important is that it's actually not just a matter of whether errors are expected within a certain task. It's about order of events. For example sync_add needs to set SUPPRESS_EXCEPTION_PRINTING because we're spawning several tasks, then waiting on each. A task might fail in the middle of the process.

@StefanKarpinski
Copy link
Sponsor Member

Can't the way you suppress raising exceptions be to explicitly wait on a task and wrap the wait in try/catch? It seems fine to me to make it hard to ignore exceptions.

@JeffBezanson
Copy link
Sponsor Member Author

No, because a task usually starts before anybody calls wait. That's the race condition we have.

@vtjnash
Copy link
Sponsor Member

vtjnash commented Apr 26, 2021

Implemented in #39518

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants