add a GC safepoint in Task.wait #41441

Merged

Conversation

@vilterp (Contributor) commented on Jul 1, 2021

This gives GC a chance to run during a task switch,
in addition to being triggered by allocations.

Without this, non-allocating tasks may fail to stop for
a long time, even if they're doing IO or calling yield
explicitly, preventing us from stopping the world for GC.

Fixes #40972 (I think?)

cc @NHDaly @dewilson @vchuravy
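
To illustrate the failure mode described above, here is a hypothetical minimal sketch (not the exact program from #40972; it assumes at least two threads, e.g. `julia -t 2`). A task that spins without allocating never reaches a GC safepoint on its own, so a collection requested from another thread can stall waiting for it:

# Hypothetical sketch only, not the repro from the linked issue.
function spin()
    while true
        yield()   # switches tasks, but (before this PR) this path had no GC safepoint
    end
end

t = Threads.@spawn spin()
sleep(1)
GC.gc()   # must stop the world; can stall while the spinning task never hits a safepoint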

base/task.jl
@@ -818,6 +818,7 @@ end
 end
 
 function wait()
+    GC.safepoint()
@vilterp (Contributor, Author) commented on Jul 1, 2021

Not sure if this is the right place to put this, but it prevents the shell from hanging when we paste in the simple program from the linked issue.

Let me know if somewhere else in this function, or another function entirely, would be better!
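
For context, a rough sketch of where the new safepoint sits. This shows only the shape of the change, not the real implementation; the zero-argument Base.wait in base/task.jl is the scheduler entry point, and everything other than GC.safepoint() is summarized as comments:

function wait()
    GC.safepoint()   # line added by this PR: a pending stop-the-world GC can stop this thread here
    # ... pop the next runnable task off this thread's work queue ...
    # ... switch to it and process libuv events (which, on master, has its own safepoint; see below) ...
end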

Sponsor Member:

There is an unconditional safepoint in process_events() after the yield, but I suppose if the code is not returning to an existing Task, we would not reach that?

Member:

Oh, interesting, good to know! I see it now, here:

jl_gc_safepoint_(ct->ptls);

Was that added after Julia 1.6? We were experiencing some long pauses in prod on 1.6, which is what got us looking into this. But seems like it's already been addressed on master?

In any case, as you say, it seems like pausing for GC before task yielding would give slightly more opportunities for GC to run, but I agree that it doesn't seem much different.

@NHDaly added the GC (garbage collector) and multithreading (Base.Threads and related functionality) labels on Jul 1, 2021
@NHDaly (Member) commented on Jul 1, 2021

The example problem we ran from #40972 hangs on Julia 1.6.

We tried running it on master, and it seems like it doesn't hang at first, but then it locks my terminal after a few seconds, even to the point where it can't be interrupted with ctrl-z.

After this PR, it seems to work correctly again.


To add some more context, my understanding is that yield() used to be a GC yield point as well, because there were incidental allocations in its implementation. But as it's been optimized over time, the allocations have been removed, and so it no longer contains any GC yields. This PR attempts to add back an explicit GC yield-point to Task yield, to prevent the situation where well-tuned, non-allocating code can run for a long time and starve GC.

Since Task yield is a relatively expensive operation compared to gc_safepoint, this seems like a safe place to inject it.

@vtjnash does that seem reasonable to you as well? :) Thanks!
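
As an aside, one possible workaround on releases that don't have this change yet (e.g. 1.6) is for long-running, non-allocating loops to call GC.safepoint() themselves. A minimal sketch (the function and the check interval here are arbitrary illustrations, not something from this PR):

function busy_work(n)
    acc = 0
    for i in 1:n
        acc += i                              # non-allocating work
        i % 10_000 == 0 && GC.safepoint()     # periodically let a pending GC stop this thread
    end
    return acc
end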

@JeffBezanson (Sponsor Member) commented:

> We tried running it on master, and it seems like it doesn't hang at first, but then it locks my terminal after a few seconds, even to the point where it can't be interrupted with ctrl-z.

Interesting; I can reproduce the problem on 1.6 but not this behavior.

@JeffBezanson (Sponsor Member) commented:

This is very cheap though so might as well merge it.

@JeffBezanson merged commit 97f817a into JuliaLang:master on Jul 7, 2021

@NHDaly (Member) left a comment

> This is very cheap though so might as well merge it.

Oh wow, thanks @JeffBezanson! Sorry for my lack of reply; I was out for the long holiday weekend. I'm leaving some comments inline.


> We tried running it on master, and it seems like it doesn't hang at first, but then it locks my terminal after a few seconds, even to the point where it can't be interrupted with ctrl-z.

> Interesting; I can reproduce the problem on 1.6 but not this behavior.

Yeah, weird, right? The program @vilterp posted in #40972 (comment) consistently locks my REPL after a few more commands, every time. I had julia built on the following commit: ed4c44f.

I just rebuilt at the commit right before this PR, 480ff81, and it doesn't hang there. So I don't think it was related to the changes in this PR.

But anyway, thanks for the merge!


@NHDaly (Member) commented on Jul 8, 2021

> @JeffBezanson added the backport 1.7 label

Also, unclear if this qualifies as a "bug fix", or whatever the criteria are for backporting, but we've backported this to 1.6 in our production builds and it applied cleanly, so we could consider a backport 1.6 label as well if you think that makes sense?

@KristofferC added the backport 1.6 (change should be backported to release-1.6) label on Jul 8, 2021
@Sacha0 (Member) commented on Jul 8, 2021

> Was that added after Julia 1.6? We were experiencing some long pauses in prod on 1.6, which is what got us looking into this. But seems like it's already been addressed on master?

Git blame suggests that the jl_uv.c:210 safepoint was added in the pull request that enabled task migration, i.e. after 1.6 branched but happily before 1.7 branched. Testing on 1.7-beta3 might be worthwhile? :)

@NHDaly (Member) commented on Jul 8, 2021

Good suggestion, @Sacha0. Cool, so:

- I just checked on v1.7-beta2, and I consistently experienced the same "hangs after a few more entries into the REPL" problem.
- But then I updated to v1.7-beta3, and that problem is gone; I tried 5 times and never experienced it.

So whatever it was that fixed it on master has also fixed it on v1.7-beta3. :) And it indeed wasn't related to this PR after all.

EDIT: And of course, as you pointed out, that gc_safepoint line is already in there in 1.7, so in neither beta2 nor beta3 did I experience the original hang reported in #40972. 👍

KristofferC pushed a commit that referenced this pull request Jul 12, 2021
@KristofferC mentioned this pull request Jul 12, 2021
KristofferC pushed a commit that referenced this pull request Jul 19, 2021
KristofferC pushed a commit that referenced this pull request Jul 19, 2021
KristofferC pushed a commit that referenced this pull request Aug 31, 2021
@KristofferC removed the backport 1.6 (change should be backported to release-1.6) label on Sep 7, 2021

@vchuravy deleted the vilterp/gc-safepoint-in-wait branch on September 23, 2021
staticfloat pushed a commit that referenced this pull request Dec 23, 2022
Labels: GC (garbage collector), multithreading (Base.Threads and related functionality)

Successfully merging this pull request may close these issues: Deadlock in Profile on MacOS with multithreaded program