
fix(workers): Make worker.terminate() not immediately kill the isolate #12831

Merged
merged 10 commits on Nov 29, 2021

Conversation

andreubotella
Contributor

Due to a bug in V8, terminating an isolate while a module with top-level await is being evaluated would crash the process. This change makes it so calling worker.terminate() will signal the worker to terminate at the next iteration of the event loop, and it schedules a proper termination of the worker's isolate after 2 seconds.

Closes #12658
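The two-phase contract described above can be sketched with illustrative names (these are stand-ins, not Deno's actual API): `terminate()` first sets a cooperative signal flag that the worker's event loop checks on every turn, and only a later watchdog performs the hard isolate kill.

```rust
use std::sync::atomic::{AtomicBool, Ordering};

// Hypothetical sketch of the two-phase termination this PR introduces.
// Field and method names are illustrative only.
#[derive(Default)]
pub struct WorkerHandle {
    pub termination_signal: AtomicBool, // "please stop at the next turn"
    pub isolate_terminated: AtomicBool, // "the isolate was hard-killed"
}

impl WorkerHandle {
    // Phase 1: cooperative signal, observed at the next event-loop turn.
    pub fn signal_termination(&self) {
        self.termination_signal.store(true, Ordering::SeqCst);
    }

    // Polled by the worker's event loop on each iteration.
    pub fn should_exit(&self) -> bool {
        self.termination_signal.load(Ordering::SeqCst)
    }

    // Phase 2: hard kill, run by a watchdog after the 2-second grace
    // period (stand-in for terminating the V8 isolate).
    pub fn terminate_isolate(&self) {
        self.isolate_terminated.store(true, Ordering::SeqCst);
    }
}
```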

Andreu Botella added 3 commits November 21, 2021 22:55
Due to a bug in V8, terminating an isolate while a module with top-level
await is being evaluated would crash the process. This change makes it
so calling `worker.terminate()` will signal the worker to terminate at
the next iteration of the event loop, and it schedules a proper
termination of the worker's isolate after 2 seconds.

Closes denoland#12658
@bartlomieju bartlomieju added this to the 1.17.0 milestone Nov 21, 2021
Member

@bartlomieju bartlomieju left a comment


I think this is a sensible solution, especially if it fixes the problem at hand. @lucacasonato @bnoordhuis PTAL

@andreubotella
Contributor Author

`cargo test` was not failing on my machine, but it was failing on CI before 94bc4e4, so I'd like to run CI a couple more times to be confident that I'm not introducing flakes.

Contributor

@bnoordhuis bnoordhuis left a comment


How pressing is this bug? My natural inclination is to wait for the upstream fix (the V8 bug had some movement earlier today) rather than try to work around it.

Comment on lines +148 to 157
pub fn terminate_if_needed(&mut self) -> bool {
  let has_terminated = self.is_terminated();

  if !has_terminated && self.termination_signal.load(Ordering::SeqCst) {
    self.terminate();
    return true;
  }

  has_terminated
}
Contributor


This method can end up calling self.terminate() repeatedly, can't it? That seems bad.

Contributor Author


terminate_if_needed() will call self.terminate() at most once per invocation – there are no loops or recursion in this method. And if self.terminate() has previously been called, self.is_terminated() will return true, so self.terminate() won't be called again.

Contributor


Sorry, let me clarify: calling terminate_if_needed() twice can result in two calls to terminate().

That's okay if terminate() is idempotent, but I don't think it is – and even if it is, that's an implicit contract that's easy to miss when making changes later.

Contributor Author


self.terminate() will set self.has_terminated to true, which will keep any subsequent call to self.terminate_if_needed() from calling self.terminate() again.
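The guard being described can be sketched in isolation (illustrative names, modeled on the snippet under review; the call counter is added here only to make the at-most-once behavior observable):

```rust
use std::sync::atomic::{AtomicBool, AtomicUsize, Ordering};

// Sketch of the has_terminated guard: terminate() records that it ran,
// so a later terminate_if_needed() sees is_terminated() == true and
// never calls terminate() a second time.
#[derive(Default)]
pub struct Guard {
    pub termination_signal: AtomicBool,
    pub has_terminated: AtomicBool,
    pub terminate_calls: AtomicUsize, // instrumentation for this sketch
}

impl Guard {
    pub fn signal(&self) {
        self.termination_signal.store(true, Ordering::SeqCst);
    }

    fn is_terminated(&self) -> bool {
        self.has_terminated.load(Ordering::SeqCst)
    }

    fn terminate(&self) {
        self.terminate_calls.fetch_add(1, Ordering::SeqCst);
        self.has_terminated.store(true, Ordering::SeqCst);
    }

    pub fn terminate_if_needed(&self) -> bool {
        let has_terminated = self.is_terminated();
        if !has_terminated && self.termination_signal.load(Ordering::SeqCst) {
            self.terminate();
            return true;
        }
        has_terminated
    }
}
```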

Contributor


That's the implicit contract I mean. Something a little less prone to snafus would be appreciated.

A rule of thumb I follow is this: whenever more than one atomic variable pops up, try really hard to replace it with a mutex and a proper critical section.
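A minimal sketch of that suggestion, assuming the two atomic flags are folded into one Mutex-guarded struct so that checking and updating them is a single critical section (names are illustrative, not Deno's API):

```rust
use std::sync::Mutex;

// Both pieces of termination state live behind one lock, so the
// check-then-terminate sequence cannot interleave with another caller.
#[derive(Default)]
struct TermState {
    requested: bool,  // was termination signaled?
    terminated: bool, // has terminate() already run?
}

pub struct Handle {
    state: Mutex<TermState>,
}

impl Handle {
    pub fn new() -> Self {
        Handle { state: Mutex::new(TermState::default()) }
    }

    pub fn request_termination(&self) {
        self.state.lock().unwrap().requested = true;
    }

    // Returns true if the worker is (now) terminated; the guarded
    // section guarantees terminate() runs at most once overall.
    pub fn terminate_if_needed(&self) -> bool {
        let mut s = self.state.lock().unwrap();
        if !s.terminated && s.requested {
            s.terminated = true; // the single, guarded terminate() site
            return true;
        }
        s.terminated
    }
}
```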

@andreubotella
Contributor Author

How pressing is this bug? My natural inclination is to wait for the upstream fix (the V8 bug had some movement earlier today) rather than try to work around it.

I guess it's not too pressing. But this isn't the only bug related to isolate termination that we've had (see #12263), and considering that this PR changes worker termination to work the same as Chrome, I'd say this might be better in the long run.

Member

@bartlomieju bartlomieju left a comment


LGTM

I discussed this PR offline with @bnoordhuis, who's not too keen on landing it, but since it fixes an actual bug we agreed to merge it.

@bartlomieju bartlomieju merged commit 4a13c32 into denoland:main Nov 29, 2021
bartlomieju pushed a commit to bartlomieju/deno that referenced this pull request Nov 29, 2021
…ate (denoland#12831)

@andreubotella andreubotella deleted the schedule-termination branch November 29, 2021 13:03
bnoordhuis pushed a commit that referenced this pull request Nov 29, 2021
…ate (#12831)


Co-authored-by: Andreu Botella <[email protected]>
@ry
Member

ry commented Dec 1, 2021

This PR resulted in many more threads being used in the "workers_large_message" benchmark:

[Screenshot: thread-count graph for the workers_large_message benchmark, Dec 1, 2021]

@andreubotella
Contributor Author

I decided to spawn a thread rather than start a sleep future in the tokio event loop, in case the host was terminated in the meantime; but if the increase in threads is a problem, we can revisit this.
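A sketch of that design choice with illustrative names: a dedicated OS thread sleeps through the grace period and hard-kills the isolate only if the event loop hasn't already exited, so the timer keeps working even if the host's async runtime is torn down in the meantime.

```rust
use std::sync::atomic::{AtomicBool, Ordering};
use std::sync::Arc;
use std::thread;
use std::time::Duration;

// Hypothetical watchdog: sleep for the grace period, then terminate
// the isolate only if the worker's event loop is still running.
// `isolate_killed` is a stand-in for the actual V8 isolate kill.
pub fn spawn_termination_watchdog(
    event_loop_done: Arc<AtomicBool>,
    isolate_killed: Arc<AtomicBool>,
    grace: Duration,
) -> thread::JoinHandle<()> {
    thread::spawn(move || {
        thread::sleep(grace);
        if !event_loop_done.load(Ordering::SeqCst) {
            isolate_killed.store(true, Ordering::SeqCst);
        }
    })
}
```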

andreubotella added a commit to andreubotella/deno that referenced this pull request Mar 14, 2022
Calling `worker.terminate()` used to kill the worker's isolate and
then block until the worker's thread finished. This blocks the calling
thread if the worker's event loop was blocked in a sync op (as with
`Deno.sleepSync`), which wasn't realized at the time, but since the
worker's isolate was killed at that moment, it would not block the
calling thread if the worker was in a JS endless loop.

However, in denoland#12831, in order to work around a V8 bug, worker
termination was changed to first set a signal to let the worker event
loop know that termination has been requested, and only kill the
isolate if the event loop has not finished after 2 seconds. However,
this change kept the blocking, which meant that JS endless loops in
the worker now blocked the parent for 2 seconds.

As it turns out, after denoland#12831 it is fine to signal termination and
even kill the worker's isolate without waiting for the thread to
finish, so this change does that. However, that might leave the async
ops that receive messages and control data from the worker pending
after `worker.terminate()`, which leads to odd results from the op
sanitizer. Therefore, we set up a `CancelHandler` to cancel those ops
when the worker is terminated.

Fixes denoland#13705.
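A rough std-only analogy for why that cancellation is needed: once terminate() no longer waits for the worker thread, the parent's pending message-receive op would otherwise hang (and trip the op sanitizer). Here, dropping a channel sender plays the role the commit describes for the CancelHandler, resolving the pending receive with an error instead of leaving it pending.

```rust
use std::sync::mpsc;
use std::thread;

// Sketch: the parent holds a receiver for worker messages. Tearing
// down the worker side (without joining its thread) makes the pending
// receive complete with an error rather than hang forever.
pub fn terminate_cancels_pending_recv() -> bool {
    let (tx, rx) = mpsc::channel::<Vec<u8>>();
    // terminate(): drop the worker-side sender from another thread.
    let t = thread::spawn(move || drop(tx));
    t.join().unwrap();
    // The pending "op" now resolves as cancelled instead of blocking.
    rx.recv().is_err()
}
```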
bartlomieju pushed a commit that referenced this pull request Apr 27, 2022
…13941)

crowlKats pushed a commit that referenced this pull request Apr 28, 2022
…13941)

Successfully merging this pull request may close these issues.

worker core dumped panic