
[Jobs] Stopping a Job with SIGTERM doesn't work as expected #31274

Closed
zcin opened this issue Dec 21, 2022 · 12 comments · Fixed by #31306
Labels
bug: Something that is supposed to be working, but isn't
core-clusters: For launching and managing Ray clusters/jobs/kubernetes
P1: Issue that should be fixed within a few weeks
triage: Needs triage (eg: priority, bug/not-bug, and owning component)

Comments

@zcin
Contributor

zcin commented Dec 21, 2022

What happened + What you expected to happen

The following is how we start a new Ray Job:

child_process = subprocess.Popen(
    self._entrypoint,
    shell=True,              # the entrypoint string is run through /bin/sh -c
    start_new_session=True,  # setsid(): the job gets its own session / process group
    stdout=logs_file,
    stderr=subprocess.STDOUT,
)

But because we use shell=True, this kicks off the following two processes (assuming _entrypoint="python script.py"):
[1] /bin/sh -c python script.py (child_process points to this process)
[2] python script.py

When we send SIGTERM to the process group that contains both processes 1 and 2, process 1 terminates, but process 2 (the actual job) may handle SIGTERM in a custom way and keep running. Because we only track the status of child_process (process 1), the job is assumed to be done and the actor is killed, at which point all processes are killed.
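A minimal standalone sketch of this behavior (not the actual job manager code; the entrypoint string and log file name are only for illustration, and it assumes a script.py like the reproduction script below):

import os
import signal
import subprocess
import time

entrypoint = "python script.py"           # stand-in for self._entrypoint
with open("job.log", "wb") as logs_file:  # stand-in for the job's log file
    child_process = subprocess.Popen(
        entrypoint,
        shell=True,              # process 1: /bin/sh -c "python script.py"
        start_new_session=True,  # new process group containing processes 1 and 2
        stdout=logs_file,
        stderr=subprocess.STDOUT,
    )
    time.sleep(1)  # give the shell time to spawn the actual job

    # Stop the job the way the job manager currently does: signal the whole group.
    os.killpg(os.getpgid(child_process.pid), signal.SIGTERM)

    # The shell (process 1) exits almost immediately, so this returns right away,
    # even though script.py (process 2) may still be running its SIGTERM handler.
    child_process.wait()
    print("shell exited; the actual job may still be alive")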

Versions / Dependencies

Linux

Reproduction script

You can try to submit the following script.py as a job:

import sys
import signal
import time

def handler(*args):
    start = time.time()
    while time.time() - start < 5:
        print("[s] handling SIGTERM signal ...", time.time() - start)
        time.sleep(0.1)
    sys.exit()

signal.signal(signal.SIGTERM, handler)
while True:
    print('[s] Waiting...')
    time.sleep(1)

When the job is stopped, the script should attempt to run its SIGTERM handler for 5 seconds but get forcefully killed at the 3-second timeout. Instead, the logs show that the job gracefully terminated by itself.
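One way to drive the reproduction, assuming a local cluster with the Jobs API available at the default dashboard address (the address, runtime_env, and sleeps below are illustrative):

import time
from ray.job_submission import JobSubmissionClient

client = JobSubmissionClient("http://127.0.0.1:8265")
job_id = client.submit_job(
    entrypoint="python script.py",
    runtime_env={"working_dir": "."},
)

time.sleep(5)            # let the job reach its "Waiting..." loop
client.stop_job(job_id)  # triggers the SIGTERM (and, after a timeout, SIGKILL) path

time.sleep(10)
# With the bug, the logs show the handler running for the full 5 seconds
# instead of the job being force-killed at the 3-second timeout.
print(client.get_job_status(job_id))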

Issue Severity

None

@zcin added the bug and triage labels and removed the triage label on Dec 21, 2022
@zcin
Contributor Author

zcin commented Dec 21, 2022

@architkulkarni @sihanwang41 @edoakes Opening this so it can be tracked. Two options discussed so far:

  1. Change shell=True to shell=False when starting a new Job. This would be a significant change in functionality, because Jobs are already established to accept any shell command you could type into your console.
  2. Terminate only process 2 when stopping the job. We can continue to track process 1, because it will exit when process 2 exits. This seems to be the preferred, safer way (see the sketch below).
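A rough sketch of what option 2 could look like using psutil, assuming the shell process has exactly one direct child as described above (child_process is the Popen handle for process 1):

import psutil

shell = psutil.Process(child_process.pid)  # process 1: /bin/sh -c ...

# With shell=True there should be exactly one direct child: the actual job.
children = shell.children()
assert len(children) == 1
job = children[0]                          # process 2: python script.py

# Send SIGTERM only to the actual job; the shell exits once the job exits,
# so we can keep tracking process 1 as before.
job.terminate()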

@architkulkarni
Contributor

architkulkarni commented Dec 21, 2022

I see, so we're currently effectively sending SIGTERM as a fire-and-forget and never sending SIGKILL?

Option 2 sounds good to me, but if we kill process 2 instead of the whole process group, do we need to worry about child processes of process 2 staying alive?

@rkooo567 added the triage label on Dec 21, 2022
@zcin
Contributor Author

zcin commented Dec 21, 2022

The actual job (process 2) never gets explicitly killed, until the actor itself exits.

That's a valid concern for option 2. Seems like before this we also didn't take care of child processes, because we just send SIGKILL to process 1. We could possibly send a SIGTERM signal to all the child processes of process 2, but in that case would it make sense to track all child processes here, instead of tracking only process 1? (By "track" I mean send SIGTERM, wait for the process to exit, and send SIGKILL after a timeout; a small sketch of that pattern is below.) Either way, it seems that when the actor exits, all processes are force killed.
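For reference, a minimal sketch of that "track" pattern for a single process, using psutil (the timeout value is illustrative):

import psutil

def track(proc: psutil.Process, timeout_s: float = 3.0) -> None:
    """Send SIGTERM, wait for the process to exit, and SIGKILL it on timeout."""
    proc.terminate()                  # SIGTERM
    try:
        proc.wait(timeout=timeout_s)  # wait for the process to exit
    except psutil.TimeoutExpired:
        proc.kill()                   # SIGKILL after the timeout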

@architkulkarni
Contributor

Seems like before this we also didn't take care of child processes, because we just send SIGKILL to process 1

Ah true, but I think before the SIGTERM PR we were sending SIGKILL to the entire process group.

it seems that when the actor exits, all processes are force killed.

That's surprising, any ideas how this happens?

Maybe the first question is what the user would expect the behavior to be. If a non-Ray user runs a command that starts some background child processes, then when the process for the command exits (naturally or by being killed) the child processes won't die, and this is expected. In that case, maybe we should have the same behavior in Ray Jobs, where we only kill the entrypoint process and don't take care of child processes from the user's entrypoint command. @zcin @edoakes any thoughts on this? Would this scenario happen in the real world?

If we do need to kill all child processes, then if you know a way to track all of them (or track the "process group" somehow), then I think that would be the right thing to do.

@zcin
Contributor Author

zcin commented Dec 22, 2022

it seems that when the actor exits, all processes are force killed.

That's surprising, any ideas how this happens?

Hmm, not sure. If this is surprising I should probably double-check; it was what I observed across many tests, so I had just assumed it was intentional.

@zcin
Contributor Author

zcin commented Dec 22, 2022

Ah, we send SIGKILL to the entire process group when the actor exits.

@zcin
Contributor Author

zcin commented Dec 22, 2022

I experimented, and we can use psutil to track process 2 instead of process 1 (since there should be exactly one child of process 1). Then the question remains whether we want to send SIGTERM to the whole process group, and if so, whether we want to track all the child processes before sending SIGKILL.

@sihanwang41 added the P1 and core-clusters labels on Dec 22, 2022
@sihanwang41
Contributor

Hi @zcin, option 2 sounds good to me. We can have a timeout for all the child processes; if the timeout is hit, we send SIGKILL and kill all processes in the same group.

@architkulkarni
Contributor

Ah, we send SIGKILL to the entire process group when the actor exits.

Ah nice, at first I thought it was some generic thing for all Ray actors, which would have been surprising. This makes more sense.

Maybe we can make the decision by just preserving the original behavior as much as possible (where we SIGKILL the entire group). If we see a strong need (like a user request for the background-processes scenario), then we can consider changing it.

So to summarize what you/Sihan have come up with (let me know if this is off):

  1. Send SIGTERM to all the child processes.
  2. Track all of them; if any are still alive after the timeout, SIGKILL the group.

If this turns out to be hard or complicated, maybe Step 2 could be replaced by "Track process 2 and if it's still alive, SIGKILL the group", and it would be the user's responsibility to make sure child processes of process 2 exit whenever process 2 exits.
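Roughly, that combined approach could look like the following psutil-based sketch (the function name and the timeout are illustrative, not the actual implementation):

import os
import signal
import psutil

def stop_job(shell_pid: int, sigterm_timeout_s: float = 3.0) -> None:
    """SIGTERM the shell and all of its descendants; SIGKILL the group on timeout."""
    shell = psutil.Process(shell_pid)                 # process 1: /bin/sh -c ...
    procs = [shell] + shell.children(recursive=True)  # process 2 and any grandchildren
    pgid = os.getpgid(shell_pid)                      # group from start_new_session=True

    # 1. Send SIGTERM to all of the processes.
    for proc in procs:
        try:
            proc.terminate()
        except psutil.NoSuchProcess:
            pass

    # 2. Track all of them; if any are still alive after the timeout, SIGKILL the group.
    _, alive = psutil.wait_procs(procs, timeout=sigterm_timeout_s)
    if alive:
        os.killpg(pgid, signal.SIGKILL)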

@zcin
Contributor Author

zcin commented Dec 22, 2022

I'm wondering whether we even want to execute Step 1 at all: is it possible that the user's job is itself tracking its child processes, and upon receiving a SIGTERM signal might want to terminate those children in a custom way? Do we want to give that option to the user?

@architkulkarni
Contributor

That's a good point, I'm not sure (@edoakes @sihanwang41 thoughts?)

Two weak points in favor of signaling the whole process group:

  • From searching around, it seems like sending SIGTERM to a process group is a common pattern.
  • We haven't received any negative user feedback about the original approach (killing the whole process group).

My vague impression is that this is a corner case which we should be able to change in the future without too much trouble if we get feedback on it.

@rkooo567
Contributor

#31274 (comment)

This should be the standard way for all terminations. It is also how ray stop works now.

Also, I think sending signals to the group seems more useful in general. If there's a compelling use case for managing child processes themselves, we should support it with a flag, but I think defaulting to that option seems more error-prone for the many users who don't know about this.

edoakes pushed a commit that referenced this issue Jan 5, 2023
Resolves the issue described in #31274. On Linux systems, when a stop signal is sent, instead of killing + waiting on only the shell process (which starts the actual job as a child process), we want to kill all the children of the shell process along with the shell process itself, and poll all processes until they exit or send a force SIGKILL on timeout. (This change is compatible with Mac OSX systems as well)

Co-authored-by: shrekris-anyscale <[email protected]>
AmeerHajAli pushed a commit that referenced this issue Jan 12, 2023
tamohannes pushed a commit to ju2ez/ray that referenced this issue Jan 25, 2023