
[Jobs] Stopping a Job with SIGTERM doesn't work as expected #31274

Closed
zcin opened this issue Dec 21, 2022 · 12 comments · Fixed by #31306
Labels
bug: Something that is supposed to be working, but isn't
core-clusters: For launching and managing Ray clusters/jobs/kubernetes
P1: Issue that should be fixed within a few weeks
triage: Needs triage (eg: priority, bug/not-bug, and owning component)

Comments

@zcin
Contributor

zcin commented Dec 21, 2022

What happened + What you expected to happen

The following is how we start a new Ray Job:

child_process = subprocess.Popen(
    self._entrypoint,
    shell=True,              # the entrypoint string is run through /bin/sh -c
    start_new_session=True,  # setsid(): the job gets its own session / process group
    stdout=logs_file,
    stderr=subprocess.STDOUT,
)

But because we use shell=True, this kicks off the following two processes (assuming _entrypoint="python script.py"):
[1] /bin/sh -c python script.py (child_process points to this process)
[2] python script.py

When we send SIGTERM to the process group that contains both processes 1 and 2, process 1 terminates, but process 2 (the actual job) may handle SIGTERM in a custom way and keep running. Because we only track the status of child_process (process 1), the job is assumed to be done and the actor is killed, at which point all processes are killed.
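A minimal standalone sketch of this behavior (not the actual job manager code; the entrypoint string and log file name are only for illustration, and it assumes a script.py like the reproduction script below):

import os
import signal
import subprocess
import time

entrypoint = "python script.py"           # stand-in for self._entrypoint
with open("job.log", "wb") as logs_file:  # stand-in for the job's log file
    child_process = subprocess.Popen(
        entrypoint,
        shell=True,              # process 1: /bin/sh -c "python script.py"
        start_new_session=True,  # new process group containing processes 1 and 2
        stdout=logs_file,
        stderr=subprocess.STDOUT,
    )
    time.sleep(1)  # give the shell time to spawn the actual job

    # Stop the job the way the job manager currently does: signal the whole group.
    os.killpg(os.getpgid(child_process.pid), signal.SIGTERM)

    # The shell (process 1) exits almost immediately, so this returns right away,
    # even though script.py (process 2) may still be running its SIGTERM handler.
    child_process.wait()
    print("shell exited; the actual job may still be alive")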

Versions / Dependencies

Linux

Reproduction script

You can try to submit the following script.py as a job:

import sys
import signal
import time

def handler(*args):
    start = time.time()
    while time.time() - start < 5:
        print("[s] handling SIGTERM signal ...", time.time() - start)
        time.sleep(0.1)
    sys.exit()

signal.signal(signal.SIGTERM, handler)
while True:
    print('[s] Waiting...')
    time.sleep(1)

When the job is stopped, the script should attempt to run its SIGTERM handler for 5 seconds but get forcefully killed at the 3-second timeout. Instead, the logs show that the job gracefully terminated by itself.
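One way to drive the reproduction, assuming a local cluster with the Jobs API available at the default dashboard address (the address, runtime_env, and sleeps below are illustrative):

import time
from ray.job_submission import JobSubmissionClient

client = JobSubmissionClient("http://127.0.0.1:8265")
job_id = client.submit_job(
    entrypoint="python script.py",
    runtime_env={"working_dir": "."},
)

time.sleep(5)            # let the job reach its "Waiting..." loop
client.stop_job(job_id)  # triggers the SIGTERM (and, after a timeout, SIGKILL) path

time.sleep(10)
# With the bug, the logs show the handler running for the full 5 seconds
# instead of the job being force-killed at the 3-second timeout.
print(client.get_job_status(job_id))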

Issue Severity

None

@zcin added the bug and triage labels and removed the triage label on Dec 21, 2022
@zcin
Contributor Author

zcin commented Dec 21, 2022

@architkulkarni @sihanwang41 @edoakes Opening this so it can be tracked. Two options discussed so far:

  1. Change shell=True to shell=False when starting a new Job. This would be a significant change in functionality, because Jobs are already established to accept any shell command you could type into your console.
  2. Terminate only process 2 when stopping the job. We can continue to track process 1, because it will exit when process 2 exits. This seems to be the preferred, safer way (see the sketch below).
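A rough sketch of what option 2 could look like using psutil, assuming the shell process has exactly one direct child as described above (child_process is the Popen handle for process 1):

import psutil

shell = psutil.Process(child_process.pid)  # process 1: /bin/sh -c ...

# With shell=True there should be exactly one direct child: the actual job.
children = shell.children()
assert len(children) == 1
job = children[0]                          # process 2: python script.py

# Send SIGTERM only to the actual job; the shell exits once the job exits,
# so we can keep tracking process 1 as before.
job.terminate()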

@architkulkarni
Contributor

architkulkarni commented Dec 21, 2022

I see, so we're currently effectively sending SIGTERM as a fire-and-forget and never sending SIGKILL?

Option 2 sounds good to me, but if we kill process 2 instead of the whole process group, do we need to worry about child processes of process 2 staying alive?

@rkooo567 added the triage label on Dec 21, 2022
@zcin
Contributor Author

zcin commented Dec 21, 2022

The actual job (process 2) never gets explicitly killed, until the actor itself exits.

That's a valid concern for option 2. Seems like before this we also didn't take care of child processes, because we just send SIGKILL to process 1. We could possibly send a SIGTERM signal to all the child processes of process 2, but in that case would it make sense to track all child processes here, instead of tracking only process 1? (By "track" I mean send SIGTERM, wait for the process to exit, and send SIGKILL after a timeout; a small sketch of that pattern is below.) Either way, it seems that when the actor exits, all processes are force killed.
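For reference, a minimal sketch of that "track" pattern for a single process, using psutil (the timeout value is illustrative):

import psutil

def track(proc: psutil.Process, timeout_s: float = 3.0) -> None:
    """Send SIGTERM, wait for the process to exit, and SIGKILL it on timeout."""
    proc.terminate()                  # SIGTERM
    try:
        proc.wait(timeout=timeout_s)  # wait for the process to exit
    except psutil.TimeoutExpired:
        proc.kill()                   # SIGKILL after the timeout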

@architkulkarni
Contributor

Seems like before this we also didn't take care of child processes, because we just send SIGKILL to process 1

Ah true, but I think before the SIGTERM PR we were sending SIGKILL to the entire process group.

it seems that when the actor exits, all processes are force killed.

That's surprising, any ideas how this happens?

Maybe the first question is what the user would expect the behavior to be. If a non-Ray user runs a command that starts some background child processes, then when the process for the command exits (naturally or by being killed) the child processes won't die, and this is expected. In that case, maybe we should have the same behavior in Ray Jobs, where we only kill the entrypoint process and don't take care of child processes from the user's entrypoint command. @zcin @edoakes any thoughts on this? Would this scenario happen in the real world?

If we do need to kill all child processes, then if you know a way to track all of them (or track the "process group" somehow), then I think that would be the right thing to do.

@zcin
Contributor Author

zcin commented Dec 22, 2022

it seems that when the actor exits, all processes are force killed.

That's surprising, any ideas how this happens?

Hmm, not sure. If this is surprising I should probably double-check; it was what I observed across many tests, so I had just assumed it was intentional.

@zcin
Contributor Author

zcin commented Dec 22, 2022

Ah, we send SIGKILL to the entire process group when the actor exits.

@zcin
Contributor Author

zcin commented Dec 22, 2022

I experimented, and we can use psutil to track process 2 instead of process 1 (since there should be exactly one child of process 1). Then the question remains whether we want to send SIGTERM to the whole process group, and if so, whether we want to track all the child processes before sending SIGKILL.

@sihanwang41 added the P1 and core-clusters labels on Dec 22, 2022
@sihanwang41
Contributor

Hi @zcin, option 2 sounds good to me. We can have a timeout for all the child processes; if the timeout is hit, we send SIGKILL and kill all processes in the same group.

@architkulkarni
Contributor

Ah, we send SIGKILL to the entire process group when the actor exits.

Ah nice, at first I thought it was some generic thing for all Ray actors, which would have been surprising. This makes more sense.

Maybe we can make the decision by just preserving the original behavior as much as possible (where we SIGKILL the entire group). If we see a strong need (like a user request for the background-processes scenario), then we can consider changing it.

So to summarize what you/Sihan have come up with (let me know if this is off):

  1. Send SIGTERM to all the child processes.
  2. Track all of them; if any are still alive after the timeout, SIGKILL the group.

If this turns out to be hard or complicated, maybe Step 2 could be replaced by "Track process 2 and if it's still alive, SIGKILL the group", and it would be the user's responsibility to make sure child processes of process 2 exit whenever process 2 exits.
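Roughly, that combined approach could look like the following psutil-based sketch (the function name and the timeout are illustrative, not the actual implementation):

import os
import signal
import psutil

def stop_job(shell_pid: int, sigterm_timeout_s: float = 3.0) -> None:
    """SIGTERM the shell and all of its descendants; SIGKILL the group on timeout."""
    shell = psutil.Process(shell_pid)                 # process 1: /bin/sh -c ...
    procs = [shell] + shell.children(recursive=True)  # process 2 and any grandchildren
    pgid = os.getpgid(shell_pid)                      # group from start_new_session=True

    # 1. Send SIGTERM to all of the processes.
    for proc in procs:
        try:
            proc.terminate()
        except psutil.NoSuchProcess:
            pass

    # 2. Track all of them; if any are still alive after the timeout, SIGKILL the group.
    _, alive = psutil.wait_procs(procs, timeout=sigterm_timeout_s)
    if alive:
        os.killpg(pgid, signal.SIGKILL)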

@zcin
Contributor Author

zcin commented Dec 22, 2022

I'm wondering whether we even want to execute Step 1 at all: is it possible that the user's job is itself tracking its child processes, and upon receiving a SIGTERM signal might want to terminate those children in a custom way? Do we want to give that option to the user?

@architkulkarni
Contributor

That's a good point, I'm not sure (@edoakes @sihanwang41 thoughts?)

Two weak points in favor of signaling the whole process group:

  • From searching around, it seems like sending SIGTERM to a process group is a common pattern.
  • We haven't received any negative user feedback about the original approach (killing the whole process group).

My vague impression is that this is a corner case which we should be able to change in the future without too much trouble if we get feedback on it.

@rkooo567
Contributor

#31274 (comment)

This should be the standard way for all terminations. It is also how ray stop works now.

Also, I think sending signals to the group seems more useful in general. If there's a compelling use case for managing child processes themselves, we should support it with a flag, but I think defaulting to that option seems more error-prone for the many users who don't know about this.

edoakes pushed a commit that referenced this issue Jan 5, 2023
Resolves the issue described in #31274. On Linux systems, when a stop signal is sent, instead of killing + waiting on only the shell process (which starts the actual job as a child process), we want to kill all the children of the shell process along with the shell process itself, and poll all processes until they exit or send a force SIGKILL on timeout. (This change is compatible with Mac OSX systems as well)

Co-authored-by: shrekris-anyscale <[email protected]>
AmeerHajAli pushed a commit that referenced this issue Jan 12, 2023
tamohannes pushed a commit to ju2ez/ray that referenced this issue Jan 25, 2023