
stdlib: supervisor unlink_flush improvement #8780

Merged
merged 2 commits from kuba/stdlib/noproc_shutdown_supervisor_fix2 into erlang:maint
Sep 28, 2024

Conversation

Contributor

@u3s u3s commented Sep 5, 2024

  • handle the race between an orderly shutdown due to supervisor termination and the worker terminating on its own
  • make sure to report the correct error reason if possible

@u3s u3s added the team:PS Assigned to OTP team PS label Sep 5, 2024
Contributor

github-actions bot commented Sep 5, 2024

CT Test Results

2 files, 95 suites, 56m 57s ⏱️
2 153 tests: 2 105 ✅, 48 💤, 0 ❌
2 512 runs: 2 462 ✅, 50 💤, 0 ❌

Results for commit 1dd099d.

♻️ This comment has been updated with latest results.

To speed up review, make sure that you have read Contributing to Erlang/OTP and that all checks pass.

See the TESTING and DEVELOPMENT HowTo guides for details about how to run tests locally.

Artifacts

// Erlang/OTP Github Action Bot

@u3s u3s self-assigned this Sep 5, 2024
@u3s u3s added testing currently being tested, tag is used by OTP internal CI and removed testing currently being tested, tag is used by OTP internal CI labels Sep 5, 2024
@u3s u3s force-pushed the kuba/stdlib/noproc_shutdown_supervisor_fix2 branch from 2ceed3a to 684f167 Compare September 6, 2024 08:54
after 0 ->
NotExitReason
end
end;
unlink_flush(Pid, DefaultReason) ->
Contributor

I think we need to do something like this, as we want to change noproc = Reason into a normal case, since that is the race.

unlink_flush(Pid, Reason) ->
    %% We call unlink in order to guarantee that the 'EXIT' has arrived
    %% from the dead process. See the unlink docs for details.
    {links, Links} = process_info(self(), links),
    case lists:member(Pid, Links) of
        true ->
            do_unlink_flush(Pid, Reason);
        false  -> %% Naughty child
            flush(Pid, Reason)
    end.

do_unlink_flush(Pid, noproc = Reason) ->
    unlink(Pid),
    receive
        {'EXIT', Pid, Reason} ->
            normal
    end;
do_unlink_flush(Pid, Reason) ->
    unlink(Pid),
    receive
        {'EXIT', Pid, Reason} ->
            Reason
    end.

flush(Pid, NoExit) ->
    receive
        {'EXIT', Pid, Reason} ->
            Reason
    after 0 ->
            NoExit
    end.

@u3s u3s force-pushed the kuba/stdlib/noproc_shutdown_supervisor_fix2 branch from 684f167 to 06738dd Compare September 6, 2024 09:37
@Maria-12648430
Contributor

@u3s @IngelaAndin since I was involved in bringing up the current unlink_flush etc, could you please explain to me what this is about? Because TBH, I really don't get it 😅

@IngelaAndin
Contributor

@Maria-12648430 This is about a race condition when the application is being stopped: supervisors are shutting down their children while children might be terminating on their own at the same time. The supervisor also tries to avoid hanging if a child process misbehaves and unlinks itself. So when a supervisor is terminating a child and gets a 'DOWN' monitor message for the child with reason noproc, because the child terminated normally just before, we can get an error report saying that the child could not be shut down as it did not exist, since the exit message has not yet arrived when the hang prevention kicks in. We think that should not happen.
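
For illustration, here is a minimal standalone sketch (not the actual supervisor.erl code; the module and function names are made up) of how the noproc race can arise when the child exits normally just before the supervisor starts its monitor-and-exit shutdown sequence:

-module(noproc_race_sketch).
-export([demo/0]).

%% The child terminates normally on its own; the supervisor-style
%% shutdown that follows then sets up a monitor on an already dead
%% process, so the 'DOWN' arrives with reason noproc even though the
%% linked 'EXIT' with reason normal is (or soon will be) in the mailbox.
demo() ->
    process_flag(trap_exit, true),
    Child = spawn_link(fun() -> ok end),   %% child exits normally right away
    timer:sleep(10),                       %% give the child time to die
    Mon = erlang:monitor(process, Child),  %% shutdown sequence starts here
    exit(Child, shutdown),
    receive
        {'DOWN', Mon, process, Child, Reason} ->
            Reason                         %% noproc, not the child's real exit reason
    end.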

@u3s
Contributor Author

u3s commented Sep 6, 2024

We found a race condition taking place when a supervision tree is terminated.
It may manifest as supervisor error messages like:

*** System report during ssh_connection_SUITE:start_shell_exec/1 2024-09-05 07:37:02.617 ***
=SUPERVISOR REPORT==== 5-Sep-2024::07:37:02.617303 ===
    supervisor: {<0.18286.0>,ssh_channel_sup}
    errorContext: shutdown_error
    reason: noproc
    offender: [{pid,<0.18295.0>},
               {id,#Ref<0.2066620995.2317877250.244959>},
               {mfargs,{ssh_server_channel,start_link,undefined}},
               {restart_type,temporary},
               {significant,false},
               {shutdown,5000},
               {child_type,worker}]

The error above comes from ssh test code that cleans up after execution: the connection is closed by the client and the server is stopped.
The connection closure may collide with the termination of the server's supervision tree.

We would like to remove the race condition in that particular case.

@Maria-12648430
Contributor

Ok, understood, thanks for explaining 🤗

Comment on lines 1528 to 1534
{links, Links} = process_info(self(), links),
case lists:member(Pid, Links) of
true ->
do_unlink_flush(Pid, Reason);
false -> %% Naughty child
do_flush(Pid, Reason)
end.
Contributor

A question re this, though. Is it guaranteed that, if a process was linked to the supervisor and process_info(self(), links) does not return its pid any more, the respective 'EXIT' message is then in the mailbox of the supervisor? Because if not, there still remains an (admittedly very tiny) window for a race condition.

Contributor

A question re this, though. Is it guaranteed that, if a process was linked to the supervisor and process_info(self(), links) does not return its pid any more, the respective 'EXIT' message is then in the mailbox of the supervisor?

Assuming that the link has been there since the child was spawned, yes.

@u3s u3s force-pushed the kuba/stdlib/noproc_shutdown_supervisor_fix2 branch from 06738dd to abee2d9 Compare September 6, 2024 14:40
@u3s
Contributor Author

u3s commented Sep 6, 2024

I've pushed some improvements, but the code is not working yet and I don't know why. I've pushed it for discussion purposes.

@IngelaAndin
Contributor

IngelaAndin commented Sep 6, 2024

I investigated it a bit and I think that the idea of actually making sure the EXIT has arrived is not feasible; it enforces receiving a lot of messages that were previously ignored. In that case the possible race that @Maria-12648430 thought about does not matter either. So I am now thinking the solution might be something like this. But this solution still breaks one test case, which is quite surprising.

unlink_flush(Pid, noproc = Reason) ->
    %% If local process is already terminated
    %% regard it as successful
    unlink(Pid),
    receive
        {'EXIT', Pid, Reason} ->
            shutdown
    after 0 ->
            shutdown
    end;
unlink_flush(Pid, Reason) ->
    unlink(Pid),
    receive
        {'EXIT', Pid, ActualReason} ->
            ActualReason
    after 0 -> 
            Reason   
    end.

The thing that breaks is:

{supervisor_SUITE,faulty_application_shutdown,1212}
{badmatch,{bad_return,{{app_faulty,start,[normal,[]]},{'EXI

I have to investigate further next week.

@Maria-12648430
Contributor

Maria-12648430 commented Sep 6, 2024

But still this solution breaks one test case that is quite surprising.

Just a wild guess without having read anything else: you have noproc = Reason in the head of the upper clause, and you also have Reason in {'EXIT', Pid, Reason} in the receive in the body. There should be no such 'EXIT' message, though, meaning the real 'EXIT' message remains in the mailbox. Could that be it?

@Maria-12648430
Contributor

Could that be it?

Hm... I just tried, but I can't reproduce the failure you mentioned. Whether I run it with the current code, with your snippet as you posted it, or with my suggested correction to it, the test passes every time.

@IngelaAndin
Contributor

IngelaAndin commented Sep 8, 2024

@Maria-12648430 I think you are right about the first clause, it should not match reason noproc. I am still not sure whether reason noproc should best result in normal or shutdown (but we do not think it should be regarded as an error). As I understand it, the unlink is there to avoid having to care about "late exit messages", and the reason for changing the down reason to the exit reason is that they will sometimes differ, and then the exit reason is the more informative one about what actually happened; but we cannot afford to wait for the exit reason if it has not yet arrived. I get the failure every time. On Friday I started looking into the application master, and maybe it needs adjusting, but I will continue investigating on Monday when I am back at work.

@IngelaAndin
Contributor

IngelaAndin commented Sep 9, 2024

@Maria-12648430 So I figured out what was wrong with the failing test case; it was my fault, I accidentally merged a bad test suite when making dynamic supervisor progress reports be logged at debug level. (So it was only a test suite problem, and I have fixed it.)

@u3s I think maybe the code should look like this:

unlink_flush(Pid, noproc) ->
    %% If the supervisor is terminating a local process that has already
    %% terminated, that operation is considered successful
    unlink_flush(Pid, normal);
unlink_flush(Pid, MonitorReason) ->
    %% Return ExitReason if possible, as when they differ ExitReason
    %% will be more informative
    unlink(Pid), %% We do not want an 'EXIT' that has not yet arrived to show up later on
    receive
        {'EXIT', Pid, ExitReason} ->
            ExitReason
    after 0 ->
            MonitorReason
    end.

Will it work with ssh?

@Maria-12648430
Contributor

Maria-12648430 commented Sep 9, 2024

@Maria-12648430 So I figured out what was wrong with the failing test case; it was my fault, I accidentally merged a bad test suite when making dynamic supervisor progress reports be logged at debug level. (So it was only a test suite problem, and I have fixed it.)

😅

@u3s I think maybe the code should look like this:
...

To me at least it looks sound 👍

I also think that normal is the more appropriate default reason. I don't know if, when monitor(process, Child) and exit(Child, shutdown) are called in that order, it is guaranteed that they arrive at the Child process in that same order or not. If it is guaranteed, then receiving a 'DOWN' message with reason noproc means the child exited on its own, i.e. before the monitor and the subsequent exit signals arrived. If it is not guaranteed, i.e. if it is possible that the exit signal may overtake the monitor signal, then we don't know, and then normal is as good as shutdown.
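
As a small complementary illustration (again just a standalone sketch with a made-up function name, not supervisor code): if the child is still alive when the supervisor shuts it down, the ordering guarantee means the 'DOWN' reason reflects the shutdown exit we sent, never noproc:

demo_alive() ->
    process_flag(trap_exit, true),
    Child = spawn_link(fun() -> receive stop -> ok end end),  %% stays alive until killed
    Mon = erlang:monitor(process, Child),
    exit(Child, shutdown),                 %% the exit signal cannot overtake the monitor
    receive
        {'DOWN', Mon, process, Child, Reason} ->
            Reason                         %% shutdown here; noproc only if already dead
    end.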

@rickard-green
Contributor

rickard-green commented Sep 10, 2024

I don't know if, when monitor(process, Child) and exit(Child, shutdown) are called in that order, it is guaranteed that they arrive at the Child process in that same order or not.

Yes, it is guaranteed.

Assuming that the child adheres to the protocol (i.e., does not unlink itself from the supervisor), the exit message is guaranteed to eventually arrive. That is, just waiting for the EXIT message forever would be good enough to make sure we get the actual exit reason.

Trying to handle children that do not adhere to the protocol, however, becomes tricky. A hostile child could even unlink itself and then send an exit signal using exit/2 and continue to execute.

One could argue that one should not even try to handle such scenarios. That is, expect the child to adhere to the protocol. If it doesn't, just let the supervisor hang. I don't know...
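
To make the "very naughty child" concrete, here is a tiny standalone sketch (made-up names, assuming the supervisor traps exits) of a child that breaks the contract in exactly this way:

-module(hostile_child_sketch).
-export([start/1]).

%% The child unlinks from its supervisor, sends a fake exit signal with
%% exit/2 (which the trapping supervisor receives as an 'EXIT' message
%% apparently coming from the child), and then simply keeps running.
start(Supervisor) ->
    spawn(fun() ->
                  link(Supervisor),                %% behave like a proper child at first
                  unlink(Supervisor),              %% ...then break the contract
                  exit(Supervisor, feigned_death), %% feign our own death
                  timer:sleep(infinity)            %% and continue to execute
          end).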

@Maria-12648430
Contributor

I don't know if, when monitor(process, Child) and exit(Child, shutdown) are called in that order, it is guaranteed that they arrive at the Child process in that same order or not.

Yes, it is guaranteed.

In that case, normal should indeed be the most appropriate default exit reason (@IngelaAndin). Or rather, shutdown is then not the appropriate default exit reason, because the child definitely did not exit as a consequence of the exit(Child, shutdown) from the supervisor; it was already dead before that.

Assuming that the child adheres to the protocol (i.e., does not unlink itself from the supervisor), the exit message is guaranteed to eventually arrive. That is, just waiting for the EXIT message forever would be good enough to make sure we get the actual exit reason.

In a perfect world where everybody plays by the rules... yes 😬

Trying to handle children that do not adhere to the protocol, however, becomes tricky. A hostile child could even unlink itself and then send an exit signal using exit/2 and continue to execute.

One could argue that one should not even try to handle such scenarios. That is, expect the child to adhere to the protocol. If it doesn't, just let the supervisor hang. I don't know...

Since naughty children have been tolerated (just frowned upon), like, forever, I think that ship has sailed. There may even be sorta-kinda-valid use cases for that. I have not encountered any as of yet, but then again, I don't know 🤷‍♀️

@Maria-12648430
Contributor

Maria-12648430 commented Sep 11, 2024

I don't know if, when monitor(process, Child) and exit(Child, shutdown) are called in that order, it is guaranteed that they arrive at the Child process in that same order or not.

Yes, it is guaranteed.

@IngelaAndin @u3s I was just thinking... Maybe this is a bit hare-brained or at least overkill-ish, but given what @rickard-green said, couldn't we... 😅

Like, when terminating a child C, we...:

  1. set up a monitor M1 on C
  2. send C a shutdown exit
  3. set up a second monitor M2 on C
  4. wait for a 'DOWN' message for M2; this is only to confirm the indisputable death of C; the reason may be anything, it doesn't matter; now, assuming I understood the implications of what @rickard-green said correctly, either...
    • the related 'EXIT' message is in the message queue if C was either a good child, or a very naughty child trying to feign its own death; or...
    • there will never be any related 'EXIT' message if C was a merely naughty child that has unlinked
  5. receive the 'DOWN' message for M1, and if...
    • the reason is not noproc, this is the actual exit reason for the death of C; flush out the corresponding 'EXIT' message with a timeout of 0; the 'EXIT' message itself may have been genuine or feigned, it doesn't matter
    • the reason is noproc, receive the corresponding 'EXIT' message with a timeout of 0, and if...
      • there is such an 'EXIT' message, it either means that C was a good child which just exited by itself: the included reason is the actual reason; or a very naughty child which went as far as trying to feign its own death but then actually exited by itself: this reason is the best we can get then
      • there is no such 'EXIT' message, it means that C was a merely naughty child which exited by itself; we use normal as the exit reason, this is our best guess

(This is a bit simplified, there are shutdown timeouts etc. to be taken into consideration of course. But I think you get the general idea; a rough sketch follows below.)
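
For concreteness, a rough standalone sketch of the idea above (hypothetical helper name, shutdown timeouts and error handling omitted; this is not the code proposed in this PR, and it assumes the caller traps exits, as supervisors do):

terminate_child_sketch(C) ->
    M1 = erlang:monitor(process, C),          %% 1. first monitor
    exit(C, shutdown),                        %% 2. shutdown exit signal
    M2 = erlang:monitor(process, C),          %% 3. second monitor
    receive
        {'DOWN', M2, process, C, _} -> ok     %% 4. C is now indisputably dead
    end,
    receive
        {'DOWN', M1, process, C, noproc} ->
            %% 5. reason noproc: look for a corresponding 'EXIT'
            receive
                {'EXIT', C, ExitReason} -> ExitReason  %% good (or feigning) child
            after 0 -> normal                          %% unlinked child; best guess
            end;
        {'DOWN', M1, process, C, Reason} ->
            %% 5. not noproc: Reason is the actual exit reason; flush any 'EXIT'
            receive {'EXIT', C, _} -> ok after 0 -> ok end,
            Reason
    end.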

@IngelaAndin
Contributor

Right now we are contemplating something like:

unlink_flush(Pid, noproc) ->
    {links, Ls} = process_info(self(),links),
    Timeout = case lists:member(Pid, Ls) of
                  true -> infinity;
                  false -> 0
              end,
    receive
        {'EXIT', Pid, ExitReason} ->
            ExitReason
    after Timeout ->
            naughty_child
    end;
unlink_flush(Pid, MonitorReason) ->
    unlink(Pid),
    receive
        {'EXIT', Pid, ExitReason} ->
            ExitReason
    after 0 ->
            MonitorReason
    end.

That way we only perform the link check (which takes too much time in the normal case) when the monitor reason is noproc, ensuring we will get the correct EXIT for compliant code in this edge case. If children are naughty, the correctness guarantees might not be fulfilled, but that is OK as the child has already broken the contract!

@u3s u3s force-pushed the kuba/stdlib/noproc_shutdown_supervisor_fix2 branch 2 times, most recently from d4a4059 to 6e6e5fc Compare September 11, 2024 15:21
@u3s u3s added the testing currently being tested, tag is used by OTP internal CI label Sep 11, 2024
@u3s u3s removed the testing currently being tested, tag is used by OTP internal CI label Sep 12, 2024
- handle termination race in supervisor
- the race happens when the supervisor shutdown procedure
- collides with a child terminating for another reason
-
- in ssh tests this happens when the client closes the connection
- which propagates to the server over the network
- and results in termination of connection processes on the server side
- in parallel, the server is shut down during test cleanup
@u3s u3s force-pushed the kuba/stdlib/noproc_shutdown_supervisor_fix2 branch from 6e6e5fc to edbf3eb Compare September 16, 2024 05:57
@u3s
Contributor Author

u3s commented Sep 16, 2024

FYI, edbf3eb passed the tests and seems to fix the issue observed with the ssh tests.

@IngelaAndin
Contributor

@rickard-green you wanted to add comments?

@u3s u3s added the testing currently being tested, tag is used by OTP internal CI label Sep 19, 2024
@u3s
Contributor Author

u3s commented Sep 23, 2024

ping @rickard-green

@u3s u3s force-pushed the kuba/stdlib/noproc_shutdown_supervisor_fix2 branch from e176651 to 1dd099d Compare September 26, 2024 07:16
@u3s u3s merged commit 602d398 into erlang:maint Sep 28, 2024
16 checks passed