
stdlib: supervisor unlink_flush improvement #8780

Merged
merged 2 commits from kuba/stdlib/noproc_shutdown_supervisor_fix2 into erlang:maint
Sep 28, 2024

Conversation

Contributor

@u3s u3s commented Sep 5, 2024

  • handle the race between an orderly shutdown due to supervisor termination and the worker terminating on its own
  • make sure to report the correct error reason if possible

@u3s u3s added the team:PS Assigned to OTP team PS label Sep 5, 2024
Contributor

github-actions bot commented Sep 5, 2024

CT Test Results

2 files, 95 suites, 56m 57s ⏱️
2 153 tests: 2 105 ✅, 48 💤, 0 ❌
2 512 runs: 2 462 ✅, 50 💤, 0 ❌

Results for commit 1dd099d.

♻️ This comment has been updated with latest results.

To speed up review, make sure that you have read Contributing to Erlang/OTP and that all checks pass.

See the TESTING and DEVELOPMENT HowTo guides for details about how to run tests locally.

Artifacts

// Erlang/OTP Github Action Bot

@u3s u3s self-assigned this Sep 5, 2024
@u3s u3s added testing currently being tested, tag is used by OTP internal CI and removed testing currently being tested, tag is used by OTP internal CI labels Sep 5, 2024
@u3s u3s force-pushed the kuba/stdlib/noproc_shutdown_supervisor_fix2 branch from 2ceed3a to 684f167 Compare September 6, 2024 08:54
after 0 ->
NotExitReason
end
end;
unlink_flush(Pid, DefaultReason) ->
Contributor

I think we need to do something like this, as we want to change noproc = Reason into a normal case, since that is the race.

unlink_flush(Pid, Reason) ->
    %% We call unlink in order to guarantee that the 'EXIT' has arrived
    %% from the dead process. See the unlink docs for details.
    {links, Links} = process_info(self(), links),
    case lists:member(Pid, Links) of
        true ->
            do_unlink_flush(Pid, Reason);
        false  -> %% Naughty child
            flush(Pid, Reason)
    end.

do_unlink_flush(Pid, noproc = Reason) ->
    unlink(Pid),
    receive
        {'EXIT', Pid, Reason} ->
            normal
    end;
do_unlink_flush(Pid, Reason) ->
    unlink(Pid),
    receive
        {'EXIT', Pid, Reason} ->
            Reason
    end.

flush(Pid, NoExit) ->
    receive
        {'EXIT', Pid, Reason} ->
            Reason
    after 0 ->
            NoExit
    end.

@u3s u3s force-pushed the kuba/stdlib/noproc_shutdown_supervisor_fix2 branch from 684f167 to 06738dd Compare September 6, 2024 09:37
@Maria-12648430
Contributor

@u3s @IngelaAndin since I was involved in bringing up the current unlink_flush etc, could you please explain to me what this is about? Because TBH, I really don't get it 😅

@IngelaAndin
Contributor

@Maria-12648430 This is about a race condition when the application is being stopped: supervisors are shutting down their children while children might be terminating on their own at the same time. The supervisor also tries to avoid hanging if a child process misbehaves and unlinks itself. So when a supervisor is terminating a child and gets a 'DOWN' monitor message for the child with reason noproc, because the child terminated normally just before, we can get an error report saying that the child could not be shut down as it did not exist, since the exit message has not yet arrived when the hang prevention kicks in. We think that should not happen.
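
For illustration, here is a minimal standalone sketch (not the actual supervisor.erl code; the module and function names are made up) of how the noproc race can arise when the child exits normally just before the supervisor starts its monitor-and-exit shutdown sequence:

-module(noproc_race_sketch).
-export([demo/0]).

%% The child terminates normally on its own; the supervisor-style
%% shutdown that follows then sets up a monitor on an already dead
%% process, so the 'DOWN' arrives with reason noproc even though the
%% linked 'EXIT' with reason normal is (or soon will be) in the mailbox.
demo() ->
    process_flag(trap_exit, true),
    Child = spawn_link(fun() -> ok end),   %% child exits normally right away
    timer:sleep(10),                       %% give the child time to die
    Mon = erlang:monitor(process, Child),  %% shutdown sequence starts here
    exit(Child, shutdown),
    receive
        {'DOWN', Mon, process, Child, Reason} ->
            Reason                         %% noproc, not the child's real exit reason
    end.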

@u3s
Contributor Author

u3s commented Sep 6, 2024

We found a race condition taking place when a supervision tree is terminated.
It may manifest as supervisor error messages like:

*** System report during ssh_connection_SUITE:start_shell_exec/1 2024-09-05 07:37:02.617 ***
=SUPERVISOR REPORT==== 5-Sep-2024::07:37:02.617303 ===
    supervisor: {<0.18286.0>,ssh_channel_sup}
    errorContext: shutdown_error
    reason: noproc
    offender: [{pid,<0.18295.0>},
               {id,#Ref<0.2066620995.2317877250.244959>},
               {mfargs,{ssh_server_channel,start_link,undefined}},
               {restart_type,temporary},
               {significant,false},
               {shutdown,5000},
               {child_type,worker}]

The error above comes from ssh test code that cleans up after execution: the connection is closed by the client and the server is stopped.
The connection closure may collide with the termination of the server's supervision tree.

We would like to remove the race condition in that particular case.

@Maria-12648430
Contributor

Ok, understood, thanks for explaining 🤗

Comment on lines 1528 to 1534
{links, Links} = process_info(self(), links),
case lists:member(Pid, Links) of
true ->
do_unlink_flush(Pid, Reason);
false -> %% Naughty child
do_flush(Pid, Reason)
end.
Contributor

A question re this, though. Is it guaranteed that, if a process was linked to the supervisor and process_info(self(), links) does not return its pid any more, the respective 'EXIT' message is then in the mailbox of the supervisor? Because if not, there still remains an (admittedly very tiny) window for a race condition.

Contributor

A question re this, though. Is it guaranteed that, if a process was linked to the supervisor and process_info(self(), links) does not return its pid any more, the respective 'EXIT' message is then in the mailbox of the supervisor?

Assuming that the link has been there since the child was spawned, yes.

@u3s u3s force-pushed the kuba/stdlib/noproc_shutdown_supervisor_fix2 branch from 06738dd to abee2d9 Compare September 6, 2024 14:40
@u3s
Contributor Author

u3s commented Sep 6, 2024

I've pushed some improvements, but the code is not working yet and I don't know why. I've pushed it for discussion purposes.

@IngelaAndin
Contributor

IngelaAndin commented Sep 6, 2024

I investigated it a bit and I think that the idea of actually making sure the EXIT has arrived is not feasible; it enforces receiving a lot of messages that were previously ignored. In that case the possible race that @Maria-12648430 thought about does not matter either. So I am now thinking the solution might be something like this. But this solution still breaks one test case, which is quite surprising.

unlink_flush(Pid, noproc = Reason) ->
    %% If local process is already terminated
    %% regard it as successful
    unlink(Pid),
    receive
        {'EXIT', Pid, Reason} ->
            shutdown
    after 0 ->
            shutdown
    end;
unlink_flush(Pid, Reason) ->
    unlink(Pid),
    receive
        {'EXIT', Pid, ActualReason} ->
            ActualReason
    after 0 -> 
            Reason   
    end.

The thing that breaks is:

{supervisor_SUITE,faulty_application_shutdown,1212}
{badmatch,{bad_return,{{app_faulty,start,[normal,[]]},{'EXI

I have to investigate further next week.

@Maria-12648430
Contributor

Maria-12648430 commented Sep 6, 2024

But still this solution breaks one test case that is quite surprising.

Just a wild guess without having read anything else: you have noproc = Reason in the head of the upper clause, and you also have Reason in {'EXIT', Pid, Reason} in the receive in the body. There should be no such 'EXIT' message, though, meaning the real 'EXIT' message remains in the mailbox. Could that be it?

@Maria-12648430
Contributor

Could that be it?

Hm... I just tried, but I can't reproduce the failure you mentioned. Whether I run it with the current code, with your snippet as you posted it, or with my suggested correction to it, the test passes every time.

@IngelaAndin
Contributor

IngelaAndin commented Sep 8, 2024

@Maria-12648430 I think you are right about the first clause, it should not match reason noproc. I am still not sure whether reason noproc should best result in normal or shutdown (but we do not think it should be regarded as an error). As I understand it, the unlink is there to avoid having to care about "late exit messages", and the reason for changing the down reason to the exit reason is that they will sometimes differ, and then the exit reason is the more informative one about what actually happened; but we cannot afford to wait for the exit reason if it has not yet arrived. I get the failure every time. On Friday I started looking into the application master, and maybe it needs adjusting, but I will continue investigating on Monday when I am back at work.

@IngelaAndin
Contributor

IngelaAndin commented Sep 9, 2024

@Maria-12648430 So I figured out what was wrong with the failing test case; it was my fault, I accidentally merged a bad test suite when making dynamic supervisor progress reports be logged at debug level. (So it was only a test suite problem, and I have fixed it.)

@u3s I think maybe the code should look like this:

unlink_flush(Pid, noproc) ->
    %% If the supervisor is terminating a local process that has already
    %% terminated, that operation is considered successful
    unlink_flush(Pid, normal);
unlink_flush(Pid, MonitorReason) ->
    %% Return ExitReason if possible, as when they differ ExitReason
    %% will be more informative
    unlink(Pid), %% We do not want an 'EXIT' that has not yet arrived to show up later on
    receive
        {'EXIT', Pid, ExitReason} ->
            ExitReason
    after 0 ->
            MonitorReason
    end.

Will it work with ssh?

@Maria-12648430
Contributor

Maria-12648430 commented Sep 9, 2024

@Maria-12648430 So I figured out what was wrong with the failing test case; it was my fault, I accidentally merged a bad test suite when making dynamic supervisor progress reports be logged at debug level. (So it was only a test suite problem, and I have fixed it.)

😅

@u3s I think maybe the code should look like this:
...

To me at least it looks sound 👍

I also think that normal is the more appropriate default reason. I don't know if, when monitor(process, Child) and exit(Child, shutdown) are called in that order, it is guaranteed that they arrive at the Child process in that same order or not. If it is guaranteed, then receiving a 'DOWN' message with reason noproc means the child exited on its own, i.e. before the monitor and the subsequent exit signals arrived. If it is not guaranteed, i.e. if it is possible that the exit signal may overtake the monitor signal, then we don't know, and then normal is as good as shutdown.
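
As a small complementary illustration (again just a standalone sketch with a made-up function name, not supervisor code): if the child is still alive when the supervisor shuts it down, the ordering guarantee means the 'DOWN' reason reflects the shutdown exit we sent, never noproc:

demo_alive() ->
    process_flag(trap_exit, true),
    Child = spawn_link(fun() -> receive stop -> ok end end),  %% stays alive until killed
    Mon = erlang:monitor(process, Child),
    exit(Child, shutdown),                 %% the exit signal cannot overtake the monitor
    receive
        {'DOWN', Mon, process, Child, Reason} ->
            Reason                         %% shutdown here; noproc only if already dead
    end.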

@rickard-green
Contributor

rickard-green commented Sep 10, 2024

I don't know if, when monitor(process, Child) and exit(Child, shutdown) are called in that order, it is guaranteed that they arrive at the Child process in that same order or not.

Yes, it is guaranteed.

Assuming that the child adheres to the protocol (i.e., does not unlink itself from the supervisor), the exit message is guaranteed to eventually arrive. That is, just waiting for the EXIT message forever would be good enough to make sure we get the actual exit reason.

Trying to handle children that do not adhere to the protocol, however, becomes tricky. A hostile child could even unlink itself and then send an exit signal using exit/2 and continue to execute.

One could argue that one should not even try to handle such scenarios. That is, expect the child to adhere to the protocol. If it doesn't, just let the supervisor hang. I don't know...
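
To make the "very naughty child" concrete, here is a tiny standalone sketch (made-up names, assuming the supervisor traps exits) of a child that breaks the contract in exactly this way:

-module(hostile_child_sketch).
-export([start/1]).

%% The child unlinks from its supervisor, sends a fake exit signal with
%% exit/2 (which the trapping supervisor receives as an 'EXIT' message
%% apparently coming from the child), and then simply keeps running.
start(Supervisor) ->
    spawn(fun() ->
                  link(Supervisor),                %% behave like a proper child at first
                  unlink(Supervisor),              %% ...then break the contract
                  exit(Supervisor, feigned_death), %% feign our own death
                  timer:sleep(infinity)            %% and continue to execute
          end).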

@Maria-12648430
Contributor

I don't know if, when monitor(process, Child) and exit(Child, shutdown) are called in that order, it is guaranteed that they arrive at the Child process in that same order or not.

Yes, it is guaranteed.

In that case, normal should indeed be the most appropriate default exit reason (@IngelaAndin). Or rather, shutdown is then not the appropriate default exit reason, because the child definitely did not exit as a consequence of the exit(Child, shutdown) from the supervisor; it was already dead before that.

Assuming that the child adheres to the protocol (i.e., does not unlink itself from the supervisor), the exit message is guaranteed to eventually arrive. That is, just waiting for the EXIT message forever would be good enough to make sure we get the actual exit reason.

In a perfect world where everybody plays by the rules... yes 😬

Trying to handle children that do not adhere to the protocol, however, becomes tricky. A hostile child could even unlink itself and then send an exit signal using exit/2 and continue to execute.

One could argue that one should not even try to handle such scenarios. That is, expect the child to adhere to the protocol. If it doesn't, just let the supervisor hang. I don't know...

Since naughty children have been tolerated (just frowned upon), like, forever, I think that ship has sailed. There may even be sorta-kinda-valid use cases for that. I have not encountered any as of yet, but then again, I don't know 🤷‍♀️

@Maria-12648430
Contributor

Maria-12648430 commented Sep 11, 2024

I don't know if, when monitor(process, Child) and exit(Child, shutdown) are called in that order, it is guaranteed that they arrive at the Child process in that same order or not.

Yes, it is guaranteed.

@IngelaAndin @u3s I was just thinking... Maybe this is a bit hare-brained or at least overkill-ish, but given what @rickard-green said, couldn't we... 😅

Like, when terminating a child C, we...:

  1. set up a monitor M1 on C
  2. send C a shutdown exit
  3. set up a second monitor M2 on C
  4. wait for a 'DOWN' message for M2; this is only to confirm the indisputable death of C; the reason may be anything, it doesn't matter; now, assuming I understood the implications of what @rickard-green said correctly, either...
    • the related 'EXIT' message is in the message queue if C was either a good child, or a very naughty child trying to feign its own death; or...
    • there will never be any related 'EXIT' message if C was a merely naughty child that has unlinked
  5. receive the 'DOWN' message for M1, and if...
    • the reason is not noproc, this is the actual exit reason for the death of C; flush out the corresponding 'EXIT' message with a timeout of 0; the 'EXIT' message itself may have been genuine or feigned, it doesn't matter
    • the reason is noproc, receive the corresponding 'EXIT' message with a timeout of 0, and if...
      • there is such an 'EXIT' message, it either means that C was a good child which just exited by itself: the included reason is the actual reason; or a very naughty child which went as far as trying to feign its own death but then actually exited by itself: this reason is the best we can get then
      • there is no such 'EXIT' message, it means that C was a merely naughty child which exited by itself; we use normal as the exit reason, this is our best guess

(This is a bit simplified, there are shutdown timeouts etc. to be taken into consideration of course. But I think you get the general idea; a rough sketch follows below.)
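
For concreteness, a rough standalone sketch of the idea above (hypothetical helper name, shutdown timeouts and error handling omitted; this is not the code proposed in this PR, and it assumes the caller traps exits, as supervisors do):

terminate_child_sketch(C) ->
    M1 = erlang:monitor(process, C),          %% 1. first monitor
    exit(C, shutdown),                        %% 2. shutdown exit signal
    M2 = erlang:monitor(process, C),          %% 3. second monitor
    receive
        {'DOWN', M2, process, C, _} -> ok     %% 4. C is now indisputably dead
    end,
    receive
        {'DOWN', M1, process, C, noproc} ->
            %% 5. reason noproc: look for a corresponding 'EXIT'
            receive
                {'EXIT', C, ExitReason} -> ExitReason  %% good (or feigning) child
            after 0 -> normal                          %% unlinked child; best guess
            end;
        {'DOWN', M1, process, C, Reason} ->
            %% 5. not noproc: Reason is the actual exit reason; flush any 'EXIT'
            receive {'EXIT', C, _} -> ok after 0 -> ok end,
            Reason
    end.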

@IngelaAndin
Contributor

Right now we are contemplating something like:

unlink_flush(Pid, noproc) ->
    {links, Ls} = process_info(self(),links),
    Timeout = case lists:member(Pid, Ls) of
                  true -> infinity;
                  false -> 0
              end,
    receive
        {'EXIT', Pid, ExitReason} ->
            ExitReason
    after Timeout ->
            naughty_child
    end;
unlink_flush(Pid, MonitorReason) ->
    unlink(Pid),
    receive
        {'EXIT', Pid, ExitReason} ->
            ExitReason
    after 0 ->
            MonitorReason
    end.

That way we only perform the link check (which takes too much time in the normal case) when the monitor reason is noproc, ensuring we will get the correct EXIT for compliant code in this edge case. If children are naughty, the correctness guarantees might not be fulfilled, but that is OK as the child has already broken the contract!

@u3s u3s force-pushed the kuba/stdlib/noproc_shutdown_supervisor_fix2 branch 2 times, most recently from d4a4059 to 6e6e5fc Compare September 11, 2024 15:21
@u3s u3s added the testing currently being tested, tag is used by OTP internal CI label Sep 11, 2024
@u3s u3s removed the testing currently being tested, tag is used by OTP internal CI label Sep 12, 2024
- handle termination race in supervisor
- the race happens when the supervisor shutdown procedure
- collides with a child terminating for another reason
-
- in ssh tests this happens when the client closes the connection
- which propagates to the server over the network
- and results in termination of connection processes on the server side
- in parallel, the server is shut down during test cleanup
@u3s u3s force-pushed the kuba/stdlib/noproc_shutdown_supervisor_fix2 branch from 6e6e5fc to edbf3eb Compare September 16, 2024 05:57
@u3s
Contributor Author

u3s commented Sep 16, 2024

FYI, edbf3eb passed the tests and seems to fix the issue observed with the ssh tests.

@IngelaAndin
Contributor

@rickard-green you wanted to add comments?

@u3s u3s added the testing currently being tested, tag is used by OTP internal CI label Sep 19, 2024
@u3s
Contributor Author

u3s commented Sep 23, 2024

ping @rickard-green

@u3s u3s force-pushed the kuba/stdlib/noproc_shutdown_supervisor_fix2 branch from e176651 to 1dd099d Compare September 26, 2024 07:16
@u3s u3s merged commit 602d398 into erlang:maint Sep 28, 2024
16 checks passed