Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

stdlib: supervisor unlink_flush improvement #8780

Open
wants to merge 1 commit into
base: maint
Choose a base branch
from

Conversation

u3s
Copy link
Contributor

@u3s u3s commented Sep 5, 2024

  • handling race between orderly shutdown from supervisor termination + and worker termination
  • make sure to report and correct error reason if possible

@u3s u3s added the team:PS Assigned to OTP team PS label Sep 5, 2024
Copy link
Contributor

github-actions bot commented Sep 5, 2024

CT Test Results

    2 files     95 suites   1h 0m 33s ⏱️
2 153 tests 2 102 ✅ 48 💤 3 ❌
2 512 runs  2 459 ✅ 50 💤 3 ❌

For more details on these failures, see this check.

Results for commit abee2d9.

♻️ This comment has been updated with latest results.

To speed up review, make sure that you have read Contributing to Erlang/OTP and that all checks pass.

See the TESTING and DEVELOPMENT HowTo guides for details about how to run test locally.

Artifacts

// Erlang/OTP Github Action Bot

@u3s u3s self-assigned this Sep 5, 2024
@u3s u3s added testing currently being tested, tag is used by OTP internal CI and removed testing currently being tested, tag is used by OTP internal CI labels Sep 5, 2024
@u3s u3s force-pushed the kuba/stdlib/noproc_shutdown_supervisor_fix2 branch from 2ceed3a to 684f167 Compare September 6, 2024 08:54
after 0 ->
NotExitReason
end
end;
unlink_flush(Pid, DefaultReason) ->
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we need to do something like this so as we want to change noproc = Reaosn to be a normal case as that is the race.

unlink_flush(Pid, Reaon) ->
    %% We call unlink in order to guarantee that the 'EXIT' has arrived
    %% from the dead process. See the unlink docs for details.
    {links, Links} = process_info(self(), links),
    case lists:member(Pid, Links) of
        true ->
            do_unlink_flush(Pid, Reason);
        false  -> %% Naughty child
            flush(Pid, Reason)
    end.

do_unlink_flush(Pid, noproc = Reason) ->
    unlink(Pid),
    receive
        {'EXIT', Pid, Reason} ->
            normal
    end;
do_unlink_flush(Pid, Reason) ->
    unlink(Pid),
    receive
        {'EXIT', Pid, Reason} ->
            Reason
    end.

flush(Pid, NoExit) ->
    receive
        {'EXIT', Pid, Reason} ->
            Reason
    after 0 ->
            NoExit
    end.

@u3s u3s force-pushed the kuba/stdlib/noproc_shutdown_supervisor_fix2 branch from 684f167 to 06738dd Compare September 6, 2024 09:37
@Maria-12648430
Copy link
Contributor

@u3s @IngelaAndin since I was involved in bringing up the current unlink_flush etc, could you please explain to me what this is about? Because TBH, I really don't get it 😅

@IngelaAndin
Copy link
Contributor

@Maria-12648430 This is about a race condition when the application is being stopped, so supervisors are shutting down their children and children might be terminating on their own in the same time. Also the supervisor is trying to avoid hanging if a child process behaves bad and does unlink. So when a supervisor is terminating a child and gets
a sown monitor message for the child with reason noproc, that just before terminated normally, and we can get an error report saying that the child could not be shutdown as it did not exist, due to the exit message has not arrived yet when hanging prevention kicks in. That we think should not happen.

@u3s
Copy link
Contributor Author

u3s commented Sep 6, 2024

We found a race condition taking place when supervision tree is terminated.
It might manifest with error messages from supervisor like:

*** System report during ssh_connection_SUITE:start_shell_exec/1 2024-09-05 07:37:02.617 ***[🔗](https://otp.ericsson.se/product/internal/test/test_results/progress_28/2024_09_04/otp_28_aramis_linux-gnu_x86_64_64-bit_jit_s4_a4_offheapmsgq_meamin/[email protected]_07.33.49/test.ssh_test.logs/run.2024-09-05_07.34.17/ssh_connection_suite.start_shell_exec.html#e-10825)
=SUPERVISOR REPORT==== 5-Sep-2024::07:37:02.617303 ===
    supervisor: {<0.18286.0>,ssh_channel_sup}
    errorContext: shutdown_error
    reason: noproc
    offender: [{pid,<0.18295.0>},
               {id,#Ref<0.2066620995.2317877250.244959>},
               {mfargs,{ssh_server_channel,start_link,undefined}},
               {restart_type,temporary},
               {significant,false},
               {shutdown,5000},
               {child_type,worker}]

Above error comes from ssh test code which cleans up after execution - connection is closed by the client and server is stopped.
Connection closure might collide with server supervision tree termination.

we would like to remove the race condition in that particular case.

@Maria-12648430
Copy link
Contributor

Ok, understood, thanks for explaining 🤗

Comment on lines +1528 to +1534
{links, Links} = process_info(self(), links),
case lists:member(Pid, Links) of
true ->
do_unlink_flush(Pid, Reason);
false -> %% Naughty child
do_flush(Pid, Reason)
end.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A question re this, though. Is it guaranteed that, if a process was linked to the supervisor and process_info(self(), links) does not return its pid any more, the respective 'EXIT' message is then in the mailbox of the supervisor? Because if not, there still remains an (admittedly very tiny) window for a race condition.

- handling race between orderly shutdown from supervisor termination +
  and worker termination
- make sure to report and correct error reason if possible
@u3s u3s force-pushed the kuba/stdlib/noproc_shutdown_supervisor_fix2 branch from 06738dd to abee2d9 Compare September 6, 2024 14:40
@u3s
Copy link
Contributor Author

u3s commented Sep 6, 2024

I've pushed some improvements but code is not working yet.
I don't know why. I've pushed for discussion purposes.

@IngelaAndin
Copy link
Contributor

IngelaAndin commented Sep 6, 2024

I investigated it a bit and I think that the ide of actually making sure the EXIT has arrived is not feasible, it enforces a lot of receiving of messages that was before ignored. And in that case possible race that @Maria-12648430 though about does not matter either. So I am more thinking the solution might be something like this. But still this solution breaks one test case that
is quite surprising .

unlink_flush(Pid, noproc = Reason) ->
    %% If local process is already terminated
    %% regard it as successful
    unlink(Pid),
    receive
        {'EXIT', Pid, Reason} ->
            shutdown
    after 0 ->
            shutdown
    end;
unlink_flush(Pid, Reason) ->
    unlink(Pid),
    receive
        {'EXIT', Pid, ActualReason} ->
            ActualReason
    after 0 -> 
            Reason   
    end.

The thing that breaks is:

{supervisor_SUITE,faulty_application_shutdown,[1212](https://github.com/erlang/otp/pull/supervisor_suite.src.html#1212)}
{badmatch,{bad_return,{{app_faulty,start,[normal,[]]},{'EXI

have to investigate further next week.

@Maria-12648430
Copy link
Contributor

Maria-12648430 commented Sep 6, 2024

But still this solution breaks one test case that is quite surprising.

Just a wild guess without having read anything else: you have noproc = Reason in the head of the upper clause, and you also have Reason in {'EXIT', Pid, Reason} in the receive in the body. There should be no such 'EXIT' message, though, meaning the real 'EXIT' message remains in the mailbox. Could that be it?

@Maria-12648430
Copy link
Contributor

Could that be it?

Hm... I just tried, but can't reproduce the failure you mentioned. No matter if I run it with the current code, with your snippet as you posted it, or with my suggested correction on it: the test passes every time.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
team:PS Assigned to OTP team PS
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants