-
Notifications
You must be signed in to change notification settings - Fork 2.9k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
stdlib: supervisor unlink_flush improvement #8780
stdlib: supervisor unlink_flush improvement #8780
Conversation
u3s
commented
Sep 5, 2024
- handling race between orderly shutdown from supervisor termination + and worker termination
- make sure to report and correct error reason if possible
CT Test Results 2 files 95 suites 56m 57s ⏱️ Results for commit 1dd099d. ♻️ This comment has been updated with latest results. To speed up review, make sure that you have read Contributing to Erlang/OTP and that all checks pass. See the TESTING and DEVELOPMENT HowTo guides for details about how to run test locally. Artifacts// Erlang/OTP Github Action Bot |
2ceed3a
to
684f167
Compare
lib/stdlib/src/supervisor.erl
Outdated
after 0 -> | ||
NotExitReason | ||
end | ||
end; | ||
unlink_flush(Pid, DefaultReason) -> |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think we need to do something like this so as we want to change noproc = Reaosn to be a normal case as that is the race.
unlink_flush(Pid, Reaon) ->
%% We call unlink in order to guarantee that the 'EXIT' has arrived
%% from the dead process. See the unlink docs for details.
{links, Links} = process_info(self(), links),
case lists:member(Pid, Links) of
true ->
do_unlink_flush(Pid, Reason);
false -> %% Naughty child
flush(Pid, Reason)
end.
do_unlink_flush(Pid, noproc = Reason) ->
unlink(Pid),
receive
{'EXIT', Pid, Reason} ->
normal
end;
do_unlink_flush(Pid, Reason) ->
unlink(Pid),
receive
{'EXIT', Pid, Reason} ->
Reason
end.
flush(Pid, NoExit) ->
receive
{'EXIT', Pid, Reason} ->
Reason
after 0 ->
NoExit
end.
684f167
to
06738dd
Compare
@u3s @IngelaAndin since I was involved in bringing up the current |
@Maria-12648430 This is about a race condition when the application is being stopped, so supervisors are shutting down their children and children might be terminating on their own in the same time. Also the supervisor is trying to avoid hanging if a child process behaves bad and does unlink. So when a supervisor is terminating a child and gets |
We found a race condition taking place when supervision tree is terminated.
Above error comes from ssh test code which cleans up after execution - connection is closed by the client and server is stopped. we would like to remove the race condition in that particular case. |
Ok, understood, thanks for explaining 🤗 |
lib/stdlib/src/supervisor.erl
Outdated
{links, Links} = process_info(self(), links), | ||
case lists:member(Pid, Links) of | ||
true -> | ||
do_unlink_flush(Pid, Reason); | ||
false -> %% Naughty child | ||
do_flush(Pid, Reason) | ||
end. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
A question re this, though. Is it guaranteed that, if a process was linked to the supervisor and process_info(self(), links)
does not return its pid any more, the respective 'EXIT'
message is then in the mailbox of the supervisor? Because if not, there still remains an (admittedly very tiny) window for a race condition.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
A question re this, though. Is it guaranteed that, if a process was linked to the supervisor and
process_info(self(), links)
does not return its pid any more, the respective'EXIT'
message is then in the mailbox of the supervisor?
Assuming that the link has been there since the child was spawned, yes.
06738dd
to
abee2d9
Compare
I've pushed some improvements but code is not working yet. |
I investigated it a bit and I think that the ide of actually making sure the EXIT has arrived is not feasible, it enforces a lot of receiving of messages that was before ignored. And in that case possible race that @Maria-12648430 though about does not matter either. So I am more thinking the solution might be something like this. But still this solution breaks one test case that
The thing that breaks is:
have to investigate further next week. |
Just a wild guess without having read anything else: you have |
Hm... I just tried, but can't reproduce the failure you mentioned. No matter if I run it with the current code, with your snippet as you posted it, or with my suggested correction on it: the test passes every time. |
@Maria-12648430 I think you are right about the first clause, it should not match reason noproc. I am still a not sure if the reason noproc best should result in a normal or shutdown (but we do not think it should be regarded as an error). As I understand it the unlink is to avoid having to care about "late exit messages" and changing the down reason to the exit reason is that sometimes they will differ and then the exit reason is the more informative one of what actually happened, but we can not afford to wait for the exit reason if it has not yet arrived. I get the failure every time, on Friday I started looking into the application master and maybe it needs adjusting but I will continue investigating on Monday when I am back to work. |
@Maria-12648430 So I figured out what was wrong with the failing test case, it was my fault that accidently merged a bad test suite when making dynamic supervisor progress reports being debug logged. (So only a test suit problem, I have fixed it). @u3s I think maybe the code should look like this:
Will it work with ssh? |
😅
To me at least it looks sound 👍 I also think that |
Yes, it is guaranteed. Assuming that the child adhere to the protocol (i.e., does not unlink itself from the supervisor), the exit message is guaranteed to eventually arrive. That is, just wait for the Trying to handle children that do not adhere to the protocol, however, becomes tricky. A hostile child could even unlink itself and then send an exit signal using One could argue that one should not even try to handle such scenarios. That is, expect the child to adhere to the protocol. If it doesn't, just let the supervisor hang. I don't know... |
In that case,
In a perfect world where everybody plays by the rules... yes 😬
Since naughty children have been tolerated (just frowned upon), like, forever, I think that ship has sailed. There may even be sorta-kinda-valid use cases for that. I have not encountered any as of yet, but then again, I don't know 🤷♀️ |
@IngelaAndin @u3s I was just thinking... Maybe this is a bit hare-brained or at least overkill-ish, but given what @rickard-green said, couldn't we... 😅 Like, when terminating a child C, we...:
(This is a bit simplified, there are shutdown timeouts etc to be taken into consideration of course. But I think you get the general idea.) |
Right now we are contemplating something like:
That will only perform the link check (that takes to much time in the normal case) when the monitor reason is noproc and ensuring we will get the correct EXIT for compliant code in this edge-case if children are naughty the guarantees of correctness might not be fulfilled but that is ok as the child already broke the contract! |
d4a4059
to
6e6e5fc
Compare
- handle termination race in supervisor - race happens when supervisor shutdown procedure - collides with child terminating from other reason - - in ssh tests this happens when client closes connection - which propagates to server over network - and results with termination of connection processes on server side - in parallel, server being is shutdown during test cleanup
6e6e5fc
to
edbf3eb
Compare
FYI, edbf3eb passed tests and seem to fix issue observed with ssh tests |
@rickard-green you wanted to add comments ? |
ping @rickard-green |
Co-authored-by: Rickard Green <[email protected]>
e176651
to
1dd099d
Compare