Wait for Ra servers exit in delete_cluster/{1,2}
#270
Conversation
So far, when the `delete_cluster/{1,2}` function returned, it meant that the request to stop the cluster had been committed to the Ra log and applied. However, the actual shutdown of the Ra servers and of the related supervision tree happened asynchronously. This patch uses Erlang monitors to watch for the exit of the Ra servers. This way, we ensure that when the function returns, the Ra servers are really gone.
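The monitor-and-wait technique the patch describes can be sketched as follows. This is a minimal illustration of the pattern, not the actual Ra code: the function name, the pids argument, and the 30-second timeout are assumptions, not taken from the PR.

```erlang
%% Sketch of the monitor-and-wait pattern (assumed names, not Ra's code):
%% monitor each Ra server process, then block until every monitor
%% delivers a 'DOWN' message, i.e. until the servers are really gone.
wait_for_servers_exit(Pids) ->
    MRefs = [erlang:monitor(process, Pid) || Pid <- Pids],
    lists:foreach(
      fun(MRef) ->
              receive
                  {'DOWN', MRef, process, _Pid, _Reason} ->
                      ok
              after 30000 ->
                      %% Assumed timeout; a real implementation would
                      %% likely make this configurable.
                      exit(timeout_waiting_for_servers_exit)
              end
      end, MRefs).
```

Monitoring (rather than linking) keeps the caller alive regardless of the servers' exit reasons, and a `'DOWN'` message is delivered even if a server has already exited by the time the monitor is set up.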
I'm thinking this should be an option rather than the default but happy to discuss that further. I'm not super sure making Ra cluster deletions a log command was ever strictly necessary in the first place.
It can be an option but I think most would expect it to be on by default.
I added a
I forgot to ask before: do you prefer an application environment variable, a function argument, or both for this kind of setting?
In fact, I'm not so sure about the behavior of the
The caller can pass a flag to `delete_cluster/3` to indicate whether they want the function to wait for the servers to exit. They can also set the application environment variable `wait_for_servers_exit_in_delete_cluster` to define the global behavior. The default behavior is the same as the historical one, meaning the function doesn't wait for the servers to exit.
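The two configuration paths described above could look roughly like this. This is a hypothetical usage sketch of the (unmerged) PR's API: only the option names come from the thread, and the exact shape of the third argument to `delete_cluster/3` is an assumption.

```erlang
%% Hypothetical usage of this PR's proposed API (never merged).
%% The options-map shape and the 30000 ms timeout are assumptions.
delete_and_wait(ServerIds) ->
    %% Global default, via the application environment:
    application:set_env(ra, wait_for_servers_exit_in_delete_cluster, true),
    %% Or a per-call flag overriding the global default:
    ra:delete_cluster(ServerIds, 30000, #{wait_for_servers_exit => true}).
```

Offering both lets a library set a sane global default while still allowing a single caller (e.g. a testsuite) to opt in or out per call.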
A function option sounds sufficient to me. I would rename
The Ra leader will not exit until all followers have received the cluster delete command, so this option will never return when trying to delete a cluster where a single member is down. Is this really the behaviour you want?
I think an options argument (`wait_for_exits`) would be better so that the choice can be made per cluster.
I should have added more context to the issue's description. Khepri's testsuite did that:
I tried replacing the
In both cases,
(Erlang 24's supervisor kind of ignores the situation and doesn't log anything)
The workaround I use in Khepri's testsuite is what I put in this patch (monitor and wait for Ra servers to exit). To me, it makes sense that when
If a cluster member is down, I don't know. As you said, perhaps this shouldn't go through an applied command. Perhaps processes should simply delete their state (memory and disk) and terminate. On the other hand, I don't know if the current behavior is incorrect from a supervision tree point of view: Erlang 23 did log an error, but Erlang 24 doesn't anymore. Perhaps it's not worth the effort. I can keep the workaround in Khepri's testsuite and prepare another patch to give more details in
I'm closing this pull request as there is probably no demand for this, and the Khepri testsuite has worked fine with the current behavior for quite some time now.