Wait for Ra servers exit in delete_cluster/{1,2} #270

Closed
wants to merge 2 commits

Conversation

@dumbbell (Member) commented Apr 7, 2022

So far, when the `delete_cluster/{1,2}` function returned, it meant that the request to stop the cluster was committed to the Ra log and applied. However, the actual stop of the Ra servers and the related supervision tree was handled asynchronously.

This patch uses Erlang monitors to watch the exit of the Ra servers. This way, we ensure that when the function returns, the Ra servers are really gone.
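
For illustration, the waiting logic boils down to something like the following sketch (a minimal, simplified version, not the exact code from this patch; it assumes `ServerIds` is the list of `{Name, Node}` Ra server ids, which `erlang:monitor/2` accepts directly, and the 30-second timeout is arbitrary):

wait_for_servers_exit(ServerIds) ->
    %% Monitor every Ra server process, then wait for one 'DOWN' message
    %% per monitor reference. A server that is already gone still delivers
    %% a 'DOWN' message (with reason noproc), so this doesn't deadlock.
    Refs = [erlang:monitor(process, ServerId) || ServerId <- ServerIds],
    lists:foreach(
      fun(Ref) ->
          receive
              {'DOWN', Ref, process, _Pid, _Reason} -> ok
          after 30000 ->
              erlang:error(timeout_waiting_for_ra_servers)
          end
      end, Refs).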

@dumbbell force-pushed the wait-for-ra-servers-exit-in-delete_cluster branch from 22e24e8 to 864e7dc on April 7, 2022 at 13:53
@dumbbell marked this pull request as ready for review on April 7, 2022 at 16:09
@kjnilsson (Contributor)

I'm thinking this should be an option rather than the default but happy to discuss that further.

I'm not super sure making Ra cluster deletions a log command was ever strictly necessary in the first place.

@michaelklishin (Member)

It can be an option but I think most would expect it to be on by default.

@dumbbell (Member, Author) commented Apr 15, 2022

I added a `wait_for_servers_exit_in_delete_cluster` application environment variable to toggle the new behavior I propose, plus a `delete_cluster/3` function to let the caller choose the behavior in the arguments.

I forgot to ask before: do you prefer an application environment variable, a function argument or both for this kind of setting?

In fact, I'm not so sure about the behavior of the main branch and what this patch should do. Currently, the function blocks until the command is applied, but the servers' exit is asynchronous. If we introduce an option to toggle the wait for the servers' exit, does it mean the non-wait behavior should treat the Ra command itself as asynchronous too? As a user, if there is an option to wait or not, I would expect the function to be either entirely synchronous (including the servers' exit) or entirely asynchronous (not waiting for the Ra command either). Providing three states (synchronous, synchronous for the command only, or entirely asynchronous) seems too complicated for such a function.
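
For illustration, resolving such a setting from both a function argument and the application environment could look like this hypothetical helper (not code from this PR; `application:get_env/3` is the only real API used, the names and shape are assumptions):

wait_for_servers_exit_wanted(default) ->
    %% No explicit choice from the caller: fall back to the application
    %% environment variable, defaulting to the historical behavior (false).
    application:get_env(ra, wait_for_servers_exit_in_delete_cluster, false);
wait_for_servers_exit_wanted(Wait) when is_boolean(Wait) ->
    %% An explicit boolean argument passed to delete_cluster/3 wins.
    Wait.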

The caller can pass a flag to `delete_cluster/3` to indicate whether they want
the function to wait for the servers to exit. They can also set the
application environment variable
`wait_for_servers_exit_in_delete_cluster` to define the global behavior.

The default behavior is the same as the historical one, meaning the
function doesn't wait for the servers to exit.
@dumbbell force-pushed the wait-for-ra-servers-exit-in-delete_cluster branch from c987f3e to 044849c on April 15, 2022 at 11:37
@michaelklishin (Member)

A function option sounds sufficient to me. I would rename `wait_for_servers_exit_in_delete_cluster` to `wait_for_server_termination_when_deleting_clusters` if we are to keep it.

@kjnilsson (Contributor) commented Apr 16, 2022 via email

@dumbbell (Member, Author)

I should have added more context to the issue's description.

Khepri's testsuite did the following:

_ = ra:delete_cluster(ServerIds),
_ = supervisor:terminate_child(ra_systems_sup, RaSystem),
_ = supervisor:delete_child(ra_systems_sup, RaSystem),

I tried replacing the supervisor calls with a call to `application:stop(ra)`. In the end, I just want any Ra state to go away so I can start a fresh cluster for the next testcase.

In both cases, the supervisor logs an error with Erlang 23:

=SUPERVISOR REPORT==== 7-Apr-2022::08:37:55.154657 ===
    supervisor: {<0.2764.0>,ra_server_sup}
    errorContext: shutdown_error
    reason: {shutdown,delete}
    offender: [{pid,<0.2765.0>},
               {id,async_unset_in_put_test_},
               {mfargs,
                   {ra_server_proc,start_link,
...

(Erlang 24's supervisor kind of ignores the situation and doesn't log anything)

The workaround I use in Khepri's testsuite is what I put in this patch (monitor and wait for the Ra servers to exit). To me, it makes sense that when `ra:delete_cluster/2` returns, the cluster is gone (including its processes).

I don't know what should happen if a cluster member is down. As you said, perhaps this shouldn't go through an applied command. Perhaps processes should simply delete their state (memory and disk) and terminate.

On the other hand, I don't know if the current behavior is incorrect from a supervision tree point of view: Erlang 23 did log an error, but Erlang 24 doesn't anymore. Perhaps it's not worth the effort. I can keep the workaround in Khepri's testsuite and prepare another patch to add more details to the `delete_cluster/2` documentation.

@dumbbell (Member, Author)

I'm closing this pull request as there is probably no demand for this, and the Khepri testsuite has worked fine with the current behavior for quite some time now.

@dumbbell closed this on Jul 26, 2024
@dumbbell deleted the wait-for-ra-servers-exit-in-delete_cluster branch on July 26, 2024 at 08:17