Wait for Ra servers exit in delete_cluster/{1,2} #270

Closed
wants to merge 2 commits

Conversation

@dumbbell (Member) commented Apr 7, 2022

So far, when the `delete_cluster/{1,2}` function returned, it meant that the request to stop the cluster was committed to the Ra log and applied. However, the actual stop of the Ra servers and the related supervision tree was handled asynchronously.

This patch uses Erlang monitors to watch the exit of the Ra servers. This way, we ensure that when the function returns, the Ra servers are really gone.
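
For illustration, the waiting logic boils down to something like the following sketch (a minimal, simplified version, not the exact code from this patch; it assumes `ServerIds` is the list of `{Name, Node}` Ra server ids, which `erlang:monitor/2` accepts directly, and the 30-second timeout is arbitrary):

wait_for_servers_exit(ServerIds) ->
    %% Monitor every Ra server process, then wait for one 'DOWN' message
    %% per monitor reference. A server that is already gone still delivers
    %% a 'DOWN' message (with reason noproc), so this doesn't deadlock.
    Refs = [erlang:monitor(process, ServerId) || ServerId <- ServerIds],
    lists:foreach(
      fun(Ref) ->
          receive
              {'DOWN', Ref, process, _Pid, _Reason} -> ok
          after 30000 ->
              erlang:error(timeout_waiting_for_ra_servers)
          end
      end, Refs).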

@dumbbell force-pushed the wait-for-ra-servers-exit-in-delete_cluster branch from 22e24e8 to 864e7dc on April 7, 2022 at 13:53
@dumbbell marked this pull request as ready for review on April 7, 2022 at 16:09
@kjnilsson (Contributor)

I'm thinking this should be an option rather than the default but happy to discuss that further.

I'm not super sure making Ra cluster deletions a log command was ever strictly necessary in the first place.

@michaelklishin (Member)

It can be an option but I think most would expect it to be on by default.

@dumbbell (Member, Author) commented Apr 15, 2022

I added a `wait_for_servers_exit_in_delete_cluster` application environment variable to toggle the new behavior I propose, plus a `delete_cluster/3` function to let the caller choose the behavior in the arguments.

I forgot to ask before: do you prefer an application environment variable, a function argument or both for this kind of setting?

In fact, I'm not so sure about the behavior of the main branch and what this patch should do. Currently, the function blocks until the command is applied, but the servers' exit is asynchronous. If we introduce an option to toggle the wait for the servers' exit, does it mean the non-wait behavior should treat the Ra command itself as asynchronous too? As a user, if there is an option to wait or not, I would expect the function to be either entirely synchronous (including the servers' exit) or entirely asynchronous (not waiting for the Ra command either). Providing three states (synchronous, synchronous for the command only, or entirely asynchronous) seems too complicated for such a function.
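
For illustration, resolving such a setting from both a function argument and the application environment could look like this hypothetical helper (not code from this PR; `application:get_env/3` is the only real API used, the names and shape are assumptions):

wait_for_servers_exit_wanted(default) ->
    %% No explicit choice from the caller: fall back to the application
    %% environment variable, defaulting to the historical behavior (false).
    application:get_env(ra, wait_for_servers_exit_in_delete_cluster, false);
wait_for_servers_exit_wanted(Wait) when is_boolean(Wait) ->
    %% An explicit boolean argument passed to delete_cluster/3 wins.
    Wait.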

The caller can pass a flag to `delete_cluster/3` to indicate whether they want
the function to wait for the servers to exit. They can also set the
application environment variable
`wait_for_servers_exit_in_delete_cluster` to define the global behavior.

The default behavior is the same as the historical one, meaning the
function doesn't wait for the servers to exit.
@dumbbell force-pushed the wait-for-ra-servers-exit-in-delete_cluster branch from c987f3e to 044849c on April 15, 2022 at 11:37
@michaelklishin (Member)

A function option sounds sufficient to me. I would rename `wait_for_servers_exit_in_delete_cluster` to `wait_for_server_termination_when_deleting_clusters` if we are to keep it.

@kjnilsson (Contributor) commented Apr 16, 2022 via email

@dumbbell (Member, Author)

I should have added more context to the issue's description.

Khepri's testsuite did the following:

_ = ra:delete_cluster(ServerIds),
_ = supervisor:terminate_child(ra_systems_sup, RaSystem),
_ = supervisor:delete_child(ra_systems_sup, RaSystem),

I tried replacing the supervisor calls with a call to `application:stop(ra)`. In the end, I just want any Ra state to go away so I can start a fresh cluster for the next testcase.

In both cases, the supervisor logs an error with Erlang 23:

=SUPERVISOR REPORT==== 7-Apr-2022::08:37:55.154657 ===
    supervisor: {<0.2764.0>,ra_server_sup}
    errorContext: shutdown_error
    reason: {shutdown,delete}
    offender: [{pid,<0.2765.0>},
               {id,async_unset_in_put_test_},
               {mfargs,
                   {ra_server_proc,start_link,
...

(Erlang 24's supervisor kind of ignores the situation and doesn't log anything)

The workaround I use in Khepri's testsuite is what I put in this patch (monitor and wait for the Ra servers to exit). To me, it makes sense that when `ra:delete_cluster/2` returns, the cluster is gone (including its processes).

I don't know what should happen if a cluster member is down. As you said, perhaps this shouldn't go through an applied command. Perhaps processes should simply delete their state (memory and disk) and terminate.

On the other hand, I don't know if the current behavior is incorrect from a supervision tree point of view: Erlang 23 did log an error, but Erlang 24 doesn't anymore. Perhaps it's not worth the effort. I can keep the workaround in Khepri's testsuite and prepare another patch to add more details to the `delete_cluster/2` documentation.

@dumbbell (Member, Author)

I'm closing this pull request as there is probably no demand for this, and the Khepri testsuite has worked fine with the current behavior for quite some time now.

@dumbbell closed this on Jul 26, 2024
@dumbbell deleted the wait-for-ra-servers-exit-in-delete_cluster branch on July 26, 2024 at 08:17