[BUG] Scaling down etcd machine pool can cause multiple machines to be deleted unintentionally #42582
Looked over the CAPI code and have some follow-up questions/points:
It seems that a workaround could be to change to the "Random" deletion policy instead of "Oldest". In our case we could probably do that with a Kyverno policy, but will the Rancher controller try to change it back?
@sennerholm Unfortunately that is not a viable workaround, as Rancher will set it back, as you suspect.
OK, we as customers asked Rancher support for possible workarounds in a ticket some days ago. It would be good to have some kind of workaround in place; last I checked the upstream ticket, there was no update on it.
The most effective workaround/mitigation for this issue is to create multiple machine pools with a quantity of 1 each, and thus not actually use machine pool scaling. For example, if you need 3 etcd nodes, you would create 3 machine pool entries, one for each node. cluster-api ordinarily expects the control plane provider (in this case, CAPR) to handle creation/manipulation of the Machine objects, but v2prov/CAPR uses machine deployments for this and ends up hitting cluster-api bugs as a result.
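As a rough sketch of what this workaround looks like in a `provisioning.cattle.io/v1` Cluster spec (the cluster name, pool names, and `machineConfigRef` are hypothetical placeholders; only the fields shown are relevant to the workaround):

```yaml
apiVersion: provisioning.cattle.io/v1
kind: Cluster
metadata:
  name: my-cluster          # hypothetical cluster name
  namespace: fleet-default
spec:
  rkeConfig:
    machinePools:
      # Three etcd pools of quantity 1 instead of one pool of quantity 3,
      # so machine pool scaling is never exercised for etcd nodes.
      - name: etcd-1
        quantity: 1
        etcdRole: true
        machineConfigRef:    # hypothetical machine config reference
          kind: VmwarevsphereConfig
          name: my-cluster-etcd
      - name: etcd-2
        quantity: 1
        etcdRole: true
        machineConfigRef:
          kind: VmwarevsphereConfig
          name: my-cluster-etcd
      - name: etcd-3
        quantity: 1
        etcdRole: true
        machineConfigRef:
          kind: VmwarevsphereConfig
          name: my-cluster-etcd
```

To remove an etcd node with this layout, you delete one whole pool entry rather than lowering a pool's quantity, which avoids the scale-down path entirely.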
Moving to "Blocked" as the proper fix will be addressing kubernetes-sigs/cluster-api#9334 - luckily there has been some movement on this quite recently, so there is a chance we may get a fix in CAPI. Also updating the milestone to circle back on this for the next minor release, as it will require bumping CAPI.
Should be unblocked once #45090 is done. |
Should be testable on |
rancher/charts#3901 got merged so this is ready to test. |
Ticket #42582 - Test Results - ✅
Verified with (HA Helm or Docker) on Rancher
@snasovich will this fix be backported to 2.8.x? Given that the version is still supported. |
@bk201 , that's a "no" unfortunately:
For 2.9.0 we've implemented CAPI version pinning, so bumping CAPI won't be an issue going forward, but unfortunately 2.8.x will be stuck on CAPI 1.4.x, which has not received this fix.
@snasovich Thanks for the detailed information. |
Rancher Server Setup
Information about the Cluster
Describe the bug
v2prov currently creates all machine deployments with a machine set deletion policy of Oldest. When scaling down a machine deployment, the oldest node is deleted, which could potentially be the init node, i.e. the node that is considered the leader for purposes of etcd and control plane joining. If multiple machines are created at roughly the same time, the machine which comes lexicographically first will usually become the init node. When the machine set scales down, if the oldest node was the init node, the newly elected init node will have to restart, as its previous server-url flag pointed at the old node and will be removed. During this time, the node may become unhealthy, and a controller runs in CAPI which copies the status of the v1 Node in the downstream cluster to the CAPI Machine object as the NodeHealthy condition. When the machine set controller reconciles to determine which machines to delete, already-deleting machines and unhealthy machines are sorted with the same priority, and to break this tie the lexicographically first machine is chosen. This can cause multiple machines to be deleted if machines are named in lexicographical order in accordance with their age, and the controller runs when the node becomes unhealthy.
To Reproduce
For a single node cluster, scale to 2 nodes then back to 1.
For a three node cluster, scale to 3 nodes then back to 2.
Repeat as many times as needed until the issue reproduces. While it is theoretically possible for more than 2 nodes to be deleted, I found it extremely unlikely.
Result
Multiple nodes are removed.
Expected Result
1 node is removed.
Additional context
This issue was discovered during testing of harvester/harvester#4358, and determined to be the root cause.
This is a bug in CAPI and is currently being tracked here: kubernetes-sigs/cluster-api#9334
Release Note for v2.7.6 / later releases as needed
There is a known issue, caused by an upstream cluster-api bug, with etcd node scale-down operations on K3s/RKE2 machine-provisioned clusters. It is possible for the cluster-api core controllers to delete more than the desired quantity of etcd nodes when reconciling an RKE Machine Pool (see kubernetes-sigs/cluster-api#9334 for the upstream issue). As such, it is not recommended to scale down etcd nodes, as this may inadvertently delete all etcd nodes in the pool. As always, we recommend that you have a robust backup strategy and store your etcd snapshots in a safe location. (SURE-7042)