[BUG] Scaling down etcd machine pool can cause multiple machines to be deleted unintentionally #42582
Looked over the CAPI code and have some follow-up questions/points:
It seems that a workaround could be to change to the "Random" deletion policy instead of "Oldest". In our case we could probably do that with a Kyverno policy, but will the Rancher controller try to change it back?
@sennerholm Unfortunately that is not a viable workaround, as Rancher will set it back, as you suspect.
OK, we as customers asked Rancher support for possible workarounds in a ticket some days ago. It would be good to have some kind of workaround in place; last I checked the upstream ticket, there was no update on it.
The most effective workaround/mitigation for this issue is to create multiple machine pools with a quantity of 1 each, and thus not actually use machine pool scaling. For example, if you need 3 etcd nodes, you would create 3 machine pool entries, one for each node. cluster-api ordinarily expects the control plane provider (in this case, CAPR) to handle creation/manipulation of the Machine objects, but v2prov/CAPR uses machine deployments for this and ends up hitting cluster-api bugs as a result.
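As a rough sketch of what this workaround looks like in a `provisioning.cattle.io/v1` Cluster spec (the cluster name, pool names, and `machineConfigRef` are hypothetical placeholders; only the fields shown are relevant to the workaround):

```yaml
apiVersion: provisioning.cattle.io/v1
kind: Cluster
metadata:
  name: my-cluster          # hypothetical cluster name
  namespace: fleet-default
spec:
  rkeConfig:
    machinePools:
      # Three etcd pools of quantity 1 instead of one pool of quantity 3,
      # so machine pool scaling is never exercised for etcd nodes.
      - name: etcd-1
        quantity: 1
        etcdRole: true
        machineConfigRef:    # hypothetical machine config reference
          kind: VmwarevsphereConfig
          name: my-cluster-etcd
      - name: etcd-2
        quantity: 1
        etcdRole: true
        machineConfigRef:
          kind: VmwarevsphereConfig
          name: my-cluster-etcd
      - name: etcd-3
        quantity: 1
        etcdRole: true
        machineConfigRef:
          kind: VmwarevsphereConfig
          name: my-cluster-etcd
```

To remove an etcd node with this layout, you delete one whole pool entry rather than lowering a pool's quantity, which avoids the scale-down path entirely.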
Moving to "Blocked" as the proper fix will be addressing kubernetes-sigs/cluster-api#9334 - luckily there has been some movement on this quite recently, so there is a chance we may get a fix in CAPI. Also updating the milestone to circle back on this for the next minor release, as it will require bumping CAPI.
Should be unblocked once #45090 is done. |
Should be testable on |
rancher/charts#3901 got merged so this is ready to test. |
Ticket #42582 - Test Results - ✅
Verified with (HA Helm or Docker) on Rancher
@snasovich will this fix be backported to 2.8.x? Given that the version is still supported. |
@bk201 , that's a "no" unfortunately:
For 2.9.0 we've implemented CAPI version pinning, so bumping CAPI won't be an issue going forward, but unfortunately 2.8.x will be stuck on CAPI 1.4.x, which has not received this fix.
@snasovich Thanks for the detailed information. |
Rancher Server Setup
Information about the Cluster
Describe the bug
v2prov currently creates all machine deployments with a machine set deletion policy of Oldest. When scaling down a machine deployment, the oldest node is deleted, which could potentially be the init node, i.e. the node that is considered the leader for purposes of etcd and control plane joining. If multiple machines are created at roughly the same time, the machine which comes lexicographically first will usually become the init node. When the machine set scales down, if the oldest node was the init node, the newly elected init node will have to restart, as its previous server-url flag pointed at the old node and will be removed. During this time, the node may become unhealthy, and a controller runs in CAPI which copies the status of the v1 Node in the downstream cluster to the CAPI Machine object as the NodeHealthy condition. When the machine set controller reconciles to determine which machines to delete, already-deleting machines and unhealthy machines are sorted with the same priority, and to break this tie the lexicographically first machine is chosen. This can cause multiple machines to be deleted if machines are named in lexicographical order in accordance with their age, and the controller runs when the node becomes unhealthy.
To Reproduce
For a single node cluster, scale to 2 nodes then back to 1.
For a three node cluster, scale to 3 nodes then back to 2.
Repeat as many times as needed until the issue reproduces. While it is theoretically possible for more than 2 nodes to be deleted, I found it extremely unlikely.
Result
Multiple nodes are removed.
Expected Result
1 node is removed.
Additional context
This issue was discovered during testing of harvester/harvester#4358, and determined to be the root cause.
This is a bug in CAPI and is currently being tracked here: kubernetes-sigs/cluster-api#9334
Release Note for v2.7.6 / later releases as needed
There is a known issue, caused by an upstream cluster-api bug, with etcd node scale-down operations on K3s/RKE2 machine-provisioned clusters. It is possible for the cluster-api core controllers to delete more than the desired quantity of etcd nodes when reconciling an RKE Machine Pool (see kubernetes-sigs/cluster-api#9334 for the upstream issue). As such, it is not recommended to scale down etcd nodes, as this may inadvertently delete all etcd nodes in the pool. As always, we recommend that you have a robust backup strategy and store your etcd snapshots in a safe location. (SURE-7042)