
CAPM3 v1.6.1 CAPI/CAPM3 machine name changed after rolling upgrade while nodeReuse set to True #1584

Closed
jparkash2 opened this issue Apr 8, 2024 · 9 comments
Labels
kind/bug Categorizes issue or PR as related to a bug.
lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale.
needs-triage Indicates an issue lacks a `triage/foo` label and requires one.
triage/needs-information Indicates an issue needs more information in order to work on it.

Comments

@jparkash2

What steps did you take and what happened:
Longhorn doesn't reuse old replicas after rolling upgrades and creates new replicas instead.

We conducted a rolling upgrade of CAPI with specific configuration: nodeReuse set to true and automatedCleaningMode set to disabled on the Metal3 side, and replica-replenishment-wait-interval set to 3600 on the Longhorn side. However, despite these settings, Longhorn did not use the existing data. Instead, it created a new replica copy from an existing copy after the replica-replenishment-wait-interval timeout, even though the node rejoined the Longhorn/Kubernetes cluster within that interval, albeit with a new name. This behaviour was unexpected and not in line with our testing expectations.
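
For context, this is roughly how those options are wired up on the Metal3 side; the resource names and omitted fields below are placeholders rather than our actual manifests, and field placement may vary slightly between CAPM3 releases:

```yaml
apiVersion: infrastructure.cluster.x-k8s.io/v1beta1
kind: Metal3MachineTemplate
metadata:
  name: workload-controlplane          # placeholder name
spec:
  nodeReuse: true                      # ask CAPM3 to reuse the same BareMetalHost during rolling upgrades
  template:
    spec:
      automatedCleaningMode: disabled  # skip disk cleaning so data on the storage disks survives
      # image, dataTemplate, etc. omitted
---
# Longhorn side (this can also be set through the Longhorn UI or Helm defaultSettings)
apiVersion: longhorn.io/v1beta2
kind: Setting
metadata:
  name: replica-replenishment-wait-interval
  namespace: longhorn-system
value: "3600"
```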

After the CAPI rolling upgrade, we made the following observations regarding Longhorn:

  1. Longhorn doesn't use the old replica.
  2. It creates a new replica after the expiration of replica-replenishment-wait-interval.
  3. During provisioning of the bare metal host, the CAPI machine/node name changed. Because of this, Longhorn considered it a new node and did not use the existing Longhorn replica.

Upon further investigation, we identified the root cause of the issue: during the rolling upgrade, the CAPI machine name changed, so the Kubernetes/Longhorn cluster treated the updated node as a new node rather than the existing one.

What did you expect to happen:
After the rolling upgrade, the Kubernetes cluster node name should not change, so that the existing data on disk can be used to rebuild the replicas instead of creating new copies of them.

  • Longhorn should not create a new replica and should reuse the existing (i.e. old) replica
  • Longhorn should rebuild the data using the existing (i.e. old) replica

Anything else you would like to add:
https://gitlab.com/sylva-projects/sylva-core/-/issues/1141

Environment:

  • Cluster-api version: v1.6.3
  • Cluster-api-provider-metal3 version: v1.6.1
  • Environment (metal3-dev-env or other): management cluster on CAPO and workload cluster on Baremetal
  • Kubernetes version (use `kubectl version`): 1.27.10

/kind bug

@metal3-io-bot metal3-io-bot added kind/bug Categorizes issue or PR as related to a bug. needs-triage Indicates an issue lacks a `triage/foo` label and requires one. labels Apr 8, 2024
@metal3-io-bot
Contributor

This issue is currently awaiting triage.
If Metal3.io contributors determine this is a relevant issue, they will accept it by applying the triage/accepted label and provide further guidance.
The triage/accepted label can be added by org members by writing /triage accepted in a comment.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@jparkash2 jparkash2 changed the title CAPM3 v1.6.1 Longhorn failed to use exiting data after rolling upgrade as machine name changed CAPM3 v1.6.1 Longhorn failed to use existing data after rolling upgrade as CAPI machine name changed Apr 8, 2024
@jparkash2 jparkash2 changed the title CAPM3 v1.6.1 Longhorn failed to use existing data after rolling upgrade as CAPI machine name changed CAPM3 v1.6.1 CAPI/CAPM3 machine name changed after rolling upgrade while nodeReuse set to True Apr 8, 2024
@hardys
Member

hardys commented Apr 8, 2024

Hi @jparkash2 thanks for the report!

I think for this to be actionable it needs to focus only on the expected/desired CAPM3 behavior, since handling of the disks by longhorn is outside the scope of the CAPM3 component.

You mentioned nodeReuse is set to true and automatedCleaningMode to disabled - did this result in the existing BMH getting reused during the upgrade, and was the data on the storage disk retained?

How the data is handled by Longhorn (or any other layered dependency) is not controlled by CAPM3, but I can see how replacing the Machine/Node CRs could potentially cause problems. So perhaps we can first determine whether CAPM3 is behaving unexpectedly, or whether it is working as designed but causing undesired side effects due to the way CAPI upgrades work (e.g. via Machine replacement)?

@Rozzii
Member

Rozzii commented Apr 10, 2024

/triage needs-information

@metal3-io-bot
Contributor

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues will close after an additional 30d of inactivity.

If this issue is safe to close now please do so with /close.

/lifecycle stale

@metal3-io-bot metal3-io-bot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Jul 9, 2024
@guettli

guettli commented Jul 9, 2024

@jparkash2 were you able to solve this issue? Is Longhorn able to reuse the data on a different disk after CAPI did a re-provisioning?

I think Longhorn is not able to do that, but maybe I am wrong. I created a feature request for Longhorn:

longhorn/longhorn#8962

@jparkash2
Author

@guettli Thanks, we can close this thread, as this was achieved by making changes on the CAPM3 side.

@guettli

guettli commented Jul 10, 2024

@jparkash2 how did you solve that? The issue at Longhorn is still open: longhorn/longhorn#8362

@hardys
Member

hardys commented Jul 10, 2024

We can cross-reference the related Sylva issues, which I think contain some details around how this was resolved.

I think this is not a CAPM3 issue, but steps must be taken to ensure that cleaning is disabled and the Node name is reused, which should perhaps be better documented somewhere.

@guettli

guettli commented Jul 10, 2024

@hardys we use constant node names. Unfortunately I don't see how to solve that.

The Longhorn Node resource contains the config that you set via the GUI. If Cluster API upgrades the node, the Kubernetes Node gets deleted, and the Longhorn Node object gets deleted too. At that point, configuration such as tags is lost.

We are thinking about this: a controller that syncs the Longhorn node config into a new CRD or a ConfigMap.

When the node gets created, we can provide this config via the Longhorn default config annotations, by updating the Longhorn Node resource, or by using the Longhorn Python client.
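
For the annotation route, a minimal sketch of what I mean (this assumes Longhorn's default disk/node configuration mechanism, i.e. the node.longhorn.io label and annotations together with the create-default-disk-labeled-nodes setting enabled; the node name, tags and disk path below are just placeholders):

```yaml
# Hypothetical: values a small controller could write back onto the freshly
# created Node (e.g. restored from a ConfigMap saved before the upgrade).
apiVersion: v1
kind: Node
metadata:
  name: worker-0   # new node created by the CAPI upgrade
  labels:
    node.longhorn.io/create-default-disk: "config"
  annotations:
    node.longhorn.io/default-node-tags: '["fast","storage"]'
    node.longhorn.io/default-disks-config: '[{"path":"/var/lib/longhorn","allowScheduling":true}]'
```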

All this is doable. I am just confused that we seem to be the first to automate that.

I like software development, but I am also happy if it's not needed :-)


How do you attach the data disk to Longhorn? Is a manual re-attach needed after CAPI upgraded the machine?
