
Provisioning V2 / RKEv2 does not work with third party node drivers #37074

Open
maxaudron opened this issue Mar 28, 2022 · 18 comments
Labels
area/provisioning-v2 (Provisioning issues that are specific to the provisioningv2 generating framework)
feature/node-drivers
kind/bug (Issues that are defects reported by users or that we know have reached a real release)
release-note (Note this issue in the milestone's release notes)
team/hostbusters (The team that is responsible for provisioning/managing downstream clusters + K8s version support)
workaround-available

Comments

@maxaudron

maxaudron commented Mar 28, 2022

Rancher Server Setup

  • Rancher version: 2.6.3
  • Installation option (Docker install/Helm Chart): docker

Information about the Cluster

  • Kubernetes version: v1.21.9+rke2r1
  • Cluster Type (Local/Downstream): Infrastructure using third party node driver

User Information

  • What is the role of the user logged in?: Admin

Describe the bug

When trying to provision an RKE2 cluster with a third party node
driver (one that isn't a builtin), provisioning fails.

Third party node drivers added to Rancher get a randomly assigned name as their
Kubernetes resource name:

harvester       186d
linode          312d
nd-rjl9k        44m

But RKE2 provisioning assumes the driver's name when trying to provision machines, leading to an error.
The error also isn't surfaced properly in the UI; the machines are only
described as Waiting to schedule machine create.

  status:
    conditions:
    - message: nodedrivers.management.cattle.io "nutanix" not found
      reason: Error
      status: "False"
      type: CreateJob
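
A sketch of how to surface this condition from the CLI, since the UI hides it. The resource group and namespace below are assumptions based on how v2prov names its generated infrastructure machine CRDs (rke-machine.cattle.io) and where it places machines for provisioned clusters (fleet-default); adjust to your installation:

# Sketch only: list the generated infrastructure machines and pull the
# CreateJob condition message, which is where the
# 'nodedrivers.management.cattle.io "<name>" not found' error lands.
kubectl get nutanixmachines.rke-machine.cattle.io -n fleet-default \
  -o jsonpath='{range .items[*]}{.metadata.name}{": "}{.status.conditions[?(@.type=="CreateJob")].message}{"\n"}{end}'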

I assume this piece of code is responsible: https://github.com/rancher/rancher/blob/release/v2.6/pkg/controllers/provisioningv2/rke2/machineprovision/args.go#L298

func getNodeDriverName(typeMeta meta.Type) string {
	return strings.ToLower(strings.TrimSuffix(typeMeta.GetKind(), "Machine"))
}

This simply takes the Kind of the machine CRD, NutanixMachine in my case, trims the "Machine" suffix, and lowercases the result.
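
A quick way to see the mismatch from the CLI (a sketch; the field path assumes the management.cattle.io/v3 NodeDriver schema):

# The controller looks up a NodeDriver named after the lowercased Kind
# ("nutanix"), while the resource that actually exists carries the random
# nd-xxxxx name. Compare the two columns:
kubectl get nodedrivers.management.cattle.io \
  -o custom-columns=NAME:.metadata.name,DISPLAYNAME:.spec.displayName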

To Reproduce

  1. Add a third party node driver
  2. Spawn a rke2 cluster

Result
The cluster is stuck in provisioning, and the machines only show
Waiting to schedule machine create as their status.

Expected Result
The cluster provisions successfully.

Workaround
Create the NodeDriver manually in the backing Kubernetes cluster with the correct name.
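
A minimal sketch of that manual step (field names are assumptions based on the management.cattle.io/v3 NodeDriver CRD; the url value is a placeholder for the real driver binary, and metadata.name must match what Rancher derives from the machine Kind, "nutanix" for NutanixMachine):

# Sketch only: apply a NodeDriver whose metadata.name matches the name
# v2prov derives from the machine Kind. url is a placeholder and must
# point at the actual docker-machine driver binary.
cat <<'EOF' | kubectl apply -f -
apiVersion: management.cattle.io/v3
kind: NodeDriver
metadata:
  name: nutanix
spec:
  active: true
  builtin: false
  displayName: nutanix
  url: https://example.com/docker-machine-driver-nutanix
EOF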

@github-actions
Contributor

This repository uses an automated workflow to automatically label issues which have not had any activity (commit/comment/label) for 60 days. This helps us manage the community issues better. If the issue is still relevant, please add a comment to the issue so the workflow can remove the label and we know it is still valid. If it is no longer relevant (or possibly fixed in the latest release), the workflow will automatically close the issue in 14 days. Thank you for your contributions.

@maxaudron
Author

still a problem

@github-actions
Contributor

This repository uses an automated workflow to automatically label issues which have not had any activity (commit/comment/label) for 60 days. This helps us manage the community issues better. If the issue is still relevant, please add a comment to the issue so the workflow can remove the label and we know it is still valid. If it is no longer relevant (or possibly fixed in the latest release), the workflow will automatically close the issue in 14 days. Thank you for your contributions.

@maxaudron
Author

still a problem

@github-actions
Contributor

This repository uses an automated workflow to automatically label issues which have not had any activity (commit/comment/label) for 60 days. This helps us manage the community issues better. If the issue is still relevant, please add a comment to the issue so the workflow can remove the label and we know it is still valid. If it is no longer relevant (or possibly fixed in the latest release), the workflow will automatically close the issue in 14 days. Thank you for your contributions.

@avirtopeanu-ionos

We're having the same issue here with our custom node driver. @maxaudron, did you end up fixing this issue somehow?

@naibaf0

naibaf0 commented Apr 4, 2023

I work with @maxaudron. Unfortunately, we haven't fixed the issue yet.

@sarahhenkens

still a problem

@sarahhenkens

Can we re-open this ticket? It's not possible to provision RKE2 clusters with the Nutanix node driver.

@brandond brandond reopened this Apr 28, 2023
@brandond brandond added the feature/node-drivers, kind/bug, area/provisioning-v2, and team/hostbusters labels and removed the status/stale label Apr 28, 2023
@brandond
Contributor

Workaround
Create the NodeDriver manually in the backing Kubernetes cluster with the correct name.

Need to fix the v2prov assumption that NodeDriver resources will have a specific name; or fix how 3rd party node drivers are named when installed.

@azzaka

azzaka commented Jul 25, 2023

Hopefully this is still being worked on. It is currently a blocker for rolling out Rancher across our estate.

@jakefhyde
Contributor

jakefhyde commented Aug 3, 2023

This bug is caused by the fact that node drivers were initially designed to be decoupled from the name with RKE1; however, with v2prov (and specifically CAPI), CRDs are required (which was correctly pointed out here: #37074 (comment)). The linked code https://github.com/rancher/rancher/blob/release/v2.6/pkg/controllers/provisioningv2/rke2/machineprovision/args.go#L298 is responsible, but the true underlying culprit comes from here: https://github.com/rancher/rancher/blob/e5cc549591fbdf6aec91915b83384cd78b56f769/pkg/controllers/management/drivers/nodedriver/machine_driver.go#L224C54-L224C54. That piece of code uses the displayName of the node driver object, which is not settable at creation time from the UI. Additionally, there is no validation in place to prevent multiple node drivers from using the same displayName, which will cause the dynamic schema to thrash and potentially cause data loss, or to prevent changing the displayName, which would also result in data loss. Although one can set the displayName manually, this is not a suitable long term solution.

A potential long term solution would be for the backend to use the k8s metadata name (which corresponds to the Norman id); however, the UI uses Norman, and I was not able to create a node driver while specifying the id in a POST request using curl. This requires input from @rancher/rancher-team-1-neo-dev as to whether it is possible from within Norman to specify the id in the request. There is no way to remove the Rancher requirement that the names be unique, due to the generated CRDs, and validating that all node drivers have a different display name would not be a suitable alternative to simply using the name of the corresponding NodeDriver CR. cc @gaktive
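
For context, the generated CRDs that make the uniqueness requirement unavoidable can be listed directly (a sketch; the group names are assumptions based on how v2prov names its generated machine and machine-config CRDs):

# Each node driver displayName maps to a pair of generated CRDs; duplicate
# or changed displayNames would make these thrash, as described above.
kubectl get crds -o name | grep -E 'rke-machine(-config)?\.cattle\.io$'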

Workaround

The script below outlines a workaround, assuming one has already encountered the issue when attempting to create a third party driver in the Rancher UI with the correct URL. The node driver should be inactive before running this script, as deactivation causes CRs to be cleaned up.

(export NAME="<DRIVER NAME (must be [a-z]*)>" NODEDRIVER="<DRIVER ID (e.g. nd-12345)>"; kubectl get nodedriver "${NODEDRIVER}" -o yaml | yq 'del(.status) | .metadata |= with_entries(select(.key == "annotations")) | .metadata.annotations |= with_entries(select(.key == "publicCredentialFields" or .key == "privateCredentialFields"))' | yq ".metadata.name = strenv(NAME)" | yq ".spec.displayName = .metadata.name")

After this, the original node driver (with prefix nd-xxxxx) can and should be deleted, as the two node drivers will thrash if both are activated, fighting for ownership of the dynamic schema object, which in turn creates the CAPI infrastructure machine and machine template CRDs.

These are the minimum required fields to create a node driver. Once this YAML is retrieved, it can be piped to kubectl apply, and the corresponding node driver will be created.
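
For example, the whole thing can be run in one go (a sketch; the NAME and NODEDRIVER values are placeholders taken from this issue, and the driver should already be deactivated as noted above):

# Same pipeline as above, with placeholder values and a final pipe to
# kubectl apply; deactivate the nd-xxxxx driver first, and delete it once
# the renamed driver is active (see the notes above).
(export NAME="nutanix" NODEDRIVER="nd-rjl9k"; \
  kubectl get nodedriver "${NODEDRIVER}" -o yaml \
  | yq 'del(.status) | .metadata |= with_entries(select(.key == "annotations")) | .metadata.annotations |= with_entries(select(.key == "publicCredentialFields" or .key == "privateCredentialFields"))' \
  | yq ".metadata.name = strenv(NAME)" \
  | yq ".spec.displayName = .metadata.name") \
| kubectl apply -f -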

Note: you cannot use this script to create a node driver with a name or displayName identical to another's; it won't work, as node drivers are backed by k8s CRs.

@gaktive
Member

gaktive commented Aug 4, 2023

@jakefhyde can you file a ticket in rancher/dashboard and link back here?

@snasovich snasovich modified the milestones: 2023-Q4-v2.8x, 2024-Q1-v2.8x Aug 9, 2023
@snasovich snasovich added release-note Note this issue in the milestone's release notes [zube]: Release Note labels Aug 9, 2023
@snasovich
Collaborator

@rancher/docs, FYI: moved this to "Release Note" status as we would want to include the workaround #37074 (comment) in the next release notes, not specifically the 2023-Q4/2024-Q1 releases.

@jakefhyde
Contributor

@gaktive Holding off on creating the dashboard ticket for now; it may require some additional work.

@martyav
Contributor

martyav commented Aug 10, 2023

@snasovich do you mean the emergency release we're currently working on, or the one after?

@snasovich
Collaborator

@martyav, any upcoming release. It won't hurt to put it in the out-of-band release you mentioned, but we will want to retain it in the Q3/v2.7-Next release notes as well, since this won't be fixed there.

@snasovich
Collaborator

Removing the milestone from this issue, as it's unlikely we will get to it soon, especially given that a workaround exists.

@snasovich snasovich removed this from the v2.8-Next1 milestone Mar 8, 2024