
Provisioning V2 / RKEv2 does not work with third party node drivers #37074

Open
maxaudron opened this issue Mar 28, 2022 · 18 comments
Labels
area/provisioning-v2 (Provisioning issues that are specific to the provisioningv2 generating framework)
feature/node-drivers
kind/bug (Issues that are defects reported by users or that we know have reached a real release)
release-note (Note this issue in the milestone's release notes)
team/hostbusters (The team that is responsible for provisioning/managing downstream clusters + K8s version support)
workaround-available

Comments

@maxaudron

maxaudron commented Mar 28, 2022

Rancher Server Setup

  • Rancher version: 2.6.3
  • Installation option (Docker install/Helm Chart): docker

Information about the Cluster

  • Kubernetes version: v1.21.9+rke2r1
  • Cluster Type (Local/Downstream): Infrastructure using third party node driver

User Information

  • What is the role of the user logged in?: Admin

Describe the bug

When trying to provision an RKE2 cluster with a third party node
driver (one that isn't a builtin), provisioning fails.

Third party node drivers added to Rancher get a randomly assigned name as their
Kubernetes resource name:

harvester       186d
linode          312d
nd-rjl9k        44m

But RKE2 provisioning assumes the driver's name when trying to provision machines, leading to an error.
The error also isn't surfaced properly in the UI; the machines are only
described as Waiting to schedule machine create.

  status:
    conditions:
    - message: nodedrivers.management.cattle.io "nutanix" not found
      reason: Error
      status: "False"
      type: CreateJob
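
A sketch of how to surface this condition from the CLI, since the UI hides it. The resource group and namespace below are assumptions based on how v2prov names its generated infrastructure machine CRDs (rke-machine.cattle.io) and where it places machines for provisioned clusters (fleet-default); adjust to your installation:

# Sketch only: list the generated infrastructure machines and pull the
# CreateJob condition message, which is where the
# 'nodedrivers.management.cattle.io "<name>" not found' error lands.
kubectl get nutanixmachines.rke-machine.cattle.io -n fleet-default \
  -o jsonpath='{range .items[*]}{.metadata.name}{": "}{.status.conditions[?(@.type=="CreateJob")].message}{"\n"}{end}'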

I assume this piece of code is responsible: https://github.com/rancher/rancher/blob/release/v2.6/pkg/controllers/provisioningv2/rke2/machineprovision/args.go#L298

func getNodeDriverName(typeMeta meta.Type) string {
	return strings.ToLower(strings.TrimSuffix(typeMeta.GetKind(), "Machine"))
}

This simply takes the Kind of the machine CRD, NutanixMachine in my case, trims the "Machine" suffix, and lowercases the result.
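
A quick way to see the mismatch from the CLI (a sketch; the field path assumes the management.cattle.io/v3 NodeDriver schema):

# The controller looks up a NodeDriver named after the lowercased Kind
# ("nutanix"), while the resource that actually exists carries the random
# nd-xxxxx name. Compare the two columns:
kubectl get nodedrivers.management.cattle.io \
  -o custom-columns=NAME:.metadata.name,DISPLAYNAME:.spec.displayName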

To Reproduce

  1. Add a third party node driver
  2. Spawn a rke2 cluster

Result
The cluster is stuck in provisioning, and the machines only show
Waiting to schedule machine create as their status.

Expected Result
The cluster provisions successfully.

Workaround
Create the NodeDriver manually in the backing Kubernetes cluster with the correct name.
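
A minimal sketch of that manual step (field names are assumptions based on the management.cattle.io/v3 NodeDriver CRD; the url value is a placeholder for the real driver binary, and metadata.name must match what Rancher derives from the machine Kind, "nutanix" for NutanixMachine):

# Sketch only: apply a NodeDriver whose metadata.name matches the name
# v2prov derives from the machine Kind. url is a placeholder and must
# point at the actual docker-machine driver binary.
cat <<'EOF' | kubectl apply -f -
apiVersion: management.cattle.io/v3
kind: NodeDriver
metadata:
  name: nutanix
spec:
  active: true
  builtin: false
  displayName: nutanix
  url: https://example.com/docker-machine-driver-nutanix
EOF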

@github-actions
Contributor

This repository uses an automated workflow to automatically label issues which have not had any activity (commit/comment/label) for 60 days. This helps us manage the community issues better. If the issue is still relevant, please add a comment to the issue so the workflow can remove the label and we know it is still valid. If it is no longer relevant (or possibly fixed in the latest release), the workflow will automatically close the issue in 14 days. Thank you for your contributions.

@maxaudron
Author

still a problem

@github-actions
Contributor

This repository uses an automated workflow to automatically label issues which have not had any activity (commit/comment/label) for 60 days. This helps us manage the community issues better. If the issue is still relevant, please add a comment to the issue so the workflow can remove the label and we know it is still valid. If it is no longer relevant (or possibly fixed in the latest release), the workflow will automatically close the issue in 14 days. Thank you for your contributions.

@maxaudron
Author

still a problem

@github-actions
Contributor

This repository uses an automated workflow to automatically label issues which have not had any activity (commit/comment/label) for 60 days. This helps us manage the community issues better. If the issue is still relevant, please add a comment to the issue so the workflow can remove the label and we know it is still valid. If it is no longer relevant (or possibly fixed in the latest release), the workflow will automatically close the issue in 14 days. Thank you for your contributions.

@avirtopeanu-ionos

We're having the same issue here with our custom node driver. @maxaudron, did you end up fixing this issue somehow?

@naibaf0

naibaf0 commented Apr 4, 2023

I work with @maxaudron. Unfortunately, we haven't fixed the issue yet.

@sarahhenkens

still a problem

@sarahhenkens

Can we re-open this ticket? It's not possible to provision RKE2 clusters with the Nutanix node driver.

@brandond brandond reopened this Apr 28, 2023
@brandond brandond added the feature/node-drivers, kind/bug, area/provisioning-v2, and team/hostbusters labels and removed the status/stale label Apr 28, 2023
@brandond
Contributor

Workaround
Create the NodeDriver manually in the backing Kubernetes cluster with the correct name.

Need to fix the v2prov assumption that NodeDriver resources will have a specific name; or fix how 3rd party node drivers are named when installed.

@azzaka

azzaka commented Jul 25, 2023

Hopefully this is still being worked on. It is currently a blocker for rolling out Rancher across our estate.

@jakefhyde
Contributor

jakefhyde commented Aug 3, 2023

This bug is caused by the fact that node drivers were initially designed to be decoupled from the name with RKE1; however, with v2prov (and specifically CAPI), CRDs are required (which was correctly pointed out here: #37074 (comment)). The linked code https://github.com/rancher/rancher/blob/release/v2.6/pkg/controllers/provisioningv2/rke2/machineprovision/args.go#L298 is responsible, but the true underlying culprit comes from here: https://github.com/rancher/rancher/blob/e5cc549591fbdf6aec91915b83384cd78b56f769/pkg/controllers/management/drivers/nodedriver/machine_driver.go#L224C54-L224C54. That piece of code uses the displayName of the node driver object, which is not settable at creation time from the UI. Additionally, there is no validation in place to prevent multiple node drivers from using the same displayName, which will cause the dynamic schema to thrash and potentially cause data loss, or to prevent changing the displayName, which would also result in data loss. Although one can set the displayName manually, this is not a suitable long term solution.

A potential long term solution would be for the backend to use the k8s metadata name (which corresponds to the Norman id); however, the UI uses Norman, and I was not able to create a node driver while specifying the id in a POST request using curl. This requires input from @rancher/rancher-team-1-neo-dev as to whether it is possible from within Norman to specify the id in the request. There is no way to remove the Rancher requirement that the names be unique, due to the generated CRDs, and validating that all node drivers have a different display name would not be a suitable alternative to simply using the name of the corresponding NodeDriver CR. cc @gaktive
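
For context, the generated CRDs that make the uniqueness requirement unavoidable can be listed directly (a sketch; the group names are assumptions based on how v2prov names its generated machine and machine-config CRDs):

# Each node driver displayName maps to a pair of generated CRDs; duplicate
# or changed displayNames would make these thrash, as described above.
kubectl get crds -o name | grep -E 'rke-machine(-config)?\.cattle\.io$'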

Workaround

The script below outlines a workaround, assuming one has already encountered the issue when attempting to create a third party driver in the Rancher UI with the correct URL. The node driver should be inactive before running this script, as deactivation causes CRs to be cleaned up.

(export NAME="<DRIVER NAME (must be [a-z]*)>" NODEDRIVER="<DRIVER ID (e.g. nd-12345)>"; kubectl get nodedriver "${NODEDRIVER}" -o yaml | yq 'del(.status) | .metadata |= with_entries(select(.key == "annotations")) | .metadata.annotations |= with_entries(select(.key == "publicCredentialFields" or .key == "privateCredentialFields"))' | yq ".metadata.name = strenv(NAME)" | yq ".spec.displayName = .metadata.name")

After this, the original node driver (with prefix nd-xxxxx) can and should be deleted, as the two node drivers will thrash if both are activated, fighting for ownership of the dynamic schema object, which in turn creates the CAPI infrastructure machine and machine template CRDs.

These are the minimum required fields to create a node driver. Once this YAML is retrieved, it can be piped to kubectl apply, and the corresponding node driver will be created.
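
For example, the whole thing can be run in one go (a sketch; the NAME and NODEDRIVER values are placeholders taken from this issue, and the driver should already be deactivated as noted above):

# Same pipeline as above, with placeholder values and a final pipe to
# kubectl apply; deactivate the nd-xxxxx driver first, and delete it once
# the renamed driver is active (see the notes above).
(export NAME="nutanix" NODEDRIVER="nd-rjl9k"; \
  kubectl get nodedriver "${NODEDRIVER}" -o yaml \
  | yq 'del(.status) | .metadata |= with_entries(select(.key == "annotations")) | .metadata.annotations |= with_entries(select(.key == "publicCredentialFields" or .key == "privateCredentialFields"))' \
  | yq ".metadata.name = strenv(NAME)" \
  | yq ".spec.displayName = .metadata.name") \
| kubectl apply -f -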

Note: you cannot use this script to create a node driver with a name or displayName identical to another's; it won't work, as node drivers are backed by k8s CRs.

@gaktive
Member

gaktive commented Aug 4, 2023

@jakefhyde can you file a ticket in rancher/dashboard and link back here?

@snasovich snasovich modified the milestones: 2023-Q4-v2.8x, 2024-Q1-v2.8x Aug 9, 2023
@snasovich snasovich added release-note Note this issue in the milestone's release notes [zube]: Release Note labels Aug 9, 2023
@snasovich
Collaborator

@rancher/docs, FYI: moved this to "Release Note" status as we would want to include the workaround #37074 (comment) in the next release notes, not specifically the 2023-Q4/2024-Q1 releases.

@jakefhyde
Contributor

@gaktive Holding off on creating the dashboard ticket for now; it may require some additional work.

@martyav
Contributor

martyav commented Aug 10, 2023

@snasovich do you mean the emergency release we're currently working on, or the one after?

@snasovich
Collaborator

@martyav, any upcoming release. It won't hurt to put it in the out-of-band release you mentioned, but we will want to retain it in the Q3/v2.7-Next release notes as well, since this won't be fixed there.

@snasovich
Collaborator

Removing the milestone from this issue, as it's unlikely we will get to it soon, especially given that a workaround exists.

@snasovich snasovich removed this from the v2.8-Next1 milestone Mar 8, 2024