You can add two vGPUs for RKE2 node on Harvester even though they won't start #10947

Closed
noahgildersleeve opened this issue May 4, 2024 · 2 comments · Fixed by #10991

Comments

@noahgildersleeve
Contributor

Setup

  • Rancher version: v2.8-head
  • Rancher UI Extensions:
  • Browser type & version: Chrome

Describe the bug

You can add two vGPU devices to an RKE2 cluster that uses Harvester as the provider. The VMs won't start because the YAML ramFB fix is required.
Rancher also stays in the starting state and never moves to an error state. The failure keeps looping because the error is not reported back.

To Reproduce

  1. Create a new RKE2 cluster with Harvester as the provider
  2. Add two vGPUs to the cluster
  3. Create the cluster
  4. Check the status of the VMs after creation

Result

The VMs won't start

Expected Result

There are a few possible fixes:

  • The VMs should start and the second vGPU should have the ramFB fix
  • The Add button should be disabled after adding a vGPU
  • We could add a link to the

Screenshots

Greenshot 2024-05-03 17 55 22

Additional context

Currently this is allowed in the Harvester UI, but you have to do the YAML fix after creation, or possibly via YAML during creation.

@github-actions github-actions bot added the QA/dev-automation label (Issues that engineers have written automation around so QA doesn't have to look at this) May 4, 2024
@torchiaf
Member

torchiaf commented May 6, 2024

We need to confirm that we don't have a unique key problem. I'm fairly sure it's just a label issue, and that we should show the node id + profile id.
For the YAML ramFB fix, we should clarify what the logic for disabling the buttons would be.
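
To make the label idea concrete, here is a minimal sketch of building a label from node id + profile id and checking whether any labels collide. The `VGpuOption` shape and the function names are hypothetical, chosen for illustration only; they are not taken from the Rancher or Harvester code.

```typescript
// Hypothetical shape of a selectable vGPU entry (assumed field names).
interface VGpuOption {
  nodeId: string;
  profileId: string;
}

// Build a label that tells apart the same profile offered by different nodes.
function vgpuLabel(opt: VGpuOption): string {
  return `${opt.nodeId} (${opt.profileId})`;
}

// Return every label that occurs more than once, which would point to
// a real unique key problem rather than a display-only issue.
function duplicateLabels(opts: VGpuOption[]): string[] {
  const counts: Record<string, number> = {};
  for (const opt of opts) {
    const label = vgpuLabel(opt);
    counts[label] = (counts[label] || 0) + 1;
  }
  return Object.keys(counts).filter((label) => counts[label] > 1);
}
```

If `duplicateLabels` returns an empty list for the real data, the entries are distinguishable once both ids are shown, which would support the "just a label issue" reading.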

@noahgildersleeve
Contributor Author

noahgildersleeve commented May 8, 2024

Something to be aware of: if you edit the YAML to apply the ramFB fix and the node driver later updates the node, the update will overwrite the fix. This happens every time the node is redeployed, for example when the node goes into an error, unresponsive, or other state. It also happens when you change the count.

For a quick fix I would suggest just disabling the add button after adding one vGPU profile.
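
A rough sketch of that quick fix, assuming the UI tracks the selected vGPU profiles as a list. `MAX_VGPUS` and `canAddVgpu` are hypothetical names, not existing Rancher code, and the limit of one is the assumption suggested above.

```typescript
// Assumed limit: one vGPU profile per node until the ramFB fix is applied automatically.
const MAX_VGPUS = 1;

// Returns true while another vGPU profile may still be added.
// The Add button would be disabled whenever this returns false.
function canAddVgpu(selected: string[], max: number = MAX_VGPUS): boolean {
  return selected.length < max;
}
```

Keeping the limit as a parameter means it can be raised later if the second vGPU gets the ramFB fix by default.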

4 participants