-
Notifications
You must be signed in to change notification settings - Fork 202
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
intel device plugins gpu : failed to call webhook, context deadline exceeded #1658
Comments
Hi @Llyr95 The webhook takes some time to get up, so if you try to install the CR "too soon" it may fail. Or maybe the webhook part of the operator is misbehaving. Can you check if the controller-manager pod is fully up and running? If it isn't, can you share the logs: |
Hi @tkatila, Thank you for answering Here is the logs you asked for :
|
Thanks, the logs seem ok. If you try to re-apply the GPU CR, does it still fail? |
I have tried to reinstall the gpu plugin but I don't understand about the CR, from my testing, the custom resources definitions are installed with the operator helm charts. I install the gpu plugin helm charts after. So how can I create the custom resources definitions of the operator before it creates the Webhook ? Or there is something I don't understand I have tried running the daemonset via |
You can't really change the creation order. The operator chart creates the CRDs and the GPU plugin chart initiates a CR. The reason why I asked about the re-creation is timing. The operator helm chart installs the CRDs and the operator Pod. But helm doesn't (unless asked) wait for the Pods to become available. The webhook especially takes some time to come up and if the GPU CR is deployed during that time, it will fail. With helm cli, if you install the operator and the gpu back-to-back: The fix for this is to tell the helm cli to wait for the deployment (--wait): I'm not familiar with flux so I don't know which way it functions. Another thing to try is, when the GPU CR part has failed, wait a few seconds and try to create the GPU CR from the device plugins project: If the creation succeeds, then the underlying issue is about timing. If it still fails, it's something related to the environment which requires more debug. |
I'm not sure whether it's relevant here (Tuomas?), but Helm tool supports only (initial) CRD install, not CRD upgrades. AFAIK proper upgrade of changed CRDs would require them to be removed (manually) before using e.g. Helm to install new ones... (Helm project has lot of tickets about that, and a long document about the corner-cases that are the reason why Helm tool chooses not to support CRD removal/upgrades.) |
The CRD install doesn't seem to be the issue. The failure would be different. |
Ok so I did some testing I tried to install the GPU CR manually like @tkatila said and I had the same error After removing everything, I tested At one point, I tried your first command to check ( I didn't understand why so I figured it was because I forgot the version (I am on k8s v1.28.9) but it didn't make sense because if it would not work, it would be doing versions 0.29.0 on k8s 1.28.9 and not the other way around. Afterwards, I found out that if I installed the operator in the Thank you very much for your help! |
Good that you got it working! I don't understand why 'system' ns would cause the webhook to break. We typically use 'intel' or 'inteldeviceplugins' ns without issues. As you are using Talos, have you decreased the pod-security for the default namespace? I recall that Talos has quite strict pod-security settings by default that can cause the issues with Pods not running. I wouldn't be surprised if there were some network access limitations as well. |
In intel/helm-charts#46 we showed that the namespace does not matter. Anyway, closing. |
Describe the support request
I am trying to install the intel device plugins gpu helm after installing the operator helm chart. It fails with
Helm install failed for release system/intel-device-plugin-gpu with chart [email protected]: 1 error occurred: * Internal error occurred: failed calling webhook "mgpudeviceplugin.kb.io": failed to call webhook: Post "https://inteldeviceplugins-webhook-service.system.svc:443/mutate-deviceplugin-intel-com-v1-gpudeviceplugin?timeout=10s": context deadline exceeded
The helm chart is installed as a helmrelease via flux.
Thanks for you help
System (please complete the following information if applicable):
The text was updated successfully, but these errors were encountered: