
intel device plugins gpu : failed to call webhook, context deadline exceeded #1658

Closed

Llyr95 opened this issue Jan 28, 2024 · 10 comments

Llyr95 commented Jan 28, 2024

Describe the support request
I am trying to install the Intel device plugins GPU Helm chart after installing the operator Helm chart. It fails with:

Helm install failed for release system/intel-device-plugin-gpu with chart [email protected]: 1 error occurred: * Internal error occurred: failed calling webhook "mgpudeviceplugin.kb.io": failed to call webhook: Post "https://inteldeviceplugins-webhook-service.system.svc:443/mutate-deviceplugin-intel-com-v1-gpudeviceplugin?timeout=10s": context deadline exceeded

The helm chart is installed as a helmrelease via flux.

Thanks for your help.

System (please complete the following information if applicable):

  • OS version: Talos v1.6.1
  • Device plugins version: v0.28.0
  • Hardware info: 3 HP ProDesk Mini PCs (Gen 4)
tkatila (Contributor) commented Jan 29, 2024

Hi @Llyr95

The webhook takes some time to come up, so if you try to install the CR too soon, it may fail. Alternatively, the webhook part of the operator could be misbehaving.

Can you check if the controller-manager pod is fully up and running? If it isn't, can you share the logs:
kubectl logs -n <namespace> inteldeviceplugins-controller-manager-something-anything -c kube-rbac-proxy
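To confirm the operator is ready before applying the CR, something like the following should work (a sketch; the Deployment name is inferred from the pod name above and may differ in your install):

```shell
# Wait for the operator's controller-manager Deployment to report Available;
# replace <namespace> with the namespace the operator was installed into.
kubectl -n <namespace> wait deployment/inteldeviceplugins-controller-manager \
  --for=condition=Available --timeout=120s
```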

Llyr95 (Author) commented Jan 29, 2024

Hi @tkatila,

Thank you for answering.

Here are the logs you asked for:

I0128 16:55:26.266275       1 flags.go:64] FLAG: --add-dir-header="false"
I0128 16:55:26.266322       1 flags.go:64] FLAG: --allow-paths="[]"
I0128 16:55:26.266328       1 flags.go:64] FLAG: --alsologtostderr="false"
I0128 16:55:26.266331       1 flags.go:64] FLAG: --auth-header-fields-enabled="false"
I0128 16:55:26.266335       1 flags.go:64] FLAG: --auth-header-groups-field-name="x-remote-groups"
I0128 16:55:26.266340       1 flags.go:64] FLAG: --auth-header-groups-field-separator="|"
I0128 16:55:26.266343       1 flags.go:64] FLAG: --auth-header-user-field-name="x-remote-user"
I0128 16:55:26.266346       1 flags.go:64] FLAG: --auth-token-audiences="[]"
I0128 16:55:26.266350       1 flags.go:64] FLAG: --client-ca-file=""
I0128 16:55:26.266353       1 flags.go:64] FLAG: --config-file=""
I0128 16:55:26.266355       1 flags.go:64] FLAG: --help="false"
I0128 16:55:26.266359       1 flags.go:64] FLAG: --ignore-paths="[]"
I0128 16:55:26.266362       1 flags.go:64] FLAG: --insecure-listen-address=""
I0128 16:55:26.266365       1 flags.go:64] FLAG: --kubeconfig=""
I0128 16:55:26.266368       1 flags.go:64] FLAG: --log-backtrace-at=":0"
I0128 16:55:26.266373       1 flags.go:64] FLAG: --log-dir=""
I0128 16:55:26.266376       1 flags.go:64] FLAG: --log-file=""
I0128 16:55:26.266379       1 flags.go:64] FLAG: --log-file-max-size="1800"
I0128 16:55:26.266382       1 flags.go:64] FLAG: --log-flush-frequency="5s"
I0128 16:55:26.266385       1 flags.go:64] FLAG: --logtostderr="true"
I0128 16:55:26.266388       1 flags.go:64] FLAG: --oidc-ca-file=""
I0128 16:55:26.266391       1 flags.go:64] FLAG: --oidc-clientID=""
I0128 16:55:26.266394       1 flags.go:64] FLAG: --oidc-groups-claim="groups"
I0128 16:55:26.266397       1 flags.go:64] FLAG: --oidc-groups-prefix=""
I0128 16:55:26.266399       1 flags.go:64] FLAG: --oidc-issuer=""
I0128 16:55:26.266402       1 flags.go:64] FLAG: --oidc-sign-alg="[RS256]"
I0128 16:55:26.266408       1 flags.go:64] FLAG: --oidc-username-claim="email"
I0128 16:55:26.266411       1 flags.go:64] FLAG: --one-output="false"
I0128 16:55:26.266414       1 flags.go:64] FLAG: --proxy-endpoints-port="0"
I0128 16:55:26.266417       1 flags.go:64] FLAG: --secure-listen-address="0.0.0.0:8443"
I0128 16:55:26.266420       1 flags.go:64] FLAG: --skip-headers="false"
I0128 16:55:26.266423       1 flags.go:64] FLAG: --skip-log-headers="false"
I0128 16:55:26.266426       1 flags.go:64] FLAG: --stderrthreshold="2"
I0128 16:55:26.266429       1 flags.go:64] FLAG: --tls-cert-file=""
I0128 16:55:26.266432       1 flags.go:64] FLAG: --tls-cipher-suites="[TLS_ECDHE_RSA_WITH_AES_128_GCM_SHA256,TLS_ECDHE_ECDSA_WITH_AES_128_GCM_SHA256,TLS_ECDHE_RSA_WITH_AES_256_GCM_SHA384,TLS_ECDHE_ECDSA_WITH_AES_256_GCM_SHA384,TLS_ECDHE_RSA_WITH_CHACHA20_POLY1305,TLS_ECDHE_ECDSA_WITH_CHACHA20_POLY1305]"
I0128 16:55:26.266440       1 flags.go:64] FLAG: --tls-min-version="VersionTLS12"
I0128 16:55:26.266444       1 flags.go:64] FLAG: --tls-private-key-file=""
I0128 16:55:26.266446       1 flags.go:64] FLAG: --tls-reload-interval="1m0s"
I0128 16:55:26.266451       1 flags.go:64] FLAG: --upstream="http://127.0.0.1:8080/"
I0128 16:55:26.266454       1 flags.go:64] FLAG: --upstream-ca-file=""
I0128 16:55:26.266457       1 flags.go:64] FLAG: --upstream-client-cert-file=""
I0128 16:55:26.266460       1 flags.go:64] FLAG: --upstream-client-key-file=""
I0128 16:55:26.266463       1 flags.go:64] FLAG: --upstream-force-h2c="false"
I0128 16:55:26.266466       1 flags.go:64] FLAG: --v="10"
I0128 16:55:26.266469       1 flags.go:64] FLAG: --version="false"
I0128 16:55:26.266473       1 flags.go:64] FLAG: --vmodule=""
W0128 16:55:26.266730       1 kube-rbac-proxy.go:152] 
==== Deprecation Warning ======================

Insecure listen address will be removed.
Using --insecure-listen-address won't be possible!

The ability to run kube-rbac-proxy without TLS certificates will be removed.
Not using --tls-cert-file and --tls-private-key-file won't be possible!

For more information, please go to https://github.com/brancz/kube-rbac-proxy/issues/187

===============================================
I0128 16:55:26.266757       1 kube-rbac-proxy.go:272] Valid token audiences: 
I0128 16:55:26.266788       1 kube-rbac-proxy.go:363] Generating self signed cert as no cert is provided
I0128 16:55:26.435337       1 kube-rbac-proxy.go:414] Starting TCP socket on 0.0.0.0:8443
I0128 16:55:26.435488       1 kube-rbac-proxy.go:421] Listening securely on 0.0.0.0:8443

tkatila (Contributor) commented Jan 29, 2024

Thanks, the logs seem ok.

If you try to re-apply the GPU CR, does it still fail?

Llyr95 (Author) commented Jan 29, 2024

I have tried reinstalling the GPU plugin, but I don't understand the CR part: from my testing, the custom resource definitions are installed with the operator Helm chart, and I install the GPU plugin Helm chart afterwards.

So how can I create the operator's custom resources before it creates the webhook? Or is there something I don't understand?

I have tried running the daemonset via kubectl apply -k 'https://github.com/intel/intel-device-plugins-for-kubernetes/deployments/gpu_plugin/overlays/nfd_labeled_nodes?ref=<RELEASE_VERSION>' from https://intel.github.io/intel-device-plugins-for-kubernetes/cmd/gpu_plugin/README.html#install-with-nfd and that worked. However, the goal is to install the operator and GPU plugin Helm charts.

tkatila (Contributor) commented Jan 29, 2024

You can't really change the creation order. The operator chart creates the CRDs and the GPU plugin chart instantiates a CR.

The reason I asked about the re-creation is timing. The operator Helm chart installs the CRDs and the operator Pod, but helm doesn't (unless asked) wait for the Pods to become available. The webhook in particular takes some time to come up, and if the GPU CR is deployed during that window, it will fail.

With the helm CLI, if you install the operator and the GPU plugin back-to-back:
helm install operator intel/intel-device-plugins-operator && helm install gpu intel/intel-device-plugins-gpu --set nodeFeatureRule=true
the second install might fail because the webhook is not yet running.

The fix for this is to tell the helm CLI to wait for the deployment (--wait):
helm install --wait operator intel/intel-device-plugins-operator && helm install gpu intel/intel-device-plugins-gpu --set nodeFeatureRule=true

I'm not familiar with Flux, so I don't know how it handles this.
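For Flux specifically, HelmRelease objects can declare ordering with spec.dependsOn, which should avoid installing the GPU chart before the operator release is ready. A minimal sketch (release names, namespace, and repository name here are assumptions based on this thread, not a tested config):

```yaml
apiVersion: helm.toolkit.fluxcd.io/v2beta1
kind: HelmRelease
metadata:
  name: intel-device-plugin-gpu
  namespace: system
spec:
  # Do not install this release until the operator release is Ready.
  dependsOn:
    - name: intel-device-plugins-operator
  interval: 10m
  chart:
    spec:
      chart: intel-device-plugins-gpu
      version: v0.28.0
      sourceRef:
        kind: HelmRepository
        name: intel
  values:
    nodeFeatureRule: true
```

Note that dependsOn waits for the dependency HelmRelease to become Ready, which is not exactly the same as the webhook Pod answering requests, so a retry on the GPU release may still be needed.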

Another thing to try: when the GPU CR part has failed, wait a few seconds and then create the GPU CR from the device plugins project:
curl 'https://raw.githubusercontent.com/intel/intel-device-plugins-for-kubernetes/v0.28.0/deployments/operator/samples/deviceplugin_v1_gpudeviceplugin.yaml' | kubectl create -f -

If the creation succeeds, then the underlying issue is about timing. If it still fails, it's something related to the environment which requires more debug.
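The "wait a few seconds and retry" step above can be wrapped in a small helper; a hedged sketch (the retry function and the file name in the example are illustrative, not part of the project):

```shell
#!/bin/sh
# retry N CMD...: run CMD up to N times, sleeping 1s between attempts.
retry() {
  attempts=$1; shift
  i=1
  while [ "$i" -le "$attempts" ]; do
    if "$@"; then
      return 0   # command succeeded
    fi
    i=$((i + 1))
    sleep 1
  done
  return 1       # all attempts failed
}

# Example: keep re-trying the CR creation until the webhook answers
# (file name is illustrative):
# retry 30 kubectl create -f deviceplugin_v1_gpudeviceplugin.yaml
```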

eero-t (Contributor) commented Jan 29, 2024

> I have tried to reinstall the gpu plugin but I don't understand about the CR, from my testing, the custom resources definitions are installed with the operator helm charts. I install the gpu plugin helm charts after.

I'm not sure whether it's relevant here (Tuomas?), but the Helm tool supports only the initial CRD install, not CRD upgrades. AFAIK a proper upgrade of changed CRDs would require removing them manually before using e.g. Helm to install the new ones...

(The Helm project has a lot of tickets about that, and a long document about the corner cases that are the reason the Helm tool chooses not to support CRD removal/upgrades.)

tkatila (Contributor) commented Jan 29, 2024

The CRD install doesn't seem to be the issue. The failure would be different.

Llyr95 (Author) commented Jan 29, 2024

Ok, so I did some testing.

I tried to install the GPU CR manually as @tkatila suggested and got the same error:
Error from server (InternalError): error when creating "STDIN": Internal error occurred: failed calling webhook "mgpudeviceplugin.kb.io": failed to call webhook: Post "https://inteldeviceplugins-webhook-service.system.svc:443/mutate-deviceplugin-intel-com-v1-gpudeviceplugin?timeout=10s": context deadline exceeded

After removing everything, I tested helm install --wait operator intel/intel-device-plugins-operator --version=v0.28.0 -n system && helm install gpu intel/intel-device-plugins-gpu --set nodeFeatureRule=true --version=v0.28.0 -n system and again hit the same problem.

At one point, I tried your first command as a check (helm install operator intel/intel-device-plugins-operator && helm install gpu intel/intel-device-plugins-gpu --set nodeFeatureRule=true) and... it worked.

I didn't understand why, so I first guessed it was because I had omitted the version (I am on k8s v1.28.9), but that didn't make sense: if anything were to break, it would be the unpinned v0.29.0 plugins on k8s 1.28.9, not the other way around.

Afterwards, I found out that if I installed the operator in the system namespace, I would get the error described. I still don't know why; it may be something in my configuration that I overlooked, and I will dig deeper into that.

Thank you very much for your help!

tkatila (Contributor) commented Jan 30, 2024

Good that you got it working!

I don't understand why the 'system' namespace would cause the webhook to break. We typically use 'intel' or 'inteldeviceplugins' namespaces without issues.

As you are using Talos, have you decreased the pod-security for the namespace? I recall that Talos has quite strict pod-security settings by default that can cause issues with Pods not running. I wouldn't be surprised if there were some network access limitations as well.
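If pod-security is the culprit, the namespace labels can be inspected and, for testing, relaxed via the standard Pod Security Admission labels (a sketch; 'privileged' is the most permissive level and may be looser than actually needed):

```shell
# Show pod-security labels currently set on the namespace.
kubectl get namespace system --show-labels

# Temporarily relax enforcement for that namespace.
kubectl label namespace system \
  pod-security.kubernetes.io/enforce=privileged --overwrite
```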

mythi (Contributor) commented Feb 14, 2024

In intel/helm-charts#46 we showed that the namespace does not matter. Anyway, closing.

mythi closed this as completed Feb 14, 2024