
intel device plugins gpu : failed to call webhook, context deadline exceeded #1658

Closed

Llyr95 opened this issue Jan 28, 2024 · 10 comments

Llyr95 commented Jan 28, 2024

Describe the support request
I am trying to install the Intel device plugins GPU Helm chart after installing the operator Helm chart. It fails with:

Helm install failed for release system/intel-device-plugin-gpu with chart [email protected]: 1 error occurred: * Internal error occurred: failed calling webhook "mgpudeviceplugin.kb.io": failed to call webhook: Post "https://inteldeviceplugins-webhook-service.system.svc:443/mutate-deviceplugin-intel-com-v1-gpudeviceplugin?timeout=10s": context deadline exceeded

The helm chart is installed as a helmrelease via flux.

Thanks for your help.

System (please complete the following information if applicable):

  • OS version: Talos v1.6.1
  • Device plugins version: v0.28.0
  • Hardware info: 3 HP ProDesk Mini PCs (Gen 4)
tkatila (Contributor) commented Jan 29, 2024

Hi @Llyr95

The webhook takes some time to come up, so if you try to install the CR too soon, it may fail. Alternatively, the webhook part of the operator could be misbehaving.

Can you check if the controller-manager pod is fully up and running? If it isn't, can you share the logs:
kubectl logs -n <namespace> inteldeviceplugins-controller-manager-something-anything -c kube-rbac-proxy
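To confirm the operator is ready before applying the CR, something like the following should work (a sketch; the Deployment name is inferred from the pod name above and may differ in your install):

```shell
# Wait for the operator's controller-manager Deployment to report Available;
# replace <namespace> with the namespace the operator was installed into.
kubectl -n <namespace> wait deployment/inteldeviceplugins-controller-manager \
  --for=condition=Available --timeout=120s
```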

Llyr95 (Author) commented Jan 29, 2024

Hi @tkatila,

Thank you for answering.

Here are the logs you asked for:

I0128 16:55:26.266275       1 flags.go:64] FLAG: --add-dir-header="false"
I0128 16:55:26.266322       1 flags.go:64] FLAG: --allow-paths="[]"
I0128 16:55:26.266328       1 flags.go:64] FLAG: --alsologtostderr="false"
I0128 16:55:26.266331       1 flags.go:64] FLAG: --auth-header-fields-enabled="false"
I0128 16:55:26.266335       1 flags.go:64] FLAG: --auth-header-groups-field-name="x-remote-groups"
I0128 16:55:26.266340       1 flags.go:64] FLAG: --auth-header-groups-field-separator="|"
I0128 16:55:26.266343       1 flags.go:64] FLAG: --auth-header-user-field-name="x-remote-user"
I0128 16:55:26.266346       1 flags.go:64] FLAG: --auth-token-audiences="[]"
I0128 16:55:26.266350       1 flags.go:64] FLAG: --client-ca-file=""
I0128 16:55:26.266353       1 flags.go:64] FLAG: --config-file=""
I0128 16:55:26.266355       1 flags.go:64] FLAG: --help="false"
I0128 16:55:26.266359       1 flags.go:64] FLAG: --ignore-paths="[]"
I0128 16:55:26.266362       1 flags.go:64] FLAG: --insecure-listen-address=""
I0128 16:55:26.266365       1 flags.go:64] FLAG: --kubeconfig=""
I0128 16:55:26.266368       1 flags.go:64] FLAG: --log-backtrace-at=":0"
I0128 16:55:26.266373       1 flags.go:64] FLAG: --log-dir=""
I0128 16:55:26.266376       1 flags.go:64] FLAG: --log-file=""
I0128 16:55:26.266379       1 flags.go:64] FLAG: --log-file-max-size="1800"
I0128 16:55:26.266382       1 flags.go:64] FLAG: --log-flush-frequency="5s"
I0128 16:55:26.266385       1 flags.go:64] FLAG: --logtostderr="true"
I0128 16:55:26.266388       1 flags.go:64] FLAG: --oidc-ca-file=""
I0128 16:55:26.266391       1 flags.go:64] FLAG: --oidc-clientID=""
I0128 16:55:26.266394       1 flags.go:64] FLAG: --oidc-groups-claim="groups"
I0128 16:55:26.266397       1 flags.go:64] FLAG: --oidc-groups-prefix=""
I0128 16:55:26.266399       1 flags.go:64] FLAG: --oidc-issuer=""
I0128 16:55:26.266402       1 flags.go:64] FLAG: --oidc-sign-alg="[RS256]"
I0128 16:55:26.266408       1 flags.go:64] FLAG: --oidc-username-claim="email"
I0128 16:55:26.266411       1 flags.go:64] FLAG: --one-output="false"
I0128 16:55:26.266414       1 flags.go:64] FLAG: --proxy-endpoints-port="0"
I0128 16:55:26.266417       1 flags.go:64] FLAG: --secure-listen-address="0.0.0.0:8443"
I0128 16:55:26.266420       1 flags.go:64] FLAG: --skip-headers="false"
I0128 16:55:26.266423       1 flags.go:64] FLAG: --skip-log-headers="false"
I0128 16:55:26.266426       1 flags.go:64] FLAG: --stderrthreshold="2"
I0128 16:55:26.266429       1 flags.go:64] FLAG: --tls-cert-file=""
I0128 16:55:26.266432       1 flags.go:64] FLAG: --tls-cipher-suites="[TLS_ECDHE_RSA_WITH_AES_128_GCM_SHA256,TLS_ECDHE_ECDSA_WITH_AES_128_GCM_SHA256,TLS_ECDHE_RSA_WITH_AES_256_GCM_SHA384,TLS_ECDHE_ECDSA_WITH_AES_256_GCM_SHA384,TLS_ECDHE_RSA_WITH_CHACHA20_POLY1305,TLS_ECDHE_ECDSA_WITH_CHACHA20_POLY1305]"
I0128 16:55:26.266440       1 flags.go:64] FLAG: --tls-min-version="VersionTLS12"
I0128 16:55:26.266444       1 flags.go:64] FLAG: --tls-private-key-file=""
I0128 16:55:26.266446       1 flags.go:64] FLAG: --tls-reload-interval="1m0s"
I0128 16:55:26.266451       1 flags.go:64] FLAG: --upstream="http://127.0.0.1:8080/"
I0128 16:55:26.266454       1 flags.go:64] FLAG: --upstream-ca-file=""
I0128 16:55:26.266457       1 flags.go:64] FLAG: --upstream-client-cert-file=""
I0128 16:55:26.266460       1 flags.go:64] FLAG: --upstream-client-key-file=""
I0128 16:55:26.266463       1 flags.go:64] FLAG: --upstream-force-h2c="false"
I0128 16:55:26.266466       1 flags.go:64] FLAG: --v="10"
I0128 16:55:26.266469       1 flags.go:64] FLAG: --version="false"
I0128 16:55:26.266473       1 flags.go:64] FLAG: --vmodule=""
W0128 16:55:26.266730       1 kube-rbac-proxy.go:152] 
==== Deprecation Warning ======================

Insecure listen address will be removed.
Using --insecure-listen-address won't be possible!

The ability to run kube-rbac-proxy without TLS certificates will be removed.
Not using --tls-cert-file and --tls-private-key-file won't be possible!

For more information, please go to https://github.com/brancz/kube-rbac-proxy/issues/187

===============================================
I0128 16:55:26.266757       1 kube-rbac-proxy.go:272] Valid token audiences: 
I0128 16:55:26.266788       1 kube-rbac-proxy.go:363] Generating self signed cert as no cert is provided
I0128 16:55:26.435337       1 kube-rbac-proxy.go:414] Starting TCP socket on 0.0.0.0:8443
I0128 16:55:26.435488       1 kube-rbac-proxy.go:421] Listening securely on 0.0.0.0:8443

tkatila (Contributor) commented Jan 29, 2024

Thanks, the logs seem ok.

If you try to re-apply the GPU CR, does it still fail?

Llyr95 (Author) commented Jan 29, 2024

I have tried reinstalling the GPU plugin, but I don't understand the CR part: from my testing, the custom resource definitions are installed with the operator Helm chart, and I install the GPU plugin Helm chart afterwards.

So how can I create the operator's custom resources before it creates the webhook? Or is there something I don't understand?

I have tried running the daemonset via kubectl apply -k 'https://github.com/intel/intel-device-plugins-for-kubernetes/deployments/gpu_plugin/overlays/nfd_labeled_nodes?ref=<RELEASE_VERSION>' from https://intel.github.io/intel-device-plugins-for-kubernetes/cmd/gpu_plugin/README.html#install-with-nfd and that worked. However, the goal is to install the operator and GPU plugin Helm charts.

tkatila (Contributor) commented Jan 29, 2024

You can't really change the creation order. The operator chart creates the CRDs and the GPU plugin chart instantiates a CR.

The reason I asked about the re-creation is timing. The operator Helm chart installs the CRDs and the operator Pod, but helm doesn't (unless asked) wait for the Pods to become available. The webhook in particular takes some time to come up, and if the GPU CR is deployed during that window, it will fail.

With the helm CLI, if you install the operator and the GPU plugin back-to-back:
helm install operator intel/intel-device-plugins-operator && helm install gpu intel/intel-device-plugins-gpu --set nodeFeatureRule=true
the second install might fail because the webhook is not yet running.

The fix for this is to tell the helm CLI to wait for the deployment (--wait):
helm install --wait operator intel/intel-device-plugins-operator && helm install gpu intel/intel-device-plugins-gpu --set nodeFeatureRule=true

I'm not familiar with Flux, so I don't know how it handles this.
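For Flux specifically, HelmRelease objects can declare ordering with spec.dependsOn, which should avoid installing the GPU chart before the operator release is ready. A minimal sketch (release names, namespace, and repository name here are assumptions based on this thread, not a tested config):

```yaml
apiVersion: helm.toolkit.fluxcd.io/v2beta1
kind: HelmRelease
metadata:
  name: intel-device-plugin-gpu
  namespace: system
spec:
  # Do not install this release until the operator release is Ready.
  dependsOn:
    - name: intel-device-plugins-operator
  interval: 10m
  chart:
    spec:
      chart: intel-device-plugins-gpu
      version: v0.28.0
      sourceRef:
        kind: HelmRepository
        name: intel
  values:
    nodeFeatureRule: true
```

Note that dependsOn waits for the dependency HelmRelease to become Ready, which is not exactly the same as the webhook Pod answering requests, so a retry on the GPU release may still be needed.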

Another thing to try: when the GPU CR part has failed, wait a few seconds and then create the GPU CR from the device plugins project:
curl 'https://raw.githubusercontent.com/intel/intel-device-plugins-for-kubernetes/v0.28.0/deployments/operator/samples/deviceplugin_v1_gpudeviceplugin.yaml' | kubectl create -f -

If the creation succeeds, then the underlying issue is about timing. If it still fails, it's something related to the environment which requires more debug.
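The "wait a few seconds and retry" step above can be wrapped in a small helper; a hedged sketch (the retry function and the file name in the example are illustrative, not part of the project):

```shell
#!/bin/sh
# retry N CMD...: run CMD up to N times, sleeping 1s between attempts.
retry() {
  attempts=$1; shift
  i=1
  while [ "$i" -le "$attempts" ]; do
    if "$@"; then
      return 0   # command succeeded
    fi
    i=$((i + 1))
    sleep 1
  done
  return 1       # all attempts failed
}

# Example: keep re-trying the CR creation until the webhook answers
# (file name is illustrative):
# retry 30 kubectl create -f deviceplugin_v1_gpudeviceplugin.yaml
```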

eero-t (Contributor) commented Jan 29, 2024

> I have tried to reinstall the gpu plugin but I don't understand about the CR, from my testing, the custom resources definitions are installed with the operator helm charts. I install the gpu plugin helm charts after.

I'm not sure whether it's relevant here (Tuomas?), but the Helm tool supports only the initial CRD install, not CRD upgrades. AFAIK a proper upgrade of changed CRDs would require removing them manually before using e.g. Helm to install the new ones...

(The Helm project has a lot of tickets about that, and a long document about the corner cases that are the reason the Helm tool chooses not to support CRD removal/upgrades.)

tkatila (Contributor) commented Jan 29, 2024

The CRD install doesn't seem to be the issue. The failure would be different.

Llyr95 (Author) commented Jan 29, 2024

Ok, so I did some testing.

I tried to install the GPU CR manually as @tkatila suggested and got the same error:
Error from server (InternalError): error when creating "STDIN": Internal error occurred: failed calling webhook "mgpudeviceplugin.kb.io": failed to call webhook: Post "https://inteldeviceplugins-webhook-service.system.svc:443/mutate-deviceplugin-intel-com-v1-gpudeviceplugin?timeout=10s": context deadline exceeded

After removing everything, I tested helm install --wait operator intel/intel-device-plugins-operator --version=v0.28.0 -n system && helm install gpu intel/intel-device-plugins-gpu --set nodeFeatureRule=true --version=v0.28.0 -n system and again hit the same problem.

At one point, I tried your first command as a check (helm install operator intel/intel-device-plugins-operator && helm install gpu intel/intel-device-plugins-gpu --set nodeFeatureRule=true) and... it worked.

I didn't understand why, so I first guessed it was because I had omitted the version (I am on k8s v1.28.9), but that didn't make sense: if anything were to break, it would be the unpinned v0.29.0 plugins on k8s 1.28.9, not the other way around.

Afterwards, I found out that if I installed the operator in the system namespace, I would get the error described. I still don't know why; it may be something in my configuration that I overlooked, and I will dig deeper into that.

Thank you very much for your help!

tkatila (Contributor) commented Jan 30, 2024

Good that you got it working!

I don't understand why the 'system' namespace would cause the webhook to break. We typically use 'intel' or 'inteldeviceplugins' namespaces without issues.

As you are using Talos, have you decreased the pod-security for the namespace? I recall that Talos has quite strict pod-security settings by default that can cause issues with Pods not running. I wouldn't be surprised if there were some network access limitations as well.
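If pod-security is the culprit, the namespace labels can be inspected and, for testing, relaxed via the standard Pod Security Admission labels (a sketch; 'privileged' is the most permissive level and may be looser than actually needed):

```shell
# Show pod-security labels currently set on the namespace.
kubectl get namespace system --show-labels

# Temporarily relax enforcement for that namespace.
kubectl label namespace system \
  pod-security.kubernetes.io/enforce=privileged --overwrite
```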

mythi (Contributor) commented Feb 14, 2024

In intel/helm-charts#46 we showed that the namespace does not matter. Anyway, closing.

mythi closed this as completed Feb 14, 2024