no kind "Deployment" is registered for version "apps\x10v1" in scheme #4543

Open
mlb5000 opened this issue Jun 5, 2024 · 4 comments

mlb5000 commented Jun 5, 2024

I am suddenly having a very strange issue with my 4-node microk8s cluster. I am unable to list Deployments, but only in the default namespace. Other namespaces are fine, as is getting the details of a specific deployment.

This works

$ kubectl get deployment/nffc-worker
NAME          READY   UP-TO-DATE   AVAILABLE   AGE
nffc-worker   3/3     3            3           203d

As does this

$ kubectl get deployments -n userservices
NAME            READY   UP-TO-DATE   AVAILABLE   AGE
invbot-server   2/2     2            2           5d3h

However, as soon as I try to list all deployments in default, I get this error from kubectl

$ kubectl get deployments
Error from server: no kind "Deployment" is registered for version "apps\x10v1" in scheme "pkg/runtime/scheme.go:100"

The kubelite logs show this failure internally

Jun 05 11:22:47 node1 microk8s.daemon-kubelite[1059653]: W0605 11:22:47.834995 1059653 reflector.go:535] storage/cacher.go:/deployments: failed to list *apps.Deployment: no kind "Deployment" is registered for version "apps\x10v1" in scheme "pkg/runtime/scheme.go:100"

Obviously apps\x10v1 is wrong; it should be apps/v1 (the \x10 is a single byte, 0x10, sitting where the / should be). But I cannot figure out where this corrupted value is coming from. How can I figure out which Deployment in my configuration has this corrupted value, and how can I repair it?

Worse, because the system can no longer list all deployments, all resources in the default namespace now seem to be frozen. Things like kubectl rollout restart deployment don't finish their restart work, and even explicitly deleting a deployment doesn't remove its pods from the cluster.

mlb5000 commented Jun 5, 2024

I should note that

  1. This affects ALL nodes in the cluster
  2. I have restarted microk8s on all nodes using sudo snap restart microk8s, but it did not fix anything

mlb5000 commented Jun 5, 2024

Ok, so I managed to isolate the corrupted deployment configuration. Somehow there is a corrupted protocol buffer in the dqlite database.

Isolate the corrupted deployment

On any of the nodes, run

sudo /snap/microk8s/current/bin/dqlite \
  --cert /var/snap/microk8s/current/var/kubernetes/backend/cluster.crt \
  --key /var/snap/microk8s/current/var/kubernetes/backend/cluster.key \
  --servers file:///var/snap/microk8s/current/var/kubernetes/backend/cluster.yaml \
  k8s

Then in dqlite run

dqlite> select name from kine where name like '%deployments/default%';

I then copied the deployment names, dropped them into Sublime Text, and created a script with a bunch of lines that look like this:

echo "search" && microk8s kubectl get deployments/search-worker -o yaml | grep "apiVersion:"

This will error on the specific deployment that is causing the problem, and print apiVersion: apps/v1 for everything else.
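
The same check can also be scripted as a loop instead of hand-writing the lines. A minimal sketch, assuming the names from the dqlite query were trimmed down to bare deployment names and saved one per line in a hypothetical deployments.txt:

#!/usr/bin/env bash
# Print the apiVersion of each deployment named in deployments.txt.
# The corrupted deployment returns the scheme error instead of apps/v1.
while read -r name; do
  echo "== $name"
  microk8s kubectl get "deployments/$name" -o yaml | grep "apiVersion:"
done < deployments.txt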

View the configuration

Back in dqlite, grab the BLOB data for that particular bad record, and the BLOB data for a good record while you're at it. The data is a binary protocol buffer; the byte values below are shown as decimal.

The bad record's data starts with 107 56 115 0 10 21 10 7 97 112 112 115 16 118 49, the latter part of which reads as apps\x10v1.

The good record's data starts with 107 56 115 0 10 21 10 7 97 112 112 115 47 118 49, the latter part of which reads as apps/v1, which is what we want.
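
For reference, one way to pull those bytes out of dqlite is with SQLite's hex() function. This is a sketch that assumes the serialized object lives in the kine table's value column; the order by/limit just grabs the newest matching row, and the search-worker name is reused from the example above:

dqlite> select hex(value) from kine where name like '%deployments/default/search-worker%' order by id desc limit 1;

Reading the hex: 61 70 70 73 is apps, the bad byte shows up as 10, and 76 31 is v1. In the good record that position holds 2F, which is decimal 47, the ASCII code for /.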

There doesn't appear to be any other corruption in here, but even if there is, it's this first part of the protocol buffer that I need to fix. Then I can just delete and recreate the deployment through the API as expected.

Basically, I either need to patch that 16 to a 47 in the dqlite database, or find a way to remove that registry entry. However, I'm not sure how to do either in a way where the change will propagate to the other nodes like it's supposed to.
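
If removing the entry turns out to be the way to go, the delete would look roughly like this. The name pattern is the same placeholder as above, so verify the exact row with a select first and back up /var/snap/microk8s/current/var/kubernetes/backend before changing anything:

dqlite> select name from kine where name like '%deployments/default/search-worker%';
dqlite> delete from kine where name like '%deployments/default/search-worker%';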

mlb5000 commented Jun 5, 2024

Explicitly deleting that record in the dqlite database unstuck the deployment lifecycle across the entire cluster, and things are now back in working order.

However, someone from the microk8s team should look into this, since it feels very wrong to me that a corrupted protocol buffer should ever find its way into the dqlite database, especially when the corruption completely knocks out basic reliability/recovery functionality.

mlb5000 commented Jun 5, 2024

Basically, the root cause here seems to be the dqlite record being persisted with a resource type + version combination that does not exist in kubectl api-resources.
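
For example, the valid combination is easy to confirm from the client side; this lists the apps group's resources, and Deployment should only ever show up there as apps/v1:

$ kubectl api-resources --api-group=apps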

Feels like the solution here is two-fold

  1. the resource KIND + APIVERSION combination should be validated prior to persistence
  2. apiserver should be updated to be more resilient to record corruption like this. The fact that a single deployment record cannot be read should not prevent things like list commands from succeeding.

I don't know if microk8s has its own apiserver implementation, or if this issue really belongs in the Kubernetes mainline, but a single corrupted byte in a single record in the dqlite database shouldn't have such an outsized effect on the platform.
