k0s upgrade of single node cluster borked it #5287
Also, I feel I need to mention this: it does not seem to be a general problem — we upgraded another cluster this week (similar versions) and it worked nicely. The difference in configuration is that the other cluster is a multi-node setup that uses etcd for storage/state and Calico for networking. Otherwise, they are identical in that they run on OpenStack, Flatcar Linux, etc.
Here's what I understand to be the gist of the issue:
Is that correct? To rule out the obvious things, and to figure out what's up with the CoreDNS pods, could you maybe try to provide the output of the following:
Also, can you check if you're referencing custom images in your k0s configuration?
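One quick way to check might be something like this (the file name is just an example; use wherever your k0s/k0sctl config lives):

```sh
# Look for a custom images section in the k0s configuration.
grep -n -A5 'images:' k0sctl.yaml
```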
Not exactly — I'll try to explain: the reason why CoreDNS does not work is that it cannot access the k8s API. But it's not just CoreDNS; basically any pod that needs the k8s API (e.g. CoreDNS, grafana-agent, haproxy-ingress, metrics-server, etc.) has the same problem and crashes on start or eventually (depending on how it handles the failure). The error message is different from service to service, but ultimately they always fail with an i/o timeout.

k0s sysinfo (IPv6 is turned off)

kubectl get node
For the others, we tried to hack the service discovery by supplying the API server address directly via the service-discovery environment variables.
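A minimal sketch of that kind of override (the deployment name, address and port here are hypothetical, not the exact patch we applied):

```sh
# Point client-go's in-cluster service discovery directly at the API server
# instead of the broken kubernetes service IP (values are examples).
kubectl -n kube-system set env deployment/coredns \
  KUBERNETES_SERVICE_HOST=192.168.1.10 \
  KUBERNETES_SERVICE_PORT=6443
```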
But I can revert it.

events (kube-system)

We did not try to patch the metrics-server, so it keeps restarting currently:

get pods (kube-system)
Again, we tried to patch pods that needed the k8s API and that we needed. But we didn't patch metrics-server and it's still failing (since Friday). As an addition, here are logs from the (unpatched) metrics-server:

logs metrics-server (kube-system)

It just keeps repeating and restarting.
I have checked — no custom images.
From what I gathered, the kubernetes service itself looks fine: I can see it defined, and it has one endpoint (the API server). That service IP is used in the SD environment variables in the pod, but the connection does not work - i/o timeout. The same service IP works from the host OS though. I see no requests to it with tcpdump when the request comes from within a pod (e.g. metrics-server). It's like traffic never leaves the pod?
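For reference, roughly the kind of checks involved (a sketch; the debug pod image is just an example):

```sh
# The service and its endpoint are defined:
kubectl get svc kubernetes -n default
kubectl get endpoints kubernetes -n default

# From the host, the service IP answers (even a 401/403 from the API server
# is fine here; it proves the traffic gets through):
curl -k --max-time 5 https://10.96.0.1:443/version

# From inside a pod, the same request times out:
kubectl run curl-test --rm -it --restart=Never --image=curlimages/curl -- \
  curl -k --max-time 5 https://10.96.0.1:443/version
```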
@twz123 anything else I can look into? Or any idea what could be the culprit?
We tried to upgrade a single node cluster yesterday and it ended up in a somewhat borked state. The initial version of k0s was 1.27.5+k0s.0, and we upgraded one by one to e.g. 1.28.x (always latest patch release) all the way to 1.31.2. This is the config (extract from k0sctl's config):
The upgrades were done using k0sctl, and all the upgrades seemed to have worked until we got to 1.31.2. The upgrade hung and eventually errored that two pods in kube-system were not ready. One of the pods at the time was CoreDNS — but before someone jumps on "it's a DNS problem", it did not seem to be.
We had other pods in other namespaces failing; one common thing was that they are all using the Kubernetes API (via the kubernetes service that creates the virtual https://10.96.0.1). And all other failures were actually DNS, because CoreDNS was not running. The fun thing is, this IP is working from the node itself, but would not work from within a pod. Other service IPs did work (anything that was not this service, basically), but of course DNS was also broken, because CoreDNS was stuck in a weird crash loop because it couldn't access the Kubernetes API via the 10.96.0.1 address.

I think I read almost anything one can find on Google; the fact that the service is called kubernetes of course makes it extra hard to Google. Basically, testing the k8s API from the host and from a pod:
From the host: https://10.96.0.1:443 ✅
From a pod: https://10.96.0.1:443 👎

I also verified that other service IPs worked — anything but the kubernetes service was working on an IP level.

I also looked through conntrack and see a ton of connections in a waiting state. These probably reflect the timed-out API calls seen in the pod logs.
A couple things that we tried (that did not help):

- Sniffing for traffic to the service IP (tcpdump -i eth0 dst 10.96.0.1) on the host — nothing; it seems like traffic to the API does not leave the pod. I've also tried to capture traffic from specific pods, and I can generally see traffic, but nothing for the API service (see the capture sketch below).

The only thing that sort of helped — not really, but kinda — was patching all deployments and overriding the environment variables that do service discovery. Really ugly hack, but that made CoreDNS run at least. ;)
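For completeness, the per-pod capture mentioned above can be done roughly like this (a sketch; assumes crictl is available, and the container ID and interface name are placeholders):

```sh
# On the host: watch for traffic to the kubernetes service IP (nothing shows up).
tcpdump -ni eth0 dst 10.96.0.1

# Inside a specific pod's network namespace: find the container's PID via
# crictl, then run tcpdump in that netns (the pod's interface is usually eth0).
PID=$(crictl inspect --output go-template --template '{{.info.pid}}' <container-id>)
nsenter -t "$PID" -n tcpdump -ni eth0 host 10.96.0.1
```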
Anyway, I was hoping someone else might have insights into this. I still have the node/cluster around to test and prod.