kilo with flannel does not set wg endpoints on servers creation #38

Open
carlosrmendes opened this issue Feb 26, 2020 · 7 comments

@carlosrmendes

I'm using k3s with flannel and Kilo as an add-on, using in-cluster authorization (without mounting a kubeconfig) and with the following args:

--cni=false
--local=false
--encapsulate=never
--subnet=10.40.0.0/16
--hostname=$(NODE_NAME)
--compatibility=flannel
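
For reference, a quick way to confirm these flags on the running DaemonSet (assuming Kilo is deployed as a DaemonSet named kilo in kube-system; adjust to your deployment) is:

# prints the args of the kilo container in the DaemonSet
kubectl -n kube-system get ds kilo -o jsonpath='{.spec.template.spec.containers[0].args}'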

I have a master server in location A and a node in location B.
The master has a public IP and the node is behind a NAT.

I'm setting the following annotations on master:

kilo.squat.ai/force-endpoint: <master_public_ip>:51820
kilo.squat.ai/leader: "true"
kilo.squat.ai/location: A

And these on node:

kilo.squat.ai/force-endpoint: <nat_public_ip>:51820
kilo.squat.ai/persistent-keepalive: "5"
kilo.squat.ai/location: B
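
For completeness, this is roughly how I apply them with kubectl (node names are placeholders for my actual nodes):

kubectl annotate node <master-node> kilo.squat.ai/force-endpoint=<master_public_ip>:51820 kilo.squat.ai/leader=true kilo.squat.ai/location=A
kubectl annotate node <worker-node> kilo.squat.ai/force-endpoint=<nat_public_ip>:51820 kilo.squat.ai/persistent-keepalive=5 kilo.squat.ai/location=B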

I'm booting the master, then the node, and after both servers become ready I don't see each other's WireGuard endpoints configured on either host (checked with the wg command).
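
What I'm checking on each host is something like this (kilo0 being, as far as I know, the WireGuard interface Kilo creates):

# expect one peer section per remote node, each with an "endpoint:" line;
# in this case the endpoint lines are simply missing
sudo wg show kilo0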

Checking the logs of the Kilo pod on the master, I see these connection timed out errors at the end:

{"caller":"mesh.go:219","component":"kilo","level":"warn","msg":"no private key found on disk; generating one now","ts":"2020-02-26T10:06:11.347867897Z"}      {"caller":"main.go:217","msg":"Starting Kilo network mesh '12220b790da5ab7fbdcfb1db9d899bec9602261e-dirty'.","ts":"2020-02-26T10:06:12.017759649Z"}
E0226 10:06:12.145623       1 reflector.go:126] pkg/k8s/backend.go:391: Failed to list *v1alpha1.Peer: the server could not find the requested resource (get peers.kilo.squat.ai)
{"caller":"mesh.go:664","component":"kilo","level":"info","msg":"WireGuard configurations are different","ts":"2020-02-26T10:06:13.609772774Z"}
E0226 10:12:51.783433       1 streamwatcher.go:109] Unable to decode an event from the watch stream: read tcp 10.30.2.83:48064->10.43.0.1:443: read: connection timed out
E0226 10:13:20.455365       1 streamwatcher.go:109] Unable to decode an event from the watch stream: read tcp 10.30.2.83:48062->10.43.0.1:443: read: connection timed out

The :1107/health endpoint of Kilo on the master responds with 200 OK.
The logs of the Kilo pod on the node seem OK:

{"caller":"mesh.go:219","component":"kilo","level":"warn","msg":"no private key found on disk; generating one now","ts":"2020-02-26T10:18:01.36172736Z"}
{"caller":"main.go:217","msg":"Starting Kilo network mesh '12220b790da5ab7fbdcfb1db9d899bec9602261e-dirty'.","ts":"2020-02-26T10:18:01.387390299Z"}
{"caller":"mesh.go:664","component":"kilo","level":"info","msg":"WireGuard configurations are different","ts":"2020-02-26T10:18:01.853855818Z"}

To resolve this I had to delete the Kilo pod on the master; the new pod then configured the WireGuard endpoint correctly.
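
(The workaround, concretely, is just deleting the pod and letting the DaemonSet recreate it; the label and namespace below are assumptions based on the standard Kilo manifests:)

# find the Kilo pod scheduled on the master, then delete it
kubectl -n kube-system get pods -l app.kubernetes.io/name=kilo -o wide
kubectl -n kube-system delete pod <kilo-pod-on-master>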

@squat (Owner) commented Feb 26, 2020

Hmm what do you mean exactly by "on boot"? Pods don't have any concept of booting. When you deleted the Pod and it started up again, the endpoint worked?

Did you set the annotations before or after launching Kilo? That may help us find the bug.

Also, it is strange that of all nodes, it was the master node that had trouble connecting to the API. Note that Kilo uses the node's host network, so if Kilo on the master could not contact the API, then nothing on the master's host network could.
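
One quick check from the master's host itself (the service IP and port below are taken from your logs) would be something like:

# run on the master host, not inside a pod; even a 401/403 response
# would prove that the host can reach the API server
curl -k https://10.43.0.1:443/healthz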

@carlosrmendes (Author) commented Feb 26, 2020

By "on boot" I mean on server creation, and yes: if I delete the pod, once it starts up again the endpoint appears on each server and works.

I set all the annotations before running Kilo. Do you recommend setting them after? Setting them afterwards leads to the same behavior: when I delete the pod, the annotations are already present.

I tested again with debug log level. Those messages did not appear this time (so I think it was just a connection problem at the start), but the endpoints still do not appear on either server.
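
(For reference, debug logging here just means adding a flag like the one below to the args list above; I'm citing the flag name from memory, so double-check it against the Kilo binary's --help output:)

--log-level=debug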

kilo on master (location A):

{"caller":"mesh.go:219","component":"kilo","level":"warn","msg":"no private key found on disk; generating one now","ts":"2020-02-26T11:23:42.289243134Z"}      {"caller":"mesh.go:253","component":"kilo","level":"debug","msg":"using 10.30.3.145/24 as the private IP address","ts":"2020-02-26T11:23:42.346224611Z"}       {"caller":"mesh.go:254","component":"kilo","level":"debug","msg":"using 10.30.3.145/24 as the public IP address","ts":"2020-02-26T11:23:42.346302482Z"}        {"caller":"main.go:217","msg":"Starting Kilo network mesh '12220b790da5ab7fbdcfb1db9d899bec9602261e-dirty'.","ts":"2020-02-26T11:23:42.349515366Z"}
E0226 11:23:42.472142       1 reflector.go:126] pkg/k8s/backend.go:391: Failed to list *v1alpha1.Peer: the server could not find the requested resource (get peers.kilo.squat.ai)
{"caller":"mesh.go:410","component":"kilo","event":"add","level":"debug","msg":"syncing nodes","ts":"2020-02-26T11:23:43.571364151Z"}
{"caller":"mesh.go:412","component":"kilo","event":"add","level":"debug","msg":"processing local node","node":{"Endpoint":{"DNS":"","IP":"35.158.113.201","Port":51820},"Key":"","InternalIP":null,"LastSeen":0,"Leader":true,"Location":"cloud","Name":"ip-10-30-3-145","PersistentKeepalive":0,"Subnet":{"IP":"10.42.0.0","Mask":"https:////AA=="},"WireGuardIP":null},"ts":"2020-02-26T11:23:43.571425457Z"}
{"caller":"mesh.go:534","component":"kilo","level":"debug","msg":"local node differs from backend","ts":"2020-02-26T11:23:43.571543518Z"}
{"caller":"mesh.go:540","component":"kilo","level":"debug","msg":"successfully reconciled local node against backend","ts":"2020-02-26T11:23:43.601396712Z"}   {"caller":"mesh.go:664","component":"kilo","level":"info","msg":"WireGuard configurations are different","ts":"2020-02-26T11:23:43.606246649Z"}
{"caller":"mesh.go:410","component":"kilo","event":"update","level":"debug","msg":"syncing nodes","ts":"2020-02-26T11:23:43.667193852Z"}
{"caller":"mesh.go:412","component":"kilo","event":"update","level":"debug","msg":"processing local node","node":{"Endpoint":{"DNS":"","IP":"35.158.113.201","Port":51820},"Key":"UzhQNktNR09QaDVteXFSWUIyR1kxa0VDSUNMWXZJUVZyd0EycFJJQU5SOD0=","InternalIP":{"IP":"10.30.3.145","Mask":"https:////AA=="},"LastSeen":1582716223,"Leader":true,"Location":"cloud","Name":"ip-10-30-3-145","PersistentKeepalive":0,"Subnet":{"IP":"10.42.0.0","Mask":"https:////AA=="},"WireGuardIP":null},"ts":"2020-02-26T11:23:43.667236817Z"}
{"caller":"mesh.go:534","component":"kilo","level":"debug","msg":"local node differs from backend","ts":"2020-02-26T11:23:43.667271722Z"}
{"caller":"mesh.go:540","component":"kilo","level":"debug","msg":"successfully reconciled local node against backend","ts":"2020-02-26T11:23:43.703502148Z"}   {"caller":"mesh.go:410","component":"kilo","event":"update","level":"debug","msg":"syncing nodes","ts":"2020-02-26T11:23:43.709604249Z"}
{"caller":"mesh.go:412","component":"kilo","event":"update","level":"debug","msg":"processing local node","node":{"Endpoint":{"DNS":"","IP":"35.158.113.201","Port":51820},"Key":"UzhQNktNR09QaDVteXFSWUIyR1kxa0VDSUNMWXZJUVZyd0EycFJJQU5SOD0=","InternalIP":{"IP":"10.30.3.145","Mask":"https:////AA=="},"LastSeen":1582716223,"Leader":true,"Location":"cloud","Name":"ip-10-30-3-145","PersistentKeepalive":0,"Subnet":{"IP":"10.42.0.0","Mask":"https:////AA=="},"WireGuardIP":{"IP":"10.40.0.1","Mask":"https://8AAA=="}},"ts":"2020-02-26T11:23:43.709669493Z"}
{"caller":"mesh.go:410","component":"kilo","event":"update","level":"debug","msg":"syncing nodes","ts":"2020-02-26T11:23:55.728253064Z"}
{"caller":"mesh.go:412","component":"kilo","event":"update","level":"debug","msg":"processing local node","node":{"Endpoint":{"DNS":"","IP":"35.158.113.201","Port":51820},"Key":"UzhQNktNR09QaDVteXFSWUIyR1kxa0VDSUNMWXZJUVZyd0EycFJJQU5SOD0=","InternalIP":{"IP":"10.30.3.145","Mask":"https:////AA=="},"LastSeen":1582716223,"Leader":true,"Location":"cloud","Name":"ip-10-30-3-145","PersistentKeepalive":0,"Subnet":{"IP":"10.42.0.0","Mask":"https:////AA=="},"WireGuardIP":{"IP":"10.40.0.1","Mask":"https://8AAA=="}},"ts":"2020-02-26T11:23:55.728303646Z"}

carlosrmendes changed the title from "kilo with flannel does not set wg endpoints on boot" to "kilo with flannel does not set wg endpoints on servers creation" on Feb 26, 2020
@carlosrmendes (Author) commented Feb 26, 2020

After more testing, on both cluster creation and cluster failover (deleting nodes and booting new ones), those messages sometimes appear and Kilo does not recover by itself, yet the health check keeps reporting OK. If I delete the Kilo pods, everything gets configured correctly.

@carlosrmendes (Author) commented Feb 26, 2020

I figured out that after ~10 min Kilo recovers from that timeout:

E0226 12:07:01.409656       1 streamwatcher.go:109] Unable to decode an event from the watch stream: read tcp 10.30.1.67:43406->10.43.0.1:443: read: connection timed out
{"caller":"mesh.go:502","component":"kilo","error":"failed to set local node: failed to patch node: Patch https://10.43.0.1:443/api/v1/nodes/ip-10-30-1-67: read tcp 10.30.1.67:43410-\u003e10.43.0.1:443: read: connection timed out","level":"error","node":{"Endpoint":{"DNS":"","IP":"18.184.216.118","Port":51820},"Key":"dWdydWQ4bC9Jb1BaUEJrUkx6L0JYQWJWQVNuRXlkb2pTTUNMYVFCaUYyVT0=","InternalIP":{"IP":"10.30.1.67","Mask":"https:////AA=="},"LastSeen":1582718535,"Leader":true,"Location":"cloud","Name":"ip-10-30-1-67","PersistentKeepalive":0,"Subnet":{"IP":"10.42.8.0","Mask":"https:////AA=="},"WireGuardIP":{"IP":"10.40.0.1","Mask":"https://8AAA=="}},"ts":"2020-02-26T12:17:58.817633498Z"}
....

After that, the configuration is set correctly. Is it possible to configure that timeout?

Edit: this happened on the newly booted master server after I deleted all the k3s master servers.

@squat (Owner) commented Feb 26, 2020

Setting the labels before or after should be completely equivalent; neither is particularly recommended over the other. I was asking to find a possible avenue for debugging, i.e. maybe it was the re-configuration of the network that caused the bug.

How reproducible is this? If I spin up two k3s+flannel nodes and put Kilo on them, should I see the same problem? And then deleting the pods and letting them come back will fix the issue? I'll try to reproduce it and find a solution that way.
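
For the record, my rough repro plan would be something like the following; the manifest path is my assumption based on the Kilo repo layout, and the placeholders need filling in:

# location A: k3s server (flannel is the default CNI)
curl -sfL https://get.k3s.io | sh -
# location B: k3s agent joining the server
curl -sfL https://get.k3s.io | K3S_URL=https://<server-ip>:6443 K3S_TOKEN=<token> sh -
# annotate both nodes as above, then deploy Kilo in flannel compatibility mode
kubectl apply -f https://raw.githubusercontent.com/squat/kilo/master/manifests/kilo-k3s-flannel.yaml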

Thanks for reporting it :)

@squat (Owner) commented Mar 4, 2020

@carlosrmendes sorry for my delay in following up. Could you share a minimal reproducible set of steps? Is the key to reproducing the problem booting a new master and deleting the old masters? I'd like to take a look so I can better understand the root of the problem.

I somewhat suspect that the root of the problem here is that the Kubernetes API is not responding or is inaccessible for some reason. Kilo is even failing to list the Peers resource and gets an error from the API; that should not be happening in a working cluster.

Thanks for your help as always, @carlosrmendes 🎉

@carlosrmendes (Author)

Yes, the steps I follow are: delete the 2 old master nodes and boot 2 new ones, while keeping 2 or 3 worker nodes with Kilo running on them. Only the master nodes have a public IP and the leader annotation.
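
In terms of commands, the failover test is roughly this (node names are placeholders; booting the new masters is done by our provisioning, which I can't show here):

# remove the old masters from the cluster
kubectl delete node <old-master-1> <old-master-2>
# boot two new master nodes, wait for them to register and become Ready,
# then check the WireGuard endpoints from a worker
sudo wg show kilo0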
