Kubernetes namespace watches sometimes hang indefinitely #83
Currently, the k8s namer builds a single client at configuration time. In theory this is fine, but we have observed odd hanging behavior, as described by #83. Anecdotally, establishing a new client for each namespace has resolved this in the past. This is not a real _fix_ for the issue, but it may offer some relief while we establish a more robust fix (a sketch of the idea follows).
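For illustration only, here is a minimal sketch of the per-namespace-client idea, assuming a Finagle HTTP client; the class and method names (`NsWatcher`, `clientFor`) are made up for this example and are not the actual linkerd code:

```scala
import com.twitter.finagle.{Http, Service}
import com.twitter.finagle.http.{Request, Response}

// Hypothetical sketch: build a fresh Finagle HTTP client per namespace watch,
// so a wedged watch on one namespace cannot stall lookups for another.
// `apiHost` and `apiPort` stand in for the configured k8s master address.
class NsWatcher(apiHost: String, apiPort: Int) {

  // Called once per namespace instead of reusing a single shared client.
  def clientFor(namespace: String): Service[Request, Response] =
    Http.client
      .withLabel(s"k8s-$namespace")
      .newService(s"$apiHost:$apiPort")
}
```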
This issue no longer manifests following #137, though an underlying bug may still exist.
Reported issue to finagle:
I'm still seeing this issue with 0.3.0. Any tips on the best way to debug this?
@jimmidyson if you run linkerd with debug-level logging enabled and share the output, that would help. If you can grab a snapshot of the stats on :9990/admin/metrics.json when the process is in this state, that would be helpful, too. Also, we'll be better able to help if you share your linkerd config (especially your dtab) -- this is a common source of confusion. Finally, would you mind explaining how this state arises? What version of Kubernetes you're running, etc.? Thanks!
Thanks for the tips. Very possible this is just a misunderstanding about the dtab - have found it slightly confusing so have tried to make it as simple as possible:

```yaml
admin:
  port: 9990

namers:
- kind: io.l5d.experimental.k8s
  host: localhost
  port: 8443
  tls: true
  tlsWithoutValidation: true
  authTokenFile: /home/jdyson/tmp/token

routers:
- protocol: http
  baseDtab: |
    /http/1.1/GET => /io.l5d.k8s/zipkin/http;
  label: int
  servers:
  - port: 4140
    ip: 0.0.0.0
```

Running linkerd with this config, here's the (hopefully) relevant part of the logs:
Here's the output from :9990/admin/metrics.json: https://gist.github.com/jimmidyson/c05a0e7959b52a5dde50bb2837ea3782 If it does turn out to be config problems, hanging isn't great of course.
Sounds like a proper bug. Reopening and assigning to Oliver.
After some internal discussion, Alex will take a look at this.
Hi @jimmidyson, thanks for the info. This behavior can happen if linkerd is unable to connect to the k8s API. We've got a couple of issues tracking making this situation easier to debug, including adding better logging (#272) and adding a timeout for name resolution (#273). In your particular case it looks like you've got the k8s namer configured to talk to localhost. Do you have the k8s master running on localhost? Or are you running through a proxy? In the case of a proxy, it's likely that TLS won't work and you'll need to turn that off in your config. Can you check if you're able to hit the k8s API manually on localhost:8443 over https?
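As an editorial aside, the timeout mentioned in #273 can be pictured with a small sketch like the following; the helper name `withTimeout` and the 10-second value are assumptions for illustration, not the actual patch:

```scala
import com.twitter.util.{Duration, Future, JavaTimer, Timer}

// Illustrative only: cap how long a k8s API lookup may stay outstanding so a
// wedged connection surfaces as a TimeoutException instead of hanging forever.
implicit val timer: Timer = new JavaTimer()

def withTimeout[T](lookup: Future[T]): Future[T] =
  lookup.raiseWithin(Duration.fromSeconds(10))
```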
@adleong I'm running this just for dev: local builds from master, running locally, and I have Kubernetes (well, OpenShift actually) running locally too, accessible at https://localhost:8443. TLS works, but of course with cert validation disabled (as I tried to do with the config).

So yeah, trying to debug through the namer, but being a bit of a n00b to Scala makes it a bit difficult. One thing I have noticed is that the filters don't seem to work properly. I added a quick System.out (yeah, dirty but effective :-b) to both the SetHostFilter & AuthFilter, & the SetHostFilter doesn't get called for some reason:

When I swap round the order of the filters:

So it could be something odd going on in the filter stack?
So testing the calls via curl:

works fine, so I know both the API server URL & token file are correct.
Thanks for digging into this, @jimmidyson! Agreed that timeouts are important; we're adding them soon. As for the behavior you discovered with the filters: that's very strange. The behavior you found suggests that something is going wrong in the AuthFilter, but if you take a look at the contents of that filter, it's pretty simple. We're going to try setting up Kubernetes locally to see if we can reproduce the issue. In the meantime, I encourage you to keep digging and let us know if you find anything.
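For readers following along, a token-injecting filter of the kind being discussed typically has roughly this shape; this is a reconstruction for illustration (the class name `ExampleAuthFilter` is made up), not linkerd's actual source:

```scala
import com.twitter.finagle.{Service, SimpleFilter}
import com.twitter.finagle.http.{Request, Response}
import com.twitter.util.Future

// Rough shape of an auth filter: add an Authorization header built from the
// token, then pass the request on to the next service in the stack.
class ExampleAuthFilter(token: String) extends SimpleFilter[Request, Response] {
  def apply(req: Request, service: Service[Request, Response]): Future[Response] = {
    req.headerMap.set("Authorization", s"Bearer $token")
    service(req)
  }
}
```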
Could you also post the full linkerd log when running with debug logging enabled?
I've turned off auth in the API server & removed the auth filter... & querying works... I cannot explain that at all, as you said AuthFilter is ridiculously simple. Disabling auth in the API server & leaving the auth filter in place hangs as before.
It's this line in AuthFilter that's hanging:

Here's the full log.
I've found a potential explanation. Does your auth token file end with a newline? Attempting to set a header value containing a newline causes an exception. Unfortunately, that exception is getting swallowed and not logged today, which makes this kind of thing very hard to debug. We'll definitely fix that in #272. In the meantime, try stripping out any newlines from your token file and see if that helps.
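A minimal sketch of the failure mode and the suggested workaround, assuming the token file path from the config above (illustrative only, not linkerd code):

```scala
import scala.io.Source

// A token file saved with a trailing newline yields something like "abc123\n";
// using that as an Authorization header value trips header validation and throws.
val rawToken = Source.fromFile("/home/jdyson/tmp/token").mkString

// Stripping surrounding whitespace before building the header avoids the problem.
val token = rawToken.trim
val authHeader = s"Bearer $token"
```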
You've got it - newline removed & it works great. Well, it's a newline char without a new line (just learnt about the eol option in vim). That is really annoying - sorry for the noise (but yeah, something in the logs would be great).
Amazing.
Glad to hear it! This definitely showcases the need for better error logging. Stay tuned.
Thanks for your help figuring this one out - now I can get on & play more :)
Each watch on the Kubernetes API made by the io.l5d.k8s namer uses a separate finagle client. This causes stat name conflicts, which result in incorrect reporting of client stats. It may also have an adverse effect on memory usage. We switch the io.l5d.k8s namer to use a single finagle client for all watches on the Kubernetes API. This isolation of watches to separate finagle clients was originally done to work around #83. However, since then, the [underlying issue](twitter/finagle#471) has been [fixed in finagle](twitter/finagle#481), making this no longer necessary. Signed-off-by: Alex Leong <[email protected]>
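In contrast to the per-namespace workaround sketched earlier, the direction described above can be pictured like this; the class name `K8sApiClient` is an assumption for illustration, not the actual linkerd code:

```scala
import com.twitter.finagle.{Http, Service}
import com.twitter.finagle.http.{Request, Response}

// Hypothetical sketch: one shared Finagle client for all namespace watches.
class K8sApiClient(apiHost: String, apiPort: Int) {

  // Built once; the label gives a single, stable stats scope instead of one
  // conflicting scope per watch.
  private val client: Service[Request, Response] =
    Http.client
      .withLabel("k8s-api")
      .newService(s"$apiHost:$apiPort")

  // Every watch, regardless of namespace, goes through the same client.
  def clientFor(namespace: String): Service[Request, Response] = client
}
```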
We've observed a bug in the kubernetes namer where name resolution (and requests) hang indefinitely when initializing a watch on a second namespace.
For example, in a test cluster, we launch an application into the `prod` namespace and do continuous deployment to a `pre-prod` namespace that may be accessed with an override. The configuration contains:
and a base dtab like:
Requests are routed to the pre-prod namespace with an override like:
Initially, we observe requests through prod operating as expected, resolving through the k8s namer:
And we see no further k8s logging after this. Furthermore, because the namer never satisfies this lookup, the request never succeeds and remains pending until the client disconnects.
Something appears to be wrong in the kubernetes namer that prevents this second watch from succeeding.