Kubernetes namespace watches sometimes hang indefinitely #83

Closed
olix0r opened this issue Feb 9, 2016 · 20 comments

@olix0r
Member

olix0r commented Feb 9, 2016

We've observed a bug in the kubernetes namer where name resolution (and requests) hang indefinitely when initializing a watch on a second namespace.

For example, in a test cluster, we launch an application into the prod namespace and do continuous deployment to a pre-prod namespace that may be accessed with an override.

The configuration contains:

namers:
- kind: io.l5d.experimental.k8s
  prefix: /ns
  tlsWithoutValidation: true
  authTokenFile: /var/run/secrets/kubernetes.io/serviceaccount/token

and a base dtab like:

  /iface => /io.l5d.k8s/prod;
  /method => /$/io.buoyant.http.anyMethodPfx/iface ;
  /http/1.1 => /method ;
  ...

Requests are routed to the pre-prod namespace with an override like:

/iface => /io.l5d.k8s/pre-prod | /$/fail;
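
For illustration, assuming this build honors per-request dtab overrides via an override header (whether that is Finagle's Dtab-Local header or linkerd's l5d-dtab header here is an assumption on my part), the override can be exercised with something like:

  curl -H 'l5d-dtab: /iface => /io.l5d.k8s/pre-prod | /$/fail;' http://localhost:4140/some/path

where localhost:4140 and the request path are placeholders for the router's server address and a real route.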

Initially, we observe requests through prod operating as expected, resolving through the k8s namer:

D 0126 23:45:28.991 THREAD29: k8s initializing service0
...
D 0126 23:45:29.327 THREAD30: k8s ns prod initial state: service0
...
D 0126 23:45:42.955 THREAD30: k8s lookup: /pre-prod/http/service0 /pre-prod/http/service0
D 0126 23:45:42.956 THREAD30: k8s initializing pre-prod

We see no further k8s logging after this. Furthermore, because the namer never satisfies this lookup, the request never succeeds and remains pending until the client disconnects.

Something appears to be wrong in the kubernetes namer that prevents this second watch from succeeding.

@olix0r olix0r added the bug label Feb 9, 2016
@olix0r olix0r added this to the 0.0.11 milestone Feb 9, 2016
@olix0r olix0r added the ready label Feb 9, 2016
@gtcampbell gtcampbell removed this from the 0.0.12 milestone Feb 22, 2016
olix0r added a commit that referenced this issue Feb 25, 2016
Currently, the k8s namer builds a single client at configuration time.  In
theory, this is fine, but we have observed odd hanging behavior, as described
in #83.

Anecdotally, establishing a new client for each namespace has resolved this
issue in the past.  This is not a real _fix_, but it may offer some relief
while we establish a more robust one.
@olix0r olix0r removed the ready label Feb 25, 2016
@siggy
Member

siggy commented Feb 26, 2016

This issue no longer manifests following #137, though an underlying bug may still exist.

@siggy
Member

siggy commented Feb 29, 2016

Reported issue to finagle:
twitter/finagle#471

@siggy siggy closed this as completed Feb 29, 2016
@siggy siggy removed the in progress label Feb 29, 2016
@jimmidyson

I'm still seeing this issue with 0.3.0. Any tips on the best way to debug this?

@olix0r
Member Author

olix0r commented Apr 11, 2016

@jimmidyson if you run linkerd with the -log.level=DEBUG and -com.twitter.finagle.tracing.debugTrace=true flags, your logs should be a bit more verbose.

If you can grab a snapshot of the stats on :9990/admin/metrics.json when the process is in this state, that would be helpful, too.
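
Concretely, assuming the self-executing linkerd-0.3.0-exec release artifact (the artifact name and config path below are placeholders), that looks something like:

  ./linkerd-0.3.0-exec -log.level=DEBUG -com.twitter.finagle.tracing.debugTrace=true path/to/config.yaml
  curl localhost:9990/admin/metrics.json > metrics.json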

Also, we'll be better able to help if you share your linkerd config (especially your dtab) -- this is a common source of confusion.

Finally, would you mind explaining how this state arises? What version of Kubernetes are you running, etc.?

Thanks!

@jimmidyson

Thanks for the tips. It's very possible this is just a misunderstanding about the dtab - I've found it slightly confusing, so I've tried to make it as simple as possible:

admin:
  port: 9990

namers:
- kind: io.l5d.experimental.k8s
  host: localhost
  port: 8443
  tls: true
  tlsWithoutValidation: true
  authTokenFile: /home/jdyson/tmp/token

routers:
- protocol: http
  baseDtab: |
    /http/1.1/GET => /io.l5d.k8s/zipkin/http;
  label: int
  servers:
  - port: 4140
    ip: 0.0.0.0

I'm running curl -H 'Host: zipkin-web-console' localhost:4140. The service zipkin-web-console does exist in the namespace zipkin with the named port http. It's almost like it's not querying at all.
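
For reference, my understanding of the rewrite that request should go through is roughly:

  /http/1.1/GET/zipkin-web-console             (from the method and Host header)
  /io.l5d.k8s/zipkin/http/zipkin-web-console   (after the baseDtab rule)
  k8s lookup of port http on service zipkin-web-console in namespace zipkin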

Here's the (hopefully) relevant part of the logs:

D 0411 21:25:53.631 THREAD19: k8s lookup: /zipkin/http/zipkin-web-console /zipkin/http/zipkin-web-console
D 0411 21:25:53.649 THREAD19: k8s initializing zipkin
0411 21:25:53.652 36e4dfa7003f8c8b.36e4dfa7003f8c8b<:36e4dfa7003f8c8b] BinaryAnnotation(namer.dtab.base,/http/1.1/GET=>/io.l5d.k8s/zipkin/http)
0411 21:25:53.652 36e4dfa7003f8c8b.36e4dfa7003f8c8b<:36e4dfa7003f8c8b] BinaryAnnotation(namer.dtab.local,)
0411 21:25:53.652 36e4dfa7003f8c8b.36e4dfa7003f8c8b<:36e4dfa7003f8c8b] BinaryAnnotation(namer.path,/http/1.1/GET/zipkin-web-console)

Here's the output from :9990/admin/metrics.json: https://gist.github.com/jimmidyson/c05a0e7959b52a5dde50bb2837ea3782

If it does turn out to be a config problem, hanging isn't great, of course.

@wmorgan
Member

wmorgan commented Apr 11, 2016

Sounds like a proper bug. Reopening and assigning to Oliver.

@wmorgan wmorgan reopened this Apr 11, 2016
@wmorgan wmorgan assigned olix0r and unassigned siggy Apr 11, 2016
@wmorgan
Member

wmorgan commented Apr 11, 2016

After some internal discussion, Alex will take a look at this.

@wmorgan wmorgan assigned adleong and unassigned olix0r Apr 11, 2016
@adleong
Member

adleong commented Apr 12, 2016

Hi @jimmidyson, thanks for the info. This behavior can happen if linkerd is unable to connect to the k8s API. We've got a couple of issues tracking making this situation easier to debug, including adding better logging (#272) and adding a timeout for name resolution (#273).

In your particular case it looks like you've got the k8s namer configured to talk to localhost. Do you have the k8s master running on localhost? Or are you running through a proxy? In the case of a proxy, it's likely that TLS won't work and you'll need to turn that off in your config.

Can you check if you're able to hit the k8s API manually on localhost:8443 over https?

@jimmidyson

@adleong I'm running this just for dev - local builds from master, running locally - and I have Kubernetes (well, OpenShift actually) running locally too, accessible at https://localhost:8443. TLS works, but with cert validation disabled (as I tried to do with the config). So yes, curl https://localhost:8443 works fine. I've also tried setting up a little proxy to capture requests and can see that no requests reach the proxy on localhost. The timeout for name resolution is really important - it would help with quick diagnosis of issues like this, but in production consumers would also want a quick response if lookups fail, just like DNS failures.

I'm trying to debug through the namer, but being a bit of a n00b to Scala makes it a bit difficult. One thing I have noticed is that the filters don't seem to work properly. I added a quick System.out (yeah, dirty but effective :-b) to both the SetHostFilter & the AuthFilter, & the SetHostFilter doesn't get called for some reason:

Authfilter Request("GET /api/v1/namespaces/zipkin/endpoints", from 0.0.0.0/0.0.0.0:0)

When I swap around the order of the filtered calls in namer/k8s/src/main/scala/io/l5d/k8s.scala, it does actually call the SetHostFilter too:

SetHostFilter Request("GET /api/v1/namespaces/zipkin/endpoints", from 0.0.0.0/0.0.0.0:0)
Authfilter Request("GET /api/v1/namespaces/zipkin/endpoints", from 0.0.0.0/0.0.0.0:0)

So could it be something odd going on in the filter stack?

@jimmidyson

So testing the calls via curl:

curl -k -H "Authorization: Bearer $(cat ~/tmp/token)" https://localhost:8443/api/v1/namespaces/zipkin/endpoints

works fine, so I know both the API server URL & the token file are correct.

@adleong
Member

adleong commented Apr 12, 2016

thanks for digging into this, @jimmidyson!

Agreed that timeouts are important; we're adding them soon. As for the behavior you discovered with the filters: that's very strange. It suggests that something is going wrong in the AuthFilter, but if you take a look at the contents of that filter, it's pretty simple.
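
For reference, a bearer-token filter of that shape is roughly the following (a sketch, not the exact source):

  import com.twitter.finagle.{Service, SimpleFilter}
  import com.twitter.finagle.http.{Request, Response}
  import com.twitter.util.Future

  // Adds the bearer token to every request sent to the k8s API.
  class AuthFilter(token: String) extends SimpleFilter[Request, Response] {
    def apply(req: Request, service: Service[Request, Response]): Future[Response] = {
      req.headerMap("Authorization") = s"Bearer $token"
      service(req)
    }
  }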

We're going to try setting up Kubernetes locally to see if we can reproduce the issue. In the meantime, I encourage you to keep digging and let us know if you find anything.

@adleong
Member

adleong commented Apr 12, 2016

Could you also post the full linkerd log when running with -log.level=DEBUG and -com.twitter.finagle.tracing.debugTrace=true and exercising the issue?

@jimmidyson

I've turned off auth in the API server & removed the auth filter... & querying works... I cannot explain that at all - as you said, the AuthFilter is ridiculously simple.

Disabling auth in API server & leaving the auth filter in place hangs as before.

@jimmidyson

It's this line in AuthFilter that's hanging:

req.headerMap("Authorization") = s"Bearer $token"

@jimmidyson

Here's the full log.

@adleong
Member

adleong commented Apr 12, 2016

I've found a potential explanation. Does your auth token file end with a newline? Attempting to set a header value with a newline causes an exception. Unfortunately, that exception gets swallowed and not logged today, which makes this kind of thing very hard to debug. We'll definitely fix that in #272. In the meantime, try stripping out any newlines from your token file and see if that helps.
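
If it helps, one defensive option on our side (a sketch; authTokenFile here stands in for however the namer actually loads the file) would be to strip surrounding whitespace when reading the token:

  import scala.io.Source

  // A trailing newline in the token file would otherwise end up in the header
  // value and make the header write throw.
  val token = Source.fromFile(authTokenFile).mkString.trim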

@jimmidyson

You've got it - newline removed & it works great. Well, it's a newline char without a new line (just learnt about the eol option in vim). That is really annoying - sorry for the noise (but yeah, something in the logs would be great).

@wmorgan
Member

wmorgan commented Apr 12, 2016

Amazing.

@adleong
Member

adleong commented Apr 12, 2016

Glad to hear it! This definitely showcases the need for better error logging. Stay tuned.

@adleong adleong closed this as completed Apr 12, 2016
@adleong adleong removed the ready label Apr 12, 2016
@jimmidyson

Thanks for your help figuring this one out - now I can get on & play more :)

adleong added a commit that referenced this issue Feb 13, 2018
Each watch on the Kubernetes API made by the io.l5d.k8s namer uses a separate finagle client.  This causes stat name conflicts, which results in incorrect reporting of client stats.  It may also have an adverse effect on memory usage.

We switch the io.l5d.k8s namer to use a single finagle client for all watches on the Kubernetes API.

This isolation of watches to separate finagle clients was originally done to work around #83.  However, since then, the [underlying issue](twitter/finagle#471) has been [fixed in finagle](twitter/finagle#481), making this no longer necessary.

Signed-off-by: Alex Leong <[email protected]>
Tim-Brooks pushed a commit to Tim-Brooks/linkerd that referenced this issue Dec 20, 2018
* adding RBAC stuff for Prometheus, fixes linkerd/linkerd2#82
Signed-off-by: Christopher Schmidt [email protected]

* fixed ident
Signed-off-by: Christopher Schmidt [email protected]