Kubernetes namespace watches sometimes hang indefinitely #83

Closed
olix0r opened this issue Feb 9, 2016 · 20 comments

@olix0r
Member

olix0r commented Feb 9, 2016

We've observed a bug in the kubernetes namer where name resolution (and requests) hang indefinitely when initializing a watch on a second namespace.

For example, in a test cluster, we launch an application into the prod namespace and do continuous deployment to a pre-prod namespace that may be accessed with an override.

The configuration contains:

namers:
- kind: io.l5d.experimental.k8s
  prefix: /ns
  tlsWithoutValidation: true
  authTokenFile: /var/run/secrets/kubernetes.io/serviceaccount/token

and a base dtab like:

  /iface => /io.l5d.k8s/prod;
  /method => /$/io.buoyant.http.anyMethodPfx/iface ;
  /http/1.1 => /method ;
  ...

Requests are routed to the pre-prod namespace with an override like:

/iface => /io.l5d.k8s/pre-prod | /$/fail;
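
For illustration, assuming this build honors per-request dtab overrides via an override header (whether that is Finagle's Dtab-Local header or linkerd's l5d-dtab header here is an assumption on my part), the override can be exercised with something like:

  curl -H 'l5d-dtab: /iface => /io.l5d.k8s/pre-prod | /$/fail;' http://localhost:4140/some/path

where localhost:4140 and the request path are placeholders for the router's server address and a real route.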

Initially, we observe requests through prod operating as expected, resolving through the k8s namer:

D 0126 23:45:28.991 THREAD29: k8s initializing service0
...
D 0126 23:45:29.327 THREAD30: k8s ns prod initial state: service0
...
D 0126 23:45:42.955 THREAD30: k8s lookup: /pre-prod/http/service0 /pre-prod/http/service0
D 0126 23:45:42.956 THREAD30: k8s initializing pre-prod

We see no further k8s logging after this. Furthermore, because the namer never satisfies this lookup, the request never succeeds and remains pending until the client disconnects.

Something appears to be wrong in the kubernetes namer that prevents this second watch from succeeding.

@olix0r olix0r added the bug label Feb 9, 2016
@olix0r olix0r added this to the 0.0.11 milestone Feb 9, 2016
@olix0r olix0r added the ready label Feb 9, 2016
@gtcampbell gtcampbell removed this from the 0.0.12 milestone Feb 22, 2016
olix0r added a commit that referenced this issue Feb 25, 2016
Currently, the k8s namer builds a single client at configuration time.  In
theory, this is fine, but we have observed odd hanging behavior, as described
in #83.

Anecdotally, establishing a new client for each namespace has resolved this
issue in the past.  This is not a real _fix_, but it may offer some relief
while we establish a more robust one.
@olix0r olix0r removed the ready label Feb 25, 2016
@siggy
Member

siggy commented Feb 26, 2016

This issue no longer manifests following #137, though an underlying bug may still exist.

@siggy
Member

siggy commented Feb 29, 2016

Reported issue to finagle:
twitter/finagle#471

@siggy siggy closed this as completed Feb 29, 2016
@siggy siggy removed the in progress label Feb 29, 2016
@jimmidyson

I'm still seeing this issue with 0.3.0. Any tips on the best way to debug this?

@olix0r
Member Author

olix0r commented Apr 11, 2016

@jimmidyson if you run linkerd with the -log.level=DEBUG and -com.twitter.finagle.tracing.debugTrace=true flags, your logs should be a bit more verbose.

If you can grab a snapshot of the stats on :9990/admin/metrics.json when the process is in this state, that would be helpful, too.
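
Concretely, assuming the self-executing linkerd-0.3.0-exec release artifact (the artifact name and config path below are placeholders), that looks something like:

  ./linkerd-0.3.0-exec -log.level=DEBUG -com.twitter.finagle.tracing.debugTrace=true path/to/config.yaml
  curl localhost:9990/admin/metrics.json > metrics.json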

Also, we'll be better able to help if you share your linkerd config (especially your dtab) -- this is a common source of confusion.

Finally, would you mind explaining how this state arises? What version of Kubernetes are you running, etc.?

Thanks!

@jimmidyson

Thanks for the tips. It's very possible this is just a misunderstanding about the dtab - I've found it slightly confusing, so I've tried to make it as simple as possible:

admin:
  port: 9990

namers:
- kind: io.l5d.experimental.k8s
  host: localhost
  port: 8443
  tls: true
  tlsWithoutValidation: true
  authTokenFile: /home/jdyson/tmp/token

routers:
- protocol: http
  baseDtab: |
    /http/1.1/GET => /io.l5d.k8s/zipkin/http;
  label: int
  servers:
  - port: 4140
    ip: 0.0.0.0

I'm running curl -H 'Host: zipkin-web-console' localhost:4140. The service zipkin-web-console does exist in the namespace zipkin with the named port http. It's almost like it's not querying at all.
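
For reference, my understanding of the rewrite that request should go through is roughly:

  /http/1.1/GET/zipkin-web-console             (from the method and Host header)
  /io.l5d.k8s/zipkin/http/zipkin-web-console   (after the baseDtab rule)
  k8s lookup of port http on service zipkin-web-console in namespace zipkin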

Here's the (hopefully) relevant part of the logs:

D 0411 21:25:53.631 THREAD19: k8s lookup: /zipkin/http/zipkin-web-console /zipkin/http/zipkin-web-console
D 0411 21:25:53.649 THREAD19: k8s initializing zipkin
0411 21:25:53.652 36e4dfa7003f8c8b.36e4dfa7003f8c8b<:36e4dfa7003f8c8b] BinaryAnnotation(namer.dtab.base,/http/1.1/GET=>/io.l5d.k8s/zipkin/http)
0411 21:25:53.652 36e4dfa7003f8c8b.36e4dfa7003f8c8b<:36e4dfa7003f8c8b] BinaryAnnotation(namer.dtab.local,)
0411 21:25:53.652 36e4dfa7003f8c8b.36e4dfa7003f8c8b<:36e4dfa7003f8c8b] BinaryAnnotation(namer.path,/http/1.1/GET/zipkin-web-console)

Here's the output from :9990/admin/metrics.json: https://gist.github.com/jimmidyson/c05a0e7959b52a5dde50bb2837ea3782

If it does turn out to be a config problem, hanging isn't great, of course.

@wmorgan
Member

wmorgan commented Apr 11, 2016

Sounds like a proper bug. Reopening and assigning to Oliver.

@wmorgan wmorgan reopened this Apr 11, 2016
@wmorgan wmorgan assigned olix0r and unassigned siggy Apr 11, 2016
@wmorgan
Member

wmorgan commented Apr 11, 2016

After some internal discussion, Alex will take a look at this.

@wmorgan wmorgan assigned adleong and unassigned olix0r Apr 11, 2016
@adleong
Member

adleong commented Apr 12, 2016

Hi @jimmidyson, thanks for the info. This behavior can happen if linkerd is unable to connect to the k8s API. We've got a couple of issues tracking making this situation easier to debug, including adding better logging (#272) and adding a timeout for name resolution (#273).

In your particular case it looks like you've got the k8s namer configured to talk to localhost. Do you have the k8s master running on localhost? Or are you running through a proxy? In the case of a proxy, it's likely that TLS won't work and you'll need to turn that off in your config.

Can you check if you're able to hit the k8s API manually on localhost:8443 over https?

@jimmidyson

@adleong I'm running this just for dev - local builds from master, running locally - and I have Kubernetes (well, OpenShift actually) running locally too, accessible at https://localhost:8443. TLS works, but with cert validation disabled (as I tried to do with the config). So yes, curl https://localhost:8443 works fine. I've also tried setting up a little proxy to capture requests and can see that no requests reach the proxy on localhost. The timeout for name resolution is really important - it would help with quick diagnosis of issues like this, but in production consumers would also want a quick response if lookups fail, just like DNS failures.

I'm trying to debug through the namer, but being a bit of a n00b to Scala makes it a bit difficult. One thing I have noticed is that the filters don't seem to work properly. I added a quick System.out (yeah, dirty but effective :-b) to both the SetHostFilter & the AuthFilter, & the SetHostFilter doesn't get called for some reason:

Authfilter Request("GET /api/v1/namespaces/zipkin/endpoints", from 0.0.0.0/0.0.0.0:0)

When I swap around the order of the filtered calls in namer/k8s/src/main/scala/io/l5d/k8s.scala, it does actually call the SetHostFilter too:

SetHostFilter Request("GET /api/v1/namespaces/zipkin/endpoints", from 0.0.0.0/0.0.0.0:0)
Authfilter Request("GET /api/v1/namespaces/zipkin/endpoints", from 0.0.0.0/0.0.0.0:0)

So could it be something odd going on in the filter stack?

@jimmidyson

So testing the calls via curl:

curl -k -H "Authorization: Bearer $(cat ~/tmp/token)" https://localhost:8443/api/v1/namespaces/zipkin/endpoints

works fine, so I know both the API server URL & the token file are correct.

@adleong
Member

adleong commented Apr 12, 2016

thanks for digging into this, @jimmidyson!

Agreed that timeouts are important; we're adding them soon. As for the behavior you discovered with the filters: that's very strange. It suggests that something is going wrong in the AuthFilter, but if you take a look at the contents of that filter, it's pretty simple.
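
For reference, a bearer-token filter of that shape is roughly the following (a sketch, not the exact source):

  import com.twitter.finagle.{Service, SimpleFilter}
  import com.twitter.finagle.http.{Request, Response}
  import com.twitter.util.Future

  // Adds the bearer token to every request sent to the k8s API.
  class AuthFilter(token: String) extends SimpleFilter[Request, Response] {
    def apply(req: Request, service: Service[Request, Response]): Future[Response] = {
      req.headerMap("Authorization") = s"Bearer $token"
      service(req)
    }
  }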

We're going to try setting up Kubernetes locally to see if we can reproduce the issue. In the meantime, I encourage you to keep digging and let us know if you find anything.

@adleong
Member

adleong commented Apr 12, 2016

Could you also post the full linkerd log when running with -log.level=DEBUG and -com.twitter.finagle.tracing.debugTrace=true and exercising the issue?

@jimmidyson

I've turned off auth in the API server & removed the auth filter... & querying works... I cannot explain that at all - as you said, the AuthFilter is ridiculously simple.

Disabling auth in API server & leaving the auth filter in place hangs as before.

@jimmidyson

It's this line in AuthFilter that's hanging:

req.headerMap("Authorization") = s"Bearer $token"

@jimmidyson

Here's the full log.

@adleong
Member

adleong commented Apr 12, 2016

I've found a potential explanation. Does your auth token file end with a newline? Attempting to set a header value with a newline causes an exception. Unfortunately, that exception gets swallowed and not logged today, which makes this kind of thing very hard to debug. We'll definitely fix that in #272. In the meantime, try stripping out any newlines from your token file and see if that helps.
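
If it helps, one defensive option on our side (a sketch; authTokenFile here stands in for however the namer actually loads the file) would be to strip surrounding whitespace when reading the token:

  import scala.io.Source

  // A trailing newline in the token file would otherwise end up in the header
  // value and make the header write throw.
  val token = Source.fromFile(authTokenFile).mkString.trim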

@jimmidyson

You've got it - newline removed & it works great. Well, it's a newline char without a new line (just learnt about the eol option in vim). That is really annoying - sorry for the noise (but yeah, something in the logs would be great).

@wmorgan
Member

wmorgan commented Apr 12, 2016

Amazing.

@adleong
Member

adleong commented Apr 12, 2016

Glad to hear it! This definitely showcases the need for better error logging. Stay tuned.

@adleong adleong closed this as completed Apr 12, 2016
@adleong adleong removed the ready label Apr 12, 2016
@jimmidyson

Thanks for your help figuring this one out - now I can get on & play more :)

adleong added a commit that referenced this issue Feb 13, 2018
Each watch on the Kubernetes API made by the io.l5d.k8s namer uses a separate finagle client.  This causes stat name conflicts, which results in incorrect reporting of client stats.  It may also have an adverse effect on memory usage.

We switch the io.l5d.k8s namer to use a single finagle client for all watches on the Kubernetes API.

This isolation of watches to separate finagle clients was originally done to work around #83.  However, since then, the [underlying issue](twitter/finagle#471) has been [fixed in finagle](twitter/finagle#481), making this no longer necessary.

Signed-off-by: Alex Leong <[email protected]>
Tim-Brooks pushed a commit to Tim-Brooks/linkerd that referenced this issue Dec 20, 2018
* adding RBAC stuff for Prometheus, fixes linkerd/linkerd2#82
Signed-off-by: Christopher Schmidt [email protected]

* fixed ident
Signed-off-by: Christopher Schmidt [email protected]