linkerd doesn't get watch updates from namerd (likely a bug in http2 mesh) #2249

thedebugger opened this issue Mar 12, 2019 · 14 comments

@thedebugger

Issue Type:

  • Bug report

What happened:
We had an incident where 70 (out of 140) linkerds were affected: they didn't get watch updates for the affected service from namerd. We inspected the following during the incident:

  1. All of the linkerds were on the http2 mesh interpreter and the underlying connection was healthy as well - we checked this to make sure this bug is different from the one where the underlying http2 connection goes stale because of the missing http2 ping feature.
  2. We also verified that all namerds were up to date.
  3. We validated that linkerds were routing traffic to the service, but to the wrong destinations.
  4. We did a rolling restart of namerd, hoping linkerds would reconnect and get back up to date with namerds, but linkerds didn't recover.

A few other things we observed during the incident:

  1. 7-8 linkerds that were running v1.3.5 with the thriftmux interpreter didn't have this problem, but that could just be a coincidence.
  2. After we restarted namerds, some linkerd watches for other services did get up to date - lastUpdated in the mesh interpreter state (from the inspect endpoint) matched the time of the namerd restart.

All of the above leads us to the hypothesis that there is a bug in the http2 mesh interpreter client.
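
For context, the interpreter block in each of these linkerds looks roughly like this (the namerd address, port, and root namespace below are placeholders, not our real values):

interpreter:
  kind: io.l5d.mesh                       # the http2/gRPC mesh interpreter
  dst: /$/inet/namerd.example.com/4321    # placeholder address of namerd's mesh interface
  root: /default                          # placeholder namerd namespace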

What you expected to happen:
Linkerds should route traffic to the expected destinations.

How to reproduce it (as minimally and precisely as possible):
Unfortunately, we don't know the exact steps yet

Anything else we need to know?:
Please ask for the gist (because I'm too lazy to redact it) that contains namerd bind snapshots (to validate that all namerds were up to date) and linkerd mesh interpreter snapshots from before and after the namerd restarts (to validate that some watches did update).

Environment:

  • linkerd/namerd version, config files: linkerd and namerd v1.5.2, interpreter http2 mesh
  • Platform, version, and config files (Kubernetes, DC/OS, etc): k8s 1.10.5
  • Cloud provider or hardware configuration: GCP
@adleong
Member

adleong commented Apr 1, 2019

@thedebugger do you think this is related to #2251 at all?

@thedebugger
Author

@adleong they are separate issues. We have been seeing this a lot since we moved to http2. We have scripts that validate the state of namerds, and when this happens the namerds' state is as expected. The only way we recover right now is by restarting linkerds (we restart 3-4 out of ~300 linkerds a day), which is sad. If we can prioritize #2245, namerd traces would be helpful to start with; they could give us clues about where things are going wrong. So far no logs or metrics provide enough visibility to narrow this bug down. Thoughts?

@adleong
Member

adleong commented Apr 8, 2019

@thedebugger gotcha. yes, I agree that #2245 would be hugely helpful to debug this. have you already looked at the namer state and confirmed that the affected Linkerds are not getting the update from Namerd (while unaffected Linkerds do)?

@coldfire-x

I have the same issue here. My linkerd version is 1.4.6. We retired some servers and updated the service addresses in Consul, but linkerd still routes traffic to the retired servers' addresses.

curl http://127.0.0.1:9990/client_state.json gives me the wrong address list.

@astryia

astryia commented Feb 17, 2020

We have the same issue with Linkerd 1.4.6 and Namerd talking to Marathon. After some services were restarted, Linkerd continued to use the old IPs while all Namerds were up to date.
Currently we are using mesh; would you suggest that Thrift or HTTP is more stable?

@zaharidichev
Member

@astryia I would say the mesh interface is what you should be using. Is there any chance you can update to the latest version and see whether the problem persists?

@astryia

astryia commented Feb 17, 2020

@zaharidichev ok, I'll try the latest version.

@ishubin

ishubin commented Nov 25, 2020

Hi, originally I wanted to file a new issue but found this one, so I would like to add some more details to it.
We have also been suffering from this bug for quite a long time. We always run into it on random Linkerd instances after we restart Namerd and redeploy some services.

Our setup

We run our services in a large Nomad cluster with hundreds of VMs. Each VM has one Linkerd deployed to it, and any service running there connects to Linkerd on localhost. Namerd is deployed on random VMs and Linkerd discovers it using a Consul DNS lookup (e.g. namerd.service.consul:4100). We used to have 3 Namerd instances; we tried to scale up to 6, but the same problem remains.
Our services talk via Thrift. Linkerd and Namerd are configured with the io.l5d.namerd interpreter, so they also talk Thrift to each other.

When I say "restart Namerd", it is actually not a restart but a re-allocation: all of its instances move to random VMs and get new IPs.
The same goes for releasing services: it is a rolling update of the service instances, so they move to random VMs and are assigned random ports when they start.

Here is a part of the Linkerd config:

# ...
- protocol: thrift
  label: port-10001
  thriftProtocol: binary
  interpreter:
    kind: io.l5d.namerd
    dst: /$/inet/namerd.service.consul/4100
    namespace: 10001
    retry:
      baseSeconds: 1
      maxSeconds: 1 # we had to set this to 1 second because Linkerd had terrible recovery time once Namerd was down for even a few minutes
  servers:
  - port: 10001
    ip: 0.0.0.0
    thriftFramed: true
  client:
    thriftFramed: true
    loadBalancer:
      kind: ewma
      maxEffort: 24
      decayTimeMs: 10000
    failureAccrual:
      kind: io.l5d.successRate
      successRate: 0
      requests: 1000
      backoff:
        kind: constant
        ms: 0
    hostConnectionPool:
      minSize: 0
      idleTimeMs: 5000
# ...

By the way, we had to reduce the exponential backoff by capping it at 1 second because of another nasty behavior we saw: Linkerd took too long to recover if it lost the connection to its Namerd counterpart. An example to reproduce it: stop Namerd and relocate the remote service while something keeps requesting that service through Linkerd. Linkerd will hold on to the old address and return only errors, and when Namerd comes back up it takes from 2 to 10 minutes for Linkerd to recover, depending on how long we waited before bringing Namerd back up. By reducing the maxSeconds setting we lowered Linkerd's recovery time.
Originally we thought this would also fix the issue described here, but it didn't really help.

And the config of Namerd:

admin:
  port: 9991
  ip: 0.0.0.0

telemetry:
- kind: io.l5d.prometheus
  path: /metrics

interfaces:
- kind: io.l5d.thriftNameInterpreter
  port: 4100
  ip: 0.0.0.0
- kind: io.l5d.httpController
  port: 4321

namers:
- kind: io.l5d.consul
  host: 127.0.0.1
  port: 8500
  includeTag: false
  useHealthCheck: true
  healthStatuses:
  - passing
  consistencyMode: stale

storage:
  kind: io.l5d.inMemory
  namespaces:
      1: /svc => /#/io.l5d.consul/dc1/service-a  # names and namespaces are made up just for the example
      2: /svc => /#/io.l5d.consul/dc1/service-b
      3: /svc => /#/io.l5d.consul/dc1/service-c

The problem

Every time we restart Namerd, some random Linkerd instances get "stuck" with stale information about remote service addresses. Once we start releasing our services and they relocate to other VMs with new IPs and ports, some random Linkerd starts emitting service creation failures and the calling service gets the following error from Linkerd: com.twitter.finagle.naming.buoyant.RichConnectionFailedExceptionWithPath: Unable to establish connection to 10.x.x.x:25404. (the IPs and ports are random)

Simplified version of the service communication: Service A calls Service B via Linkerd using the Thrift protocol:

+-----------+  thrift  +---------+  thrift  +-----------+
| Service A | <------> | Linkerd | <------> | Service B |
+-----------+          +---------+          +-----------+

I decided to catch this bug in production, since it is impossible to reproduce in QA.

Log of events:

  1. Nov 18th 16:40 - Service B was released in Nomad, so all of its instances (it has dozens of them) moved to other VMs and got new IPs and ports.
  2. Nov 18th 16:50 - Namerd was redeployed (6 instances), so it also moved to new VMs and got new IPs. Namerd's config.yaml was not updated and we did not add any new services, so it was just a regular restart.
  3. Nov 19th 09:39 - Service B was released in Nomad, and again all of its 81 instances got new IPs and ports.
  4. Nov 19th 09:45 - Service A started to produce exceptions like com.twitter.finagle.naming.buoyant.RichConnectionFailedExceptionWithPath: Unable to establish connection to 10.x.x.x:25404. but only on one VM (and therefore only one Linkerd, since we deploy one Linkerd per host).
  5. I downloaded all error logs of Service A, extracted the IPs and ports it was complaining about, and checked them against the IPs and ports of Service B. They all matched the IPs that Service B had on Nov 18th, i.e. before its release on Nov 19th. So it looks like this particular Linkerd instance did not get any updates for Service B.
  6. Nov 19th 10:00 - I decided to start a tcpdump on port 4100 (which is used by Namerd for the thrift interpreter) on two VMs with Linkerd running on them: a "good" one and a "broken" one (the one where we saw the service creation failures and the exceptions above). The command is sketched just after this list.
  7. Nov 19th 10:05 - Redeployed Service B to trigger yet another relocation of all of its instances.
  8. When Service B finished its release, I stopped tcpdump on both VMs and looked into the captures.
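
For completeness, the capture on each VM was roughly the following command (the interface and output path here are illustrative, not the exact invocation):

tcpdump -i any -w /tmp/namerd-thrift-4100.pcap 'tcp port 4100'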

tcpdump result of a "good" Linkerd

I was not able to fully decode the hex data in the TCP packets, but it was pretty clear which thrift method and args were being sent to Namerd.

The "good" Linkerd instance was sending a lot of packets with the addr message to Namerd. Those packets contained the dtab path and namespace of Service B. It looks like this (I replaced the real service name with a made-up one):

.....add r.......
........ ..q.[...
........ ......#.
...io.l5 d.consul
....dc1. ...#just
an-examp le-servi
ce-be... ........
......23314.....

Namerd was replying to that message with the IPs and ports of the newly released instances of Service B. Since I could not decode the thrift message, I at least managed to find all those addresses using a Wireshark filter like this: data contains 0A:00:00:01 and data contains 6D:AD (this IP is fake, but I was searching for every combination of Service B IPs and ports that I knew at that time).
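
In case it helps someone else do the same kind of search, here is roughly how the byte patterns for that filter can be derived from an instance address (the address below is made up, just like the one above):

# build the "data contains" byte patterns for a made-up instance 10.0.0.1:28077
ip="10.0.0.1"; port=28077
printf '%02X:%02X:%02X:%02X\n' ${ip//./ }            # IP octets  -> 0A:00:00:01
printf '%02X:%02X\n' $((port >> 8)) $((port & 255))  # port bytes -> 6D:AD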

tcpdump of a "bad" Linkerd

The "bad" Linkerd instance for some reason did not send that to Namerd, at least not for Service B. In all the packets that contained the addr thrift method, it was only requesting a completely different service. At the same time, however, there were services trying to talk to Service B via this Linkerd, and they kept getting errors.

Both the "good" and the "bad" Linkerd instances had the same last log message at that time (from Nov 18th):

E 1118 16:52:26.846 GMT THREAD529 TraceId:41b99eb3a22972c7: addr on /#/io.l5d.consul/dc1/service-b
Failure(syscall:getsockopt(..) failed: Connection refused: namerd.service.consul/10.x.x.x:4100 at remote address: namerd.service.consul/10.x.x.x:4100. Remote Info: Not Available, flags=0x18) with RemoteInfo -> Upstream Address: /127.0.0.1:42708, Upstream id: Not Available, Downstream Address: namerd.service.consul/10.x.x.x:4100, Downstream label: interpreter/io.buoyant.namerd.iface.NamerdInterpreterConfig, Trace Id: 554952f66844976c.554952f66844976c<:554952f66844976c with Service -> interpreter/io.buoyant.namerd.iface.NamerdInterpreterConfig
Caused by: com.twitter.finagle.ConnectionFailedException: syscall:getsockopt(..) failed: Connection refused: namerd.service.consul/10.x.x.x:4100 at remote address: namerd.service.consul/10.x.x.x:4100. Remote Info: Not Available
	at com.twitter.finagle.netty4.ConnectionBuilder$$anon$1.operationComplete(ConnectionBuilder.scala:100)
	at com.twitter.finagle.netty4.ConnectionBuilder$$anon$1.operationComplete(ConnectionBuilder.scala:79)
	at io.netty.util.concurrent.DefaultPromise.notifyListener0(DefaultPromise.java:511)
	at io.netty.util.concurrent.DefaultPromise.notifyListeners0(DefaultPromise.java:504)
	at io.netty.util.concurrent.DefaultPromise.notifyListenersNow(DefaultPromise.java:483)
	at io.netty.util.concurrent.DefaultPromise.notifyListeners(DefaultPromise.java:424)
	at io.netty.util.concurrent.DefaultPromise.tryFailure(DefaultPromise.java:121)
	at io.netty.channel.epoll.AbstractEpollChannel$AbstractEpollUnsafe.fulfillConnectPromise(AbstractEpollChannel.java:629)
	at io.netty.channel.epoll.AbstractEpollChannel$AbstractEpollUnsafe.finishConnect(AbstractEpollChannel.java:648)
	at io.netty.channel.epoll.AbstractEpollChannel$AbstractEpollUnsafe.epollOutReady(AbstractEpollChannel.java:522)
	at io.netty.channel.epoll.EpollEventLoop.processReady(EpollEventLoop.java:423)
	at io.netty.channel.epoll.EpollEventLoop.run(EpollEventLoop.java:330)
	at io.netty.util.concurrent.SingleThreadEventExecutor$5.run(SingleThreadEventExecutor.java:897)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
	at com.twitter.finagle.util.BlockingTimeTrackingThreadFactory$$anon$1.run(BlockingTimeTrackingThreadFactory.scala:23)
	at io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30)
	at java.lang.Thread.run(Thread.java:819)
Caused by: io.netty.channel.AbstractChannel$AnnotatedConnectException: syscall:getsockopt(..) failed: Connection refused: namerd.service.consul/10.x.x.x:4100
	at io.netty.channel.unix.Socket.finishConnect(..)(Unknown Source)
Caused by: io.netty.channel.unix.Errors$NativeConnectException: syscall:getsockopt(..) failed: Connection refused
	... 1 more

Summary

So it looks like the "bad" instance of Linkerd got stuck with stale information about Service B and stopped requesting any updates for it. It did request updates for another service, though. To fix this we had to restart that Linkerd, and then it started working correctly.

We could never reproduce it in QA. It looks like a very rare condition, so you need a big cluster with lots of services moving around to catch it. We also noticed that we started running into this issue a lot more often once we grew our cluster by 25%. Previously we did not always hit it after a Namerd restart; now it happens almost every time Namerd is restarted.

Next steps

We suspect that the problem may be somewhere in the ThriftNamerClient (probably in the loop method). We are going to try switching to the mesh interpreter to see if it fixes the problem.

@adleong
Member

adleong commented Nov 25, 2020

Hi @ishubin! Thanks for all the detail in this report. Filing a new issue for this is more appropriate than commenting on this one (#2249) because this issue describes a problem with the mesh interpreter but you are reporting a problem with the thrift interpreter.

I think your next steps are good: the thrift Namerd interpreter is deprecated, partly because it can sometimes get stuck like this. The mesh interpreter (which uses gRPC) has proven to be far more reliable.
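
Roughly, the switch looks like this; treat it as a sketch and adapt the ports, address, and root path to your setup:

# namerd: add the mesh (gRPC) interface next to the existing ones
interfaces:
- kind: io.l5d.mesh
  ip: 0.0.0.0
  port: 4400   # example port; pick one that doesn't clash with your other interfaces

# linkerd: replace the io.l5d.namerd interpreter with io.l5d.mesh
interpreter:
  kind: io.l5d.mesh
  dst: /$/inet/namerd.service.consul/4400   # example; the same Consul DNS name you already use
  root: /default                            # example; a path naming the namerd namespace to use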

@ishubin

ishubin commented Nov 25, 2020

@adleong thanks for the info! Hopefully next week we will be able to try it out. Do you want me to move this comment to a separate issue?

As for gRPC being more reliable, there is still one problem with it: it doesn't recover as fast as the thrift Namerd interpreter. In my tests, when I bring Namerd down, relocate the remote service, and later bring Namerd back up, the gRPC interpreter manages to recover only after 1-4 minutes (the timing is inconsistent). The same test with the thrift Namerd interpreter and maxSeconds set to 1 in retry gives a consistent recovery within 1 second once Namerd becomes healthy in Consul. By recovering I mean that Linkerd updates the addresses of the remote service and its clients stop receiving errors.
I couldn't find any settings for the gRPC interpreter to play with. Maybe you could point me to some properties I could experiment with?

@adleong
Member

adleong commented Nov 25, 2020

Yes, I think moving this to a separate issue will help avoid confusion.

A recovery time of 1-4 minutes is much longer than I would expect, although I will note that frequent relocation of Namerd is not something that this has been optimized for. The assumption has generally been that Namerd relocation is something that happens relatively infrequently.

@hsmade

hsmade commented Nov 26, 2020

@adleong regarding "the thrift Namerd interpreter is deprecated", did we miss anything in the docs? https://api.linkerd.io/1.7.4/linkerd/index.html#namerd-thrift doesn't state anything about that.

@cpretzer
Contributor

cpretzer commented Dec 1, 2020

@hsmade I looked through issues for info about the deprecation, but didn't find anything.

This sounds like a documentation PR, and we'd love your help if you're interested.

@hsmade

hsmade commented Dec 1, 2020 via email
