linkerd doesn't get watch updates from namerd (likely a bug in http2 mesh) #2249
@thedebugger do you think this is related to #2251 at all?
@adleong they are separate issues. We have been seeing this a lot since we moved to http2. We have scripts that validate the state of the Namerds, and when this happens the Namerds' state is as expected. Right now the only way we recover is by restarting Linkerds (we restart 3-4 out of ~300 Linkerds a day), which is sad. If we can prioritize #2245, that would help: to start with, Namerd traces could give us clues about where things are going wrong. So far no logs/metrics provide enough visibility to narrow this bug down. Thoughts?
@thedebugger gotcha. yes, I agree that #2245 would be hugely helpful to debug this. have you already looked at the namer state and confirmed that the affected Linkerds are not getting the update from Namerd (while unaffected Linkerds do)?
I have the same issue here. My Linkerd version is 1.4.6. We retired some servers and updated the service addresses in Consul, but Linkerd still routes traffic to the retired servers' addresses. curl https://127.0.0.1:9990/client_state.json gives me the wrong address list.
We have the same issue with Linkerd 1.4.6 and Namerd talking to Marathon. After some services were restarted, Linkerd continues to use the old IPs while all Namerds are up to date.
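For anyone cross-checking this by hand, a small sketch of the comparison being described: diff Linkerd's view against what the registry currently advertises. The dict layout below is a simplifying assumption (the real client_state.json shape varies by version), and the addresses are made up:

```python
import json

def stale_addresses(client_state, live_addrs):
    """Return addresses Linkerd still routes to that are no longer registered.

    client_state: dict mapping client name -> list of "ip:port" strings
                  (a simplified stand-in for Linkerd's client_state.json,
                  whose exact shape varies by version).
    live_addrs:   set of "ip:port" strings currently registered in Consul.
    """
    stale = {}
    for client, addrs in client_state.items():
        gone = sorted(set(addrs) - live_addrs)
        if gone:
            stale[client] = gone
    return stale

# Example: Linkerd still holds one retired address for a hypothetical service-a.
state = json.loads('{"service-a": ["10.0.0.1:8080", "10.0.0.2:8080"]}')
live = {"10.0.0.2:8080", "10.0.0.3:8080"}
print(stale_addresses(state, live))  # {'service-a': ['10.0.0.1:8080']}
```

Running this periodically against each Linkerd's admin port is one way to detect stuck instances before traffic starts failing.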
@astryia I would say the mesh interface is what you should be using. Is there any chance you can update to the latest version and see whether the problem persists?
@zaharidichev ok, I'll try the latest version.
Hi, originally I wanted to file a new issue but found this one, so I would like to add some more details to it.

Our setup

We run our services in a large Nomad cluster with hundreds of VMs. Each VM has one Linkerd deployed to it, and any service running there connects to Linkerd on localhost. Namerd is deployed on random VMs and Linkerd discovers it via a Consul DNS lookup. When I say "restart Namerd", it is actually not a restart but a re-allocation: all of its instances move to random VMs and get new IPs. Here is a part of the Linkerd config:

# ...
- protocol: thrift
label: port-10001
thriftProtocol: binary
interpreter:
kind: io.l5d.namerd
dst: /$/inet/namerd.service.consul/4100
namespace: 10001
retry:
baseSeconds: 1
maxSeconds: 1 # we had to set this to 1 second because Linkerd had terrible recovery time once Namerd was down for even a few minutes
servers:
- port: 10001
ip: 0.0.0.0
thriftFramed: true
client:
thriftFramed: true
loadBalancer:
kind: ewma
maxEffort: 24
decayTimeMs: 10000
failureAccrual:
kind: io.l5d.successRate
successRate: 0
requests: 1000
backoff:
kind: constant
ms: 0
hostConnectionPool:
minSize: 0
idleTimeMs: 5000
# ...

By the way, we had to cap the exponential backoff at 1 second because of another nasty behavior we saw: Linkerd took far too long to recover once it lost the connection to its Namerd counterpart. An example to reproduce it: stop Namerd and relocate the remote service while something keeps sending requests to that service through Linkerd. Linkerd will hold on to the old address and return only errors. When Namerd comes back up, it takes 2 to 10 minutes for Linkerd to recover, depending on how long Namerd was down. By reducing the maxSeconds setting we lowered Linkerd's recovery time.

And the config of Namerd:

admin:
port: 9991
ip: 0.0.0.0
telemetry:
- kind: io.l5d.prometheus
path: /metrics
interfaces:
- kind: io.l5d.thriftNameInterpreter
port: 4100
ip: 0.0.0.0
- kind: io.l5d.httpController
port: 4321
namers:
- kind: io.l5d.consul
host: 127.0.0.1
port: 8500
includeTag: false
useHealthCheck: true
healthStatuses:
- passing
consistencyMode: stale
storage:
kind: io.l5d.inMemory
namespaces:
1: /svc => /#/io.l5d.consul/dc1/service-a # names and namespaces are made up just for the example
2: /svc => /#/io.l5d.consul/dc1/service-b
3: /svc => /#/io.l5d.consul/dc1/service-c

The problem

Every time we restart Namerd, some random Linkerd instances get "stuck" with old information about remote service addresses. Once we start releasing our services and they relocate to other VMs, getting new IPs and ports, some random Linkerd starts emitting service creation failures and the calling service gets the following error from Linkerd:

Simplified version of service communication:
I decided to catch this bug in production, since it is impossible to reproduce it in QA. Log of events:
tcpdump result of a "good" Linkerd

I was not able to decode the hex data in the TCP packets, but it was pretty clear which thrift method and arguments were being sent to Namerd. The "good" Linkerd instance was sending a lot of packets with
Namerd was replying to that message with the addresses and IPs of the newly released instances of

tcpdump of a "bad" Linkerd

The "bad" Linkerd instance for some reason did not send that to Namerd. At least not for the

Both Linkerd instances, "good" and "bad", had the same last log message at that time (Nov 18th):
Summary

So it looks like the "bad" instance of Linkerd got stuck with stale information about

We could never reproduce it in QA. It looks like some very rare condition, so you need a big cluster with lots of services moving around to catch it. We also noticed that we started running into this issue a lot more often once we grew our cluster by 25%. Previously it was not always the case that we ran into it, but now it happens almost every time Namerd is restarted.

Next steps

We suspect that the problem may be somewhere in the ThriftNamerClient (probably in the
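As a generic illustration of one mitigation for this class of bug (this is not Linkerd's actual implementation, just a sketch of the pattern): a client can pair a long-lived watch with a periodic full re-resolve, so that a silently dead watch self-heals within a bounded interval:

```python
import time

def watch_with_resync(watch_next, full_resolve,
                      resync_interval_s=300.0, clock=time.monotonic):
    """Yield address sets from a watch, guarded by a periodic full resync.

    watch_next(timeout) -> set of addresses, or None if no update arrived.
    full_resolve()      -> authoritative set of addresses.
    If the watch silently dies, staleness is bounded by resync_interval_s.
    """
    current = full_resolve()
    yield current
    last_sync = clock()
    while True:
        update = watch_next(timeout=1.0)
        if update is not None:
            current = update
            yield current
        if clock() - last_sync >= resync_interval_s:
            fresh = full_resolve()
            last_sync = clock()
            if fresh != current:  # the watch missed something; correct it
                current = fresh
                yield current

# Toy run with a fake clock and a scripted watch: the watch delivers one
# update and then goes silent; the resync catches the missed change.
fake_time = [0.0]
watch_updates = iter([None, {"10.0.0.2:80"}, None, None])
resolve_results = iter([{"10.0.0.1:80"}, {"10.0.0.3:80"}])

def fake_watch(timeout):
    fake_time[0] += 100.0  # pretend 100s pass per poll
    return next(watch_updates, None)

def fake_resolve():
    return next(resolve_results, {"10.0.0.3:80"})

g = watch_with_resync(fake_watch, fake_resolve,
                      resync_interval_s=300.0, clock=lambda: fake_time[0])
print(next(g))  # initial full resolve: {'10.0.0.1:80'}
print(next(g))  # watch update: {'10.0.0.2:80'}
print(next(g))  # resync corrects the silently dead watch: {'10.0.0.3:80'}
```

The trade-off is extra load on the control plane from the periodic resolves, which is why the interval should be generous rather than aggressive.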
Hi @ishubin! Thanks for all the detail in this report. Filing a new issue for this would be more appropriate than commenting on this one (#2249), because this issue describes a problem with the mesh interpreter while you are reporting a problem with the thrift interpreter. I think your next steps are good. The thrift Namerd interpreter is deprecated, partly because it can sometimes get stuck like this. The mesh interpreter (which uses gRPC) has proven to be far more reliable.
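For anyone making the switch described above, a minimal sketch of what the mesh setup might look like. The port (4262 is the documented default for the mesh interface) and the `/default` namespace root are assumptions to adapt to your own setup:

```yaml
# Namerd: expose the mesh (gRPC) interface
interfaces:
- kind: io.l5d.mesh
  port: 4262
  ip: 0.0.0.0

# Linkerd: point the interpreter at the mesh port instead of thrift
interpreter:
  kind: io.l5d.mesh
  dst: /$/inet/namerd.service.consul/4262
  root: /default
```

The `root` must name a namespace that exists in Namerd's storage, since the mesh interface scopes dtab lookups by that path.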
@adleong thanks for the info! Hopefully next week we will be able to try it out. Do you want me to move this comment to a separate issue? As for gRPC being more reliable, there is still one problem with it: it doesn't recover as fast as the thrift Namerd interpreter. In my tests, when I bring Namerd down, relocate the remote service, and later bring Namerd back up, the gRPC interpreter managed to recover only after 1-4 minutes (the timing was inconsistent). The same test with thrift Namerd and
Yes, I think moving this to a separate issue will help avoid confusion. A recovery time of 1-4 minutes is much longer than I would expect, although I will note that frequent relocation of Namerd is not something that this has been optimized for. The assumption has generally been that Namerd relocation is something that happens relatively infrequently. |
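The recovery times discussed above follow from exponential backoff arithmetic: the delay between reconnect attempts roughly doubles up to the configured maximum, so after Namerd has been down for a few minutes, the next attempt may itself be scheduled minutes out. A simplified sketch (jitter, which real retry policies typically add, is omitted):

```python
def backoff_delays(base, maximum, attempts):
    """Simplified exponential backoff: the delay doubles each attempt,
    capped at `maximum` (jitter omitted for clarity)."""
    delays = []
    d = base
    for _ in range(attempts):
        delays.append(d)
        d = min(d * 2, maximum)
    return delays

# With a 1s base and a high cap, the gap between attempts grows quickly:
print(backoff_delays(base=1, maximum=600, attempts=10))
# [1, 2, 4, 8, 16, 32, 64, 128, 256, 512]

# Capping maxSeconds at 1 (the workaround described earlier in the thread)
# keeps every retry one second apart:
print(backoff_delays(base=1, maximum=1, attempts=5))
# [1, 1, 1, 1, 1]
```

This is why lowering maxSeconds shortened recovery: without the cap, nine failed attempts already put the tenth more than eight minutes after the outage began.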
@adleong regarding "the thrift Namerd interpreter is deprecated": did we miss anything in the docs? https://api.linkerd.io/1.7.4/linkerd/index.html#namerd-thrift doesn't state anything about that.
@hsmade I looked through issues for info about the deprecation, but didn't find anything. This sounds like a documentation PR and we'd love your help if you're interested |
I do need more info then. Is just thrift deprecated, or also plain http? Is mesh the only one left?
Issue Type:
What happened:
We had an incident where 70 (out of 140) Linkerds were affected: they didn't get watch updates for the affected service from Namerd. We inspected the following things during the incident:
A few other things we observed during the incident:
All of the above leads us to the hypothesis that there is a bug in http2 mesh interpreter client.
What you expected to happen:
Linkerds should route traffic to expected destinations
How to reproduce it (as minimally and precisely as possible):
Unfortunately, we don't know the exact steps yet
Anything else we need to know?:
Please ask for the gist (because I'm too lazy to redact it) that contains Namerd bind snapshots (to validate that all Namerds were up to date) and Linkerd mesh interpreter snapshots before and after the Namerd restarts (to validate that some watches did update).
Environment: