Linkerd sporadically stops watching remote addresses in Namerd with thrift interpreter #2411
Comments
Hi @ishubin, thanks for the detail in this report. What you describe sounds a bit like this issue. Which version of Linkerd are you using? If you're not using 1.7.4, I suggest upgrading to see if that addresses the issue.
@cpretzer Sorry, I forgot to put that in. We use Linkerd 1.7.1. We are going to upgrade, but it will take some time. Just to mention: this bug has existed in every version of Linkerd we have used over the last 2 years, from the version we started with. Each time we thought we should just upgrade. Back then our clusters were also smaller, so we were less likely to run into this bug. Now, due to the size of our Nomad clusters, we run into it on every Namerd restart.
Thanks for the additional info @ishubin. Please keep this issue updated as you roll out the upgrade.
Issue Type:
What happened:
Linkerd with the thrift interpreter gets stuck with old addresses for a service and no longer receives updated information from Namerd.
Our setup
We run our services in a large Nomad cluster with hundreds of VMs. Each VM has one Linkerd deployed to it, and any service running there connects to Linkerd on localhost. Namerd is deployed on random VMs, and Linkerd discovers it via a Consul DNS lookup (e.g. `namerd.service.consul:4100`). We used to have 3 Namerd instances; we tried to scale up to 6, but the same problem remains.

Our services talk via Thrift. Linkerd and Namerd are configured with the `io.l5d.namerd` interpreter, so they also talk Thrift to each other.

When I say "restart Namerd", it is actually not a restart but a re-allocation: all of its instances move to random VMs and get new IPs. The same goes for releasing services: a rolling update moves service instances to random VMs, and they are assigned random ports when they start.
Here is a part of the Linkerd config:
By the way, we had to reduce the exponential backoff, capping it at 1 second, because of another nasty behavior we saw: Linkerd took too long to recover after losing the connection to its Namerd counterpart. To reproduce it: stop Namerd and relocate the remote service while something is requesting that service through Linkerd. Linkerd will hold the old address and return only errors, and once Namerd comes back up it takes 2 to 10 minutes for Linkerd to recover, depending on how long Namerd was down. Reducing that maxSeconds setting lowered Linkerd's recovery time.

Originally we thought that would also fix this issue, but it didn't really help here.
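The config block itself did not survive in this excerpt; for context, a minimal sketch of what a Linkerd 1.x router using the `io.l5d.namerd` interpreter with a capped retry backoff might look like (the namespace, ports, and address are illustrative assumptions, not the actual production values):

```yaml
# Hypothetical sketch -- namespace, ports, and Namerd address are assumed.
routers:
- protocol: thrift
  interpreter:
    kind: io.l5d.namerd
    dst: /$/inet/namerd.service.consul/4100
    namespace: default
    retry:
      baseSeconds: 1
      maxSeconds: 1   # capped at 1s to shorten Linkerd's recovery time
  servers:
  - ip: 0.0.0.0
    port: 4114
```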
And the config of Namerd:
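The Namerd config is likewise missing from this excerpt; a minimal sketch of a Namerd setup serving the thrift interpreter interface on port 4100 might look like this (the storage backend and namer choices are assumptions):

```yaml
# Hypothetical sketch -- storage backend and namers are assumed.
storage:
  kind: io.l5d.consul
interfaces:
- kind: io.l5d.thriftNameInterpreter
  ip: 0.0.0.0
  port: 4100
namers:
- kind: io.l5d.consul
```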
The problem
Every time we restart Namerd, some random Linkerd instances get "stuck" with old remote service addresses. Once we start releasing our services and they relocate to other VMs with new IPs and ports, some random Linkerd starts emitting service creation failures, and the calling service gets the following error from Linkerd:
com.twitter.finagle.naming.buoyant.RichConnectionFailedExceptionWithPath: Unable to establish connection to 10.x.x.x:25404.
(the IP and port are random)

Simplified version of the service communication: Service A calls Service B via Linkerd using the Thrift protocol.

I decided to catch this bug in production, since it is impossible to reproduce in QA.
Log of events:

- Service B was released in Nomad, so all of its instances (it has dozens of them) moved to other VMs and got new IPs and ports.
- Service B was released in Nomad again, and all of its 81 instances got new IPs and ports.
- Service A started to produce exceptions like com.twitter.finagle.naming.buoyant.RichConnectionFailedExceptionWithPath: Unable to establish connection to 10.x.x.x:25404. but only on one VM (and therefore one Linkerd, since we do per-host Linkerd deployments).
- I took the errors from Service A, extracted the IP+port it was complaining about, and checked them against the IPs+ports of Service B. They all matched the IPs Service B had on Nov 18th, i.e. before its release on Nov 19th. So it looks like this particular Linkerd instance did not get any updates for Service B.
- I started tcpdump on both the "good" and the "bad" VM and released Service B again to trigger yet another relocation of all of its instances.
- Once Service B finished its release, I stopped tcpdump on both VMs and looked into the captures.

tcpdump result of a "good" Linkerd
I was not able to decode the hex data in the TCP packets, but it was pretty clear which thrift method and arguments Linkerd was sending to Namerd.
The "good" Linkerd instance was sending a lot of packets with an addr message to Namerd. Those packets contained the dtab path and namespace of Service B. It looks like this (I changed the real name of the service to Service B):

Namerd was replying to that message with the addresses and IPs of the newly released instances of Service B. Since I could not decode the thrift message, I at least managed to find all those addresses using a Wireshark filter like this: data contains 0A:00:00:01 and data contains 6D:AD (this IP is fake, but I was searching for all combinations of Service B IPs and ports that I knew at that time).
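Such `data contains` filters can be tedious to build by hand; a small helper sketch (not part of the original report; the 10.0.0.1 address and port 28077 are illustrative) converts an IP:port pair into the colon-separated hex bytes Wireshark matches on:

```python
def ip_to_hex(ip: str) -> str:
    """Render a dotted-quad IPv4 address as colon-separated hex bytes."""
    return ":".join(f"{int(octet):02X}" for octet in ip.split("."))

def port_to_hex(port: int) -> str:
    """Render a TCP port as two colon-separated hex bytes (big-endian)."""
    return ":".join(f"{b:02X}" for b in port.to_bytes(2, "big"))

# Build a Wireshark display filter for a hypothetical instance 10.0.0.1:28077:
print(f"data contains {ip_to_hex('10.0.0.1')} and data contains {port_to_hex(28077)}")
# -> data contains 0A:00:00:01 and data contains 6D:AD
```

Running this for every known IP:port of Service B gives a set of filters to check each capture against.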
The "bad" Linkerd instance for some reason did not send that to Namerd, at least not for Service B. In all the packets that contained the addr thrift method, it was requesting only a completely different service. At the same time, however, there were services trying to talk to Service B via this Linkerd, and they kept getting errors.

Both Linkerd instances, "good" and "bad", had the same last log message at that time (Nov 18th):
Summary
So it looks like the "bad" instance of Linkerd got stuck with stale information about Service B and stopped requesting any updates for it (though it did request updates about another service). To fix this we had to restart that Linkerd, after which it started working correctly.

We could never reproduce this in QA. It seems to be a very rare condition, so you need a big cluster with lots of services moving around to catch it. We also noticed that we started running into this issue much more often once we grew our cluster by 25%. Previously we did not always hit it; now we hit it almost every time Namerd is restarted.
Next steps
We suspect that the problem may be somewhere in ThriftNamerClient (probably in the loop method). We are going to try switching to the mesh interpreter to see if it fixes the problem.
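For reference, switching to the mesh interpreter would mean pointing Linkerd at Namerd's gRPC mesh interface instead of the thrift one; a sketch (the port and root namespace here are assumptions, not tested values):

```yaml
# Hypothetical sketch -- mesh interface port and root namespace are assumed.
interpreter:
  kind: io.l5d.mesh
  dst: /$/inet/namerd.service.consul/4321
  root: /default
```

On the Namerd side this would also require serving an `io.l5d.mesh` interface alongside (or instead of) `io.l5d.thriftNameInterpreter`.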