Consul Namer doesn't update Mesh Interpreter when the Consul payload is significantly large #2363
Comments
So far I know two things for sure:
What I think:
@zaharidichev thanks for taking a look at this. I'll continue to look at how the H2 client is used by the mesh interface. A couple of things that I'm suspicious of are:
I do not think the WINDOW_UPDATE calls are a reason for any concern. I am currently debugging that, and it seems to me that only one H2 data frame is ever sent from namerd to linkerd no matter how big the payload is, which might be because this is always an UnaryToUnary H2 mode. I'm not sure; this is all hypothetical at the moment. I will try to get to the bottom of it, but this is almost certainly a problem of the H2 interface. Will keep you posted :)
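To check how many DATA and WINDOW_UPDATE frames actually cross the wire, the frames in a capture can be counted directly. The following is a minimal sketch, not code from Linkerd or Namerd. It assumes the HTTP/2 traffic is plaintext, that the TCP stream has already been reassembled into one byte string, and that the client connection preface has been stripped. The names `Frame` and `iter_frames` are made up for this example; the 9-byte frame header layout comes from the HTTP/2 specification.

```python
import struct
from typing import Iterator, NamedTuple

# Frame type codes defined by the HTTP/2 specification (RFC 7540, section 6).
FRAME_TYPES = {
    0x0: "DATA",
    0x1: "HEADERS",
    0x3: "RST_STREAM",
    0x4: "SETTINGS",
    0x6: "PING",
    0x7: "GOAWAY",
    0x8: "WINDOW_UPDATE",
    0x9: "CONTINUATION",
}


class Frame(NamedTuple):
    length: int
    type_name: str
    flags: int
    stream_id: int
    payload: bytes


def iter_frames(buf: bytes) -> Iterator[Frame]:
    """Yield every complete HTTP/2 frame found in buf (hypothetical helper)."""
    offset = 0
    while offset + 9 <= len(buf):
        # Header: 24-bit length, 8-bit type, 8-bit flags, 1 reserved bit + 31-bit stream id.
        hi, lo, ftype, flags, sid = struct.unpack_from(">BHBBI", buf, offset)
        length = (hi << 16) | lo
        sid &= 0x7FFFFFFF  # drop the reserved bit
        end = offset + 9 + length
        if end > len(buf):
            break  # trailing partial frame; wait for more bytes
        name = FRAME_TYPES.get(ftype, hex(ftype))
        yield Frame(length, name, flags, sid, buf[offset + 9:end])
        offset = end
```

Counting `frame.type_name == "DATA"` per `stream_id` over the bytes sent by namerd would show whether the payload really goes out as a single DATA frame or as several.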
@zaharidichev I continued to debug this today and made progress that confirms our conversation with @adleong today about this being a Codec issue. The communication is definitely UnaryToStream, based on this snippet from the generated the
I didn't get a chance to test with the Scala server, but a different test that I ran was to configure the the This indicates to me that the interface is not getting an ACK from the interpreter for the payload that causes the failure. In my tests, this TCP payload is never larger than 40kb, so I don't think that this is an aggregate size issue; again it seems more like how the Codec is decoding the streams and the frames. Here is a snippet of the configuration that you can use in
The I tried similar configurations of the This evening, I've been looking through There may be an issue with the frames not being assembled in the right order, but I haven't found the code that handles that. @adleong can you point me in the right direction?
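For context on what "decoding the streams and the frames" involves: gRPC prefixes each message with a 1-byte compressed flag and a 4-byte big-endian length, and a single message may be split across several HTTP/2 DATA frames. A decoder therefore has to buffer bytes until a whole message is available. The sketch below illustrates that reassembly step only. It is an assumption-laden illustration, not the Linkerd codec, and `GrpcMessageBuffer` is a hypothetical name.

```python
import struct
from typing import List


class GrpcMessageBuffer:
    """Reassemble length-prefixed gRPC messages from DATA frame payloads.

    Hypothetical helper for illustration; decompression is not handled.
    """

    def __init__(self) -> None:
        self._buf = bytearray()

    def feed(self, data: bytes) -> List[bytes]:
        """Append one DATA payload and return every message completed by it."""
        self._buf.extend(data)
        messages = []
        while len(self._buf) >= 5:
            # 1-byte compressed flag followed by a 4-byte big-endian length.
            _compressed, length = struct.unpack_from(">BI", self._buf, 0)
            if len(self._buf) < 5 + length:
                break  # message continues in a later DATA frame
            messages.append(bytes(self._buf[5:5 + length]))
            del self._buf[:5 + length]
        return messages
```

If a decoder treats each DATA frame as a complete message instead of buffering like this, small payloads decode fine and large ones stall or fail, which would be consistent with the symptoms described here. That connection is a guess, not a confirmed diagnosis.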
This is confirmed to be fixed in Linkerd 1.7.1.
Issue Type:
What happened:
Version 1.7.0 of Linkerd and Namerd lose connectivity on an update when the Consul namer processes a large number of replicas and sends a request with a large body.
In this scenario, Namerd is configured to use the Consul namer and Linkerd is configured to use the io.l5d.mesh interface.
When the Consul store changes, Namerd will trigger an update to Linkerd through the mesh interpreter. Namerd will send a few data payloads, which Linkerd receives, and then the traffic just stops altogether.
What you expected to happen:
The update from Namerd should complete successfully, and Linkerd should have an exact copy of the list of replicas that are stored in Namerd and Consul.
How to reproduce it (as minimally and precisely as possible):
Reproducing this is a multi-step process which requires setting up Consul, Linkerd, and Namerd.
Requires Java 8 or 9
For example: mkdir consul-namer-repro
For example: chmod 755 *1.7.0-exec
Open a terminal window for each of the commands below (You can background the processes, but it'll be more difficult to read the logs)
./start_consul.sh: The consul logs should report that five services have been registered.
./start_namerd.sh
./start_linkerd.sh
docker run --rm --name nginx_one -p 8080:8080 nginx
docker run --rm --name nginx_two -p 8081:8080 nginx
This shows the number of replicas currently stored by Namerd and Linkerd, and they should all read 0
./admin-counts.sh
curl https://localhost:4140/cat
The counts should now be at 5
./admin-counts.sh
cp long.json conf-dir/ && consul reload
At this point the Namerd and Linkerd logs will show communication between each other. There may be some ChannelClosedException messages, but they are not related to updating the consul services.
./admin-counts.sh
The counts will be out of sync. namer_state will have about 6000; client_state and interpreter_state will have a much lower value and will be equal.
Anything else we need to know?:
Any subsequent updates to the consul replicas, such as removing the long.json file or adding the short-2.json file, will not have any effect. Namerd will not attempt to communicate with Linkerd, which makes me think that the connection is in an unexpected state.
I have some tcpdump files with the packet information from the steps above.
To see the consul updates working normally:
Remove long.json from conf-dir:
rm -rf conf-dir/long.json
consul reload
cp short-2.json conf-dir/ && consul reload
./admin-counts.sh will show that the counts are all in sync
rm -rf conf-dir/short-2.json && consul reload
./admin-counts.sh will show that the counts are all in sync again
Environment:
linkerd/namerd version, config files:
Linkerd: 1.7.0
Namerd: 1.7.0
Consul: 1.5.1
Config files: https://gist.github.com/cpretzer/36682f0485abb106c8ca3ae028f1580c
Java Version: 1.8.0_211
Platform, version, and config files (Kubernetes, DC/OS, etc):
OS: macOS Mojave