
Consul Namer doesn't update Mesh Interpreter when the Consul payload is significantly large #2363

Closed
cpretzer opened this issue Dec 17, 2019 · 5 comments


@cpretzer
Contributor

Issue Type:

  • Bug report
  • Feature request

What happened:

With version 1.7.0 of Linkerd and Namerd, the two lose connectivity during an update when the Consul namer processes a large number of replicas and sends a request with a large body.

In this scenario, Namerd is configured to use the Consul namer and Linkerd is configured to use the io.l5d.mesh interface.

When the Consul store changes, Namerd will trigger an update to Linkerd through the mesh interpreter. Namerd will send a few data payloads, which Linkerd receives, and then the traffic just stops altogether.

What you expected to happen:

The update from Namerd should complete successfully, and Linkerd should have an exact copy of the list of replicas stored in Namerd and Consul.

How to reproduce it (as minimally and precisely as possible):

Reproducing this is a multi-step process which requires setting up Consul, Linkerd, and Namerd.

Requires Java 8 or 9

  1. Create a working directory
    1. mkdir consul-namer-repro, for example
  2. Download Consul
    1. Get the version for your OS: https://www.consul.io/downloads.html
    2. Make sure that you can run it from the working directory
  3. Download Linkerd and Namerd
    1. https://github.com/linkerd/linkerd/releases/download/1.7.0/namerd-1.7.0-exec
    2. https://github.com/linkerd/linkerd/releases/download/1.7.0/linkerd-1.7.0-exec
    3. Copy the files to the working directory
    4. Make the files executable: chmod 755 *1.7.0-exec
  4. Download configuration files
    1. https://gist.github.com/cpretzer/36682f0485abb106c8ca3ae028f1580c
    2. Click "Download ZIP"; the tgz file from the gist comes compressed inside another zip file, so extract both into the working directory
  5. Start the services
    Open a terminal window for each of the commands below (You can background the processes, but it'll be more difficult to read the logs)
    ./start_consul.sh: The consul logs should report that five services have been registered
    ./start_namerd.sh
    ./start_linkerd.sh
    docker run --rm --name nginx_one -p 8080:8080 nginx
    docker run --rm --name nginx_two -p 8081:8080 nginx
  6. Check the current replica counts
    This shows the number of replicas currently stored by Namerd and Linkerd, and they should all read 0
    ./admin-counts.sh
  7. Curl the Linkerd service to get the replicas
    curl https://localhost:4140/cat
  8. Check the updated replica counts
    The counts should now be at 5
    ./admin-counts.sh
  9. Increase the number of replicas in consul
    cp long.json conf-dir/ && consul reload
    At this point the Namerd and Linkerd logs will show communication with each other. There may be some ChannelClosedException messages, but they are not related to updating the Consul services.
  10. Check the updated replica counts
    ./admin-counts.sh
    The counts will be out of sync (see the sketch after this list for one way to compare them):
    namer_state will be around 6000
    client_state and interpreter_state will be equal to each other, but much lower
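
For reference, here is a rough sketch of one way to compare those gauges programmatically. It assumes the default admin ports (9990 for Linkerd, 9991 for Namerd) and that the gauges show up in each process's /admin/metrics.json; the gist's admin-counts.sh may do this differently, so treat the ports, paths, and matching logic below as assumptions rather than a description of that script.

import scala.io.Source

// Rough sketch: fetch the admin metrics from Namerd and Linkerd and print any
// entries matching the gauges discussed above. Ports, paths, and the crude
// string matching are assumptions; adjust them to match the gist's configs.
object AdminCounts extends App {
  def metrics(url: String): String =
    Source.fromURL(url).mkString

  def gaugeLines(json: String, needle: String): Seq[String] =
    json.split(",").filter(_.contains(needle)).toSeq

  val namerd  = metrics("https://localhost:9991/admin/metrics.json")
  val linkerd = metrics("https://localhost:9990/admin/metrics.json")

  Seq("namer_state" -> namerd, "interpreter_state" -> linkerd, "client_state" -> linkerd).foreach {
    case (name, json) => gaugeLines(json, name).foreach(line => println(s"$name: $line"))
  }
}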

Anything else we need to know?:

Any subsequent updates to the Consul replicas, such as removing the long.json file or adding the short-2.json file, have no effect. Namerd does not attempt to communicate with Linkerd, which makes me think that the connection is in an unexpected state.

I have some tcpdump files with the packet information from the steps above.

To see the consul updates working normally:

  • Remove long.json from conf-dir
    • rm -rf conf-dir/long.json
  • consul reload
  • Use ctrl-c to stop the Linkerd process, then restart it
  • cp short-2.json conf-dir/ && consul reload
  • Observe the communication between Linkerd and Namerd
  • ./admin-counts.sh will show that the counts are all in sync
  • rm -rf conf-dir/short-2.json && consul reload
  • ./admin-counts.sh will show that the counts are all in sync again

Environment:

@zaharidichev
Member

zaharidichev commented Dec 17, 2019

So far I know two things for sure:

  • as I expected, this does not have much to do with Consul. If you take Consul out of the equation and simply use a static file for the resolution data, you will see the same result whenever the file is significantly large (has many endpoints)
  • this is not a 1.7 bug. It manifests itself with 1.6.0 as well

What I think:

  • I think this is a bug with the H2 implementation or at least something wrong with how the H2 client is used by the mesh interface.

@cpretzer
Contributor Author

@zaharidichev thanks for taking a look at this

I'll continue to look at how the H2 client is used by the mesh interface.

A couple of things that I'm suspicious of are:

  • the WINDOW_UPDATE calls
  • the payloads sent from Namerd to Linkerd contain duplicate entries, almost as if Namerd is sending multiple payloads, each containing the entirety of its known replica sets at the time. I've observed that Namerd doesn't get the entire payload from Consul in one go, so I wonder if there is a race condition in how the data is being sent or read (see the toy sketch below).
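
To make that second suspicion concrete, here is a small, purely hypothetical toy sketch (none of these names come from linkerd) of how re-emitting the full accumulated set on every partial update from Consul would produce duplicate entries on the receiving side, if the receiver appends payloads instead of replacing its state:

// Toy illustration of the duplicate-entries suspicion, not linkerd code:
// the sender re-emits everything it knows on each partial update, so a
// receiver that concatenates payloads rather than replacing state sees
// the same replicas more than once.
object DuplicatePayloadSketch extends App {
  var known = Set.empty[String]                                   // sender's accumulated replica set
  val received = scala.collection.mutable.Buffer.empty[String]    // receiver naively appends

  def partialUpdate(newReplicas: Seq[String]): Unit = {
    known ++= newReplicas
    received ++= known  // emits the entire known set, not just the delta
  }

  partialUpdate(Seq("10.0.0.1:8080", "10.0.0.2:8080"))
  partialUpdate(Seq("10.0.0.3:8080"))

  println(s"distinct replicas: ${received.distinct.size}, entries received: ${received.size}")
  // prints: distinct replicas: 3, entries received: 5
}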

@zaharidichev
Member

I do not think the WINDOW_UPDATE calls are a reason for concern. I am currently debugging that, and it seems to me that only one H2 data frame is ever sent from namerd to linkerd no matter how big the payload is, which might have to do with the fact that this is always a UnaryToUnary H2 mode. I'm not sure; this is all hypothetical at the moment. I will try to get to the bottom of it, but this is almost certainly a problem in the H2 interface. Will keep you posted :)

@cpretzer
Contributor Author

@zaharidichev I continued to debug this today and made progress that confirms our conversation with @adleong about this being a Codec issue.

The communication is definitely UnaryToStream, based on this snippet from the generated resolver.pb.scala file:

...
trait Resolver {
  def getReplicas(req: io.linkerd.mesh.ReplicasReq): com.twitter.util.Future[io.linkerd.mesh.Replicas]
  def streamReplicas(req: io.linkerd.mesh.ReplicasReq): io.buoyant.grpc.runtime.Stream[io.linkerd.mesh.Replicas]
}
...
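
For reference, here is a rough sketch of how a caller might drain that streaming call. It assumes the runtime's io.buoyant.grpc.runtime.Stream exposes a recv() that yields one Releasable message (a value plus a release() callback) at a time; this is only an illustration of the call pattern, not the interpreter's actual code, and error handling for stream end or reset is omitted.

import com.twitter.util.Future
import io.buoyant.grpc.runtime.Stream

// `resolver` is any implementation of the generated Resolver trait shown above,
// e.g. the generated client. The assumed shape of Stream.Releasable is a value
// plus a release() callback that acknowledges the message.
def drainReplicas(resolver: Resolver, req: io.linkerd.mesh.ReplicasReq): Future[Unit] = {
  val replicas: Stream[io.linkerd.mesh.Replicas] = resolver.streamReplicas(req)

  def loop(): Future[Unit] =
    replicas.recv().flatMap { case Stream.Releasable(update, release) =>
      // Each update should be a complete snapshot of the replica set.
      println(s"received a Replicas update: $update")
      release().before(loop())
    }

  loop()
}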

I didn't get a chance to test with the Scala server, but a different test that I ran was to configure the [H2.server](https://github.com/linkerd/linkerd/blob/master/namerd/iface/mesh/src/main/scala/io/buoyant/namerd/iface/MeshIfaceInitializer.scala#L103) in the MeshIfaceInitializer with a different readTimeout value for the transport. In this case, I saw that the interface continuously sent requests to the interpreter.

This indicates to me that the interface is not getting an ACK from the interpreter for the payload that causes the failure. In my tests, this TCP payload is never larger than 40 KB, so I don't think that this is an aggregate size issue; again, it seems more like a problem with how the Codec is decoding the streams and the frames.

Here is a snippet of the configuration that you can use in [MeshIfaceInitializer](https://github.com/linkerd/linkerd/blob/master/namerd/iface/mesh/src/main/scala/io/buoyant/namerd/iface/MeshIfaceInitializer.scala#L103) to reproduce that particular behavior:

...
      H2.server
        .withTracer(NullTracer)
        .configuredParams(tlsParams)
        .configuredParams(socketOptParams)
        .withStatsReceiver(stats1)
        .withTransport.verbose
        .withTransport.readTimeout(1.minute)
        .serve(addr, dispatcher)
...

The 1.minute value is arbitrary and you can shorten it. The logs will show the attempts to resend at any Duration that you set.

I tried similar configurations of the [H2.client](https://github.com/linkerd/linkerd/blob/master/interpreter/mesh/src/main/scala/io/buoyant/interpreter/MeshInterpreterInitializer.scala#L87) on the MeshInterpreterInitializer side, and the behavior did not change.

This evening, I've been looking through [DecodingStream](https://github.com/linkerd/linkerd/blob/master/grpc/runtime/src/main/scala/io/buoyant/grpc/runtime/DecodingStream.scala), [Codec](https://github.com/linkerd/linkerd/blob/master/grpc/runtime/src/main/scala/io/buoyant/grpc/runtime/Codec.scala), and some of the related classes to make sure that the frames and streams are being completely read and properly drained.

There may be an issue with the frames not being assembled in the right order, but I haven't found the code that handles that. @adleong can you point me in the right direction?
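
For context on what that Codec has to handle: the gRPC wire format prefixes every message with a 5-byte header (a 1-byte compression flag followed by a 4-byte big-endian length), and a single message can span multiple H2 DATA frames, so the decoder has to buffer partial frames until a whole message is available. The standalone sketch below only illustrates that framing; it is not linkerd's Codec, and all of the names in it are made up.

import java.nio.ByteBuffer
import scala.collection.mutable.ArrayBuffer

// Toy decoder for gRPC's length-prefixed message framing. Bytes may arrive in
// arbitrary chunks (H2 DATA frames); a complete message is emitted only once
// all of its bytes have been buffered.
class ToyGrpcFramer {
  private var buf = Array.emptyByteArray

  /** Feed one chunk of bytes; returns any complete message payloads. */
  def offer(chunk: Array[Byte]): Seq[Array[Byte]] = {
    buf = buf ++ chunk
    val out = ArrayBuffer.empty[Array[Byte]]
    var more = true
    while (more) {
      if (buf.length < 5) more = false                // header not complete yet
      else {
        val bb = ByteBuffer.wrap(buf)
        val compressed = bb.get()                     // 1-byte compression flag
        val len = bb.getInt()                         // 4-byte big-endian length
        require(compressed == 0, "compressed messages not handled in this sketch")
        if (buf.length < 5 + len) more = false        // message spans a later frame
        else {
          out += buf.slice(5, 5 + len)                // one complete message
          buf = buf.drop(5 + len)                     // keep any trailing bytes
        }
      }
    }
    out.toSeq
  }
}

If partial frames are dropped, or stitched together in the wrong order, a decoder doing this kind of buffering would either stall waiting for bytes that never arrive or emit corrupt messages, either of which would look a lot like the stream of updates simply stopping.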

@cpretzer
Contributor Author

This is confirmed to be fixed in Linkerd 1.7.1
