
gRPC unary throughput and latency issues #1379

Open
stevej opened this issue Jun 12, 2017 · 1 comment

stevej commented Jun 12, 2017

We are getting reports of linkerd adding 10ms latency overhead at p50 with high concurrency.

I'm not able to reproduce the 10ms overhead at p50, but I have found a throughput ceiling of 12.5k RPS, reached at a concurrency level of 20. Beyond that, adding more clients doesn't increase throughput and only hurts latency. CPU utilization doesn't go above 85% of the total available CPU, even with the process pinned to 6 cores via taskset.

Here are a few representative lines from strest-client run against linkerd:

```
Concurrency: 5
2017-06-12T09:46:53-07:00 0.0B 91011/0 10s L: 0 [ 0 2 ] 27 J: 0 0
Concurrency: 10
2017-06-12T09:48:44-07:00 0.0B 115573/0 10s L: 0 [ 2 3 ] 17 J: 0 0
Concurrency: 20
2017-06-12T09:50:00-07:00 0.0B 127035/0 10s L: 0 [ 4 6 ] 26 J: 0 0
Concurrency: 50
2017-06-12T09:51:09-07:00 0.0B 126739/0 10s L: 0 [ 12 21 ] 65 J: 0 0
Concurrency: 100
2017-06-12T09:52:38-07:00 0.0B 116632/0 10s L: 0 [ 35 55 ] 106 J: 0 0
```

Compare this with querying strest-server directly:

```
Concurrency: 5
2017-06-12T09:59:58-07:00 0.0B 312830/0 10s L: 0 [ 0 0 ] 1 J: 0 0
Concurrency: 10
2017-06-12T10:00:10-07:00 0.0B 511413/0 10s L: 0 [ 0 0 ] 7 J: 0 0
Concurrency: 20
2017-06-12T10:00:30-07:00 0.0B 702391/0 10s L: 0 [ 0 1 ] 7 J: 0 0
Concurrency: 50
2017-06-12T10:12:24-07:00 0.0B 1010429/0 10s L: 0 [ 1 2 ] 30 J: 0 0
Concurrency: 100
2017-06-12T10:12:01-07:00 0.0B 1247139/0 10s L: 0 [ 2 3 ] 33 J: 0 0
```

Going through linkerd costs roughly 90% of peak throughput (127k vs 1.25M requests per 10s window), so there's some bottleneck. I don't have a smoking gun yet, but this smells like Amdahl's law in action.

The best evidence I have so far is the two attached flame graphs: one showing linkerd routing gRPC unary traffic from strest-client, the other showing it routing gRPC streaming traffic. What stands out in the unary graph is a very deep (> 40 frames) filter pipeline. So far that's the only major difference, but I'm still digging into whether there's monitor contention or some other mistuned parameter.
grpc_svgs.zip


adleong commented Oct 15, 2018

Probably related to #2125
