Linkerd CPU Hotspots and Thread Usage #2382

j0sh3rs · 2020-03-26T18:28:35Z

Issue Type:

Bug report

What happened:
After roughly a week of running performance testing load through linkerd. (through jmeter), we experience a case where linkerd sees a sharp increase in cpu usage and thread count jump, primarily related to netty UnboundedFuturePool usage:

$ grep "UnboundedFuturePool" *json
threads.json:      "thread" : "UnboundedFuturePool-419",
threads.json:      "thread" : "UnboundedFuturePool-421",
threads.json:      "thread" : "UnboundedFuturePool-420",
threads.json:      "thread" : "UnboundedFuturePool-418",
threads.json:      "thread" : "UnboundedFuturePool-416",
threads.json:      "thread" : "UnboundedFuturePool-422",
threads_10-77-89-88.json:      "thread" : "UnboundedFuturePool-20",

When doing sampling against the profiles, the cpu hotspots look like this:

Total: 25266 samples
    7771  30.8%  30.8%     7771  30.8% io.netty.channel.epoll.Native.epollWait0
    1971   7.8%  38.6%     1971   7.8% sun.misc.Unsafe.getInt
    1654   6.5%  45.1%     1654   6.5% java.lang.Thread.currentThread
     884   3.5%  48.6%     1007   4.0% com.twitter.finagle.http.DefaultHeaderMap$Headers.elemHashCode
     856   3.4%  52.0%      856   3.4% java.lang.String.charAt
     811   3.2%  55.2%     1326   5.2% com.twitter.finagle.http.Rfc7230HeaderValidation$.validateValue

Only after restarting linkerd (by patching the daemonset pods) does the issue resolve, only to reappear

What you expected to happen:
Linkerd's thread and cpu usage remain appropriate for the load it is receiving.

How to reproduce it (as minimally and precisely as possible):
run nightly jmeter load test for 7-10 days. Note: the issue is also observed in an environment where no Jmeter test runs, suggestive that it's not specifically tied to the jmeter usage.

Anything else we need to know?:
We attempted to work around the issue, suspecting it could be related to #2268 but still saw the same behavior while running with BiasedLocking enabled.

Some core configs of our jmeter setup include:

httpclient.reset_state_on_thread_group_iteration=false
httpclient4.validate_after_inactivity=66600
httpclient4.time_to_live=70000

with Use KeepAlive checked on the jobs

Environment:

linkerd/namerd version, config files:
Linkerd 1.7.1 (running on default java8)
Config:

admin:
      port: 9990
      ip: 0.0.0.0
      socketOptions:
        reusePort: true

    namers:
    - kind: io.l5d.k8s
      host: localhost
      port: 8001

    telemetry:
    - kind: io.l5d.prometheus

    routers:
    - protocol: http
      label: path
      streamAfterContentLengthKB: 1024
      streamingEnabled: true
      client:
        failureAccrual:
          kind: none
        hostConnectionPool:
          minSize: 5
        requeueBudget:
          percentCanRetry: 20.0
      interpreter:
        kind: io.l5d.k8s.configMap
        experimental: true
        name: linkerd-dtabs
        namespace: ping-services
        filename: dtab
      identifier:
        kind: io.l5d.path
        segments: 1
        consume: true
      servers:
      - port: 4140
        ip: 0.0.0.0
        clearContext: true
        socketOptions:
          reusePort: true
      service:
        responseClassifier:
          kind: io.l5d.http.retryableRead5XX

    - protocol: http
      label: path-tls
      streamAfterContentLengthKB: 1024
      streamingEnabled: true
      client:
        failureAccrual:
          kind: none
        hostConnectionPool:
          minSize: 5
        requeueBudget:
          percentCanRetry: 20.0
      interpreter:
        kind: io.l5d.k8s.configMap
        experimental: true
        name: linkerd-dtabs
        namespace: ping-services
        filename: dtab
      identifier:
        kind: io.l5d.path
        segments: 1
        consume: true
      servers:
      - port: 4141
        ip: 0.0.0.0
        clearContext: true
        socketOptions:
          reusePort: true
        tls:
          certPath: /certificates/tls.crt
          keyPath: /certificates/tls.key

Platform, version, and config files (Kubernetes, DC/OS, etc):
Kubernetes 1.15.9 running on ubuntu 16.04
Cloud provider or hardware configuration:
AWS m5.4xlarge instance type with EBS optimizations

The text was updated successfully, but these errors were encountered:

j0sh3rs · 2020-03-26T18:28:59Z

@cpretzer since you were helping us earlier this year :)

cpretzer · 2020-03-26T18:36:22Z

thanks @j0sh3rs I'll have a look!

j0sh3rs · 2020-07-23T19:19:46Z

@adleong @cpretzer any chance this would've been solved by the recent netty, node and finagle updates from 1.7.3?

cpretzer · 2020-08-26T22:11:52Z

@j0sh3rs we've been looking into whether the recent netty and finagle updates would address this issue.

So far, I haven't been able to get a test environment running to reproduce. Can you tell me more about the jmeter tests? Are they hitting your application in a scripted way? Or do they just throw load at the application?

j0sh3rs · 2020-08-28T16:45:39Z

@cpretzer unfortunately, I've changed roles and am no longer with Ping Identity, so I don't have the context anymore to be able to troubleshoot the jemeter behaviors any longer. I'm not sure who, if anyone, has taken this over from me, so it may be this would go stale and should be closed.

cpretzer · 2020-08-28T17:33:05Z

@j0sh3rs thanks for the update! I hope your new role is going well

adleong added this to Needs investigation in Linkerd 1.x Backlog. See https://github.com/linkerd/linkerd2/blob/main/ROADMAP.md Mar 31, 2020

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Linkerd CPU Hotspots and Thread Usage #2382

Linkerd CPU Hotspots and Thread Usage #2382

j0sh3rs commented Mar 26, 2020

j0sh3rs commented Mar 26, 2020

cpretzer commented Mar 26, 2020

j0sh3rs commented Jul 23, 2020

cpretzer commented Aug 26, 2020

j0sh3rs commented Aug 28, 2020

cpretzer commented Aug 28, 2020

Linkerd CPU Hotspots and Thread Usage #2382

Linkerd CPU Hotspots and Thread Usage #2382

Comments

j0sh3rs commented Mar 26, 2020

j0sh3rs commented Mar 26, 2020

cpretzer commented Mar 26, 2020

j0sh3rs commented Jul 23, 2020

cpretzer commented Aug 26, 2020

j0sh3rs commented Aug 28, 2020

cpretzer commented Aug 28, 2020