retries exceeds totalTimeoutMs request budget #2369

Closed
thekia19 opened this issue Dec 27, 2019 · 2 comments

@thekia19

Filing a Linkerd issue

Issue Type:

  • Bug report

What happened:
A request kept being retried for 5 hours even though the request budget (totalTimeoutMs) was set to 5s.
What you expected to happen:
The request fails once the budget is exceeded.
How to reproduce it (as minimally and precisely as possible):
Sorry, I can't provide steps to reproduce, but below are the config that causes the issue and the config that eliminates it.
Environment:

  • linkerd version 1.6.3
  • Linux node123 4.15.0-47-generic #50-Ubuntu SMP Wed Mar 13 10:44:52 UTC 2019 x86_64 x86_64 x86_64 GNU/Linux

Config causing issue

admin:
  ip: 0.0.0.0
  port: 9990

usage:
  enabled: false

telemetry:
- kind: io.l5d.prometheus
  prefix: cms_

namers:
- kind: io.l5d.consul
  host: 127.0.0.1
  port: 8500
  includeTag: true
  useHealthCheck: true
  token: {{ linkerd_token }}
  setHost: true
  consistencyMode: stale
  preferServiceAddress: true

routers:
  - protocol: http
    label: /cms-to-api
    identifier:
      kind: io.l5d.static
      path: /japi
    dtab: |
          /svc/japi => /#/io.l5d.consul/qa/japi/java-api;

    service:
      totalTimeoutMs: 5000
      responseClassifier:
        kind: io.l5d.http.retryableRead5XX
      retries:
        budget:
          minRetriesPerSec: 1
          percentCanRetry: 0.01
          ttlSecs: 10
        backoff:
          kind: jittered
          minMs: 1000
          maxMs: 10000

    servers:
    - port: 8180
      ip: 127.0.0.1
      addForwardedHeader:
         by: {kind: "ip:port"}
         for: {kind: ip}
    client:
      failureAccrual:
        kind: io.l5d.successRate
        successRate: 0.9
        requests: 1000
        backoff:
          kind: jittered
          minMs: 5000
          maxMs: 30000
      loadBalancer:
        kind: ewma
    httpAccessLog: access.log
    httpAccessLogRollPolicy: daily
    httpAccessLogAppend: true
    httpAccessLogRotateCount: 7
    tracePropagator:
      kind: io.l5d.zipkin

Config fixing the issue (only 2 parameters differ)

... // on service level
        backoff:
          kind: jittered
          minMs: 1000
          maxMs: 10000
... // on client level
      requeueBudget:
        percentCanRetry: 0.01

When maxMs (set to 10000) was reached, the request kept being retried every 10 seconds for 5 hours, causing an exception in Linkerd (trace below). At the same time the application at 10.20.93.22:6002 was responding to Linkerd successfully, but Linkerd kept retrying the request.
This happened up to 20 times per day while the application sent 100k rpm through Linkerd, so it's not critical, but it is an unpleasant thing to experience.

D 1217 06:25:59.595 UTC THREAD38 TraceId:0a3ac39af46a109b: Failed mid-stream. Terminating stream, closing connection
com.twitter.finagle.ChannelClosedException: ChannelException at remote address: /10.20.93.22:6002. Remote Info: Not Available
	at com.twitter.finagle.netty4.transport.ChannelTransport$$anon$2.channelInactive(ChannelTransport.scala:175)
	at io.netty.channel.AbstractChannelHandlerContext.invokeChannelInactive(AbstractChannelHandlerContext.java:245)
	at io.netty.channel.AbstractChannelHandlerContext.invokeChannelInactive(AbstractChannelHandlerContext.java:231)
	at io.netty.channel.AbstractChannelHandlerContext.fireChannelInactive(AbstractChannelHandlerContext.java:224)
	at io.netty.channel.ChannelInboundHandlerAdapter.channelInactive(ChannelInboundHandlerAdapter.java:75)
	at io.netty.channel.AbstractChannelHandlerContext.invokeChannelInactive(AbstractChannelHandlerContext.java:245)
	at io.netty.channel.AbstractChannelHandlerContext.invokeChannelInactive(AbstractChannelHandlerContext.java:231)
	at io.netty.channel.AbstractChannelHandlerContext.fireChannelInactive(AbstractChannelHandlerContext.java:224)
	at io.netty.channel.ChannelInboundHandlerAdapter.channelInactive(ChannelInboundHandlerAdapter.java:75)

Is this expected behavior when the jittered backoff maxMs is higher than the request budget?
I believe it shouldn't be, but please correct me if I'm wrong.

Many thanks

@adleong
Member

adleong commented Jan 7, 2020

Hi @thekia19! I wasn't able to reproduce this issue with a simplified version of your config. Here's what I used:

namers: []

routers:
  - protocol: http
    dtab: |
          /svc/* => /$/inet/localhost/8888;

    service:
      totalTimeoutMs: 5000
      responseClassifier:
        kind: io.l5d.http.retryableRead5XX
      retries:
        budget:
          minRetriesPerSec: 1
          percentCanRetry: 0.01
          ttlSecs: 10
        backoff:
          kind: jittered
          minMs: 1000
          maxMs: 10000

    client:
      failureAccrual:
        kind: io.l5d.successRate
        successRate: 0.9
        requests: 1000
        backoff:
          kind: jittered
          minMs: 5000
          maxMs: 30000
      loadBalancer:
        kind: ewma

    servers:
    - port: 4140

I then launched a simple webserver which always responds with status code 500 and sent a request to it through Linkerd using curl. In the webserver logs I see a total of 5 requests (the original plus 4 retries) before the timeout is reached, after which no more requests are made and Linkerd returns a response to curl with the message:

exceeded 5.seconds to unspecified while waiting for a response for the request, including retries (if applicable). Remote Info: Not Available
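
For reference, here is a minimal sketch of such an always-500 webserver (an illustrative sketch assuming Python 3's standard library, not necessarily the exact script used; port 8888 matches the /$/inet/localhost/8888 dtab entry above):

# Always-500 upstream for exercising retryableRead5XX through Linkerd.
# Assumption: it listens on 8888 to match the /$/inet/localhost/8888 dtab above.
from http.server import BaseHTTPRequestHandler, HTTPServer

class AlwaysFiveHundred(BaseHTTPRequestHandler):
    def do_GET(self):
        # Every response is a 500, so io.l5d.http.retryableRead5XX classifies it as retryable.
        self.send_response(500)
        self.send_header("Content-Length", "0")
        self.end_headers()

if __name__ == "__main__":
    # BaseHTTPRequestHandler logs each handled request to stderr, which is how the
    # original-plus-retries count is visible in the webserver logs.
    HTTPServer(("127.0.0.1", 8888), AlwaysFiveHundred).serve_forever()

With Linkerd listening on port 4140 as configured above, the test request can then be sent with something like curl http://localhost:4140/, which should return the timeout message after roughly 5 seconds.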

If you can tell me more about the exact server behavior that triggers what you describe, that would help me reproduce the issue.

@adleong
Member

adleong commented Jan 28, 2020

Closing due to inactivity.

adleong closed this as completed Jan 28, 2020