Requests which are canceled due to service profile governance are not retried #2358

Closed
1 of 2 tasks
pwlodek opened this issue Nov 20, 2019 · 3 comments

pwlodek commented Nov 20, 2019

Issue Type:

  • Bug report

  • Feature request

What happened:
Requests which are timed out by the service profile timeout parameter are not subject to retry.

What you expected to happen:
I would like timeouts which are governed by service profile to be treated as failures, and I would like them to be retried.

How to reproduce it (as minimally and precisely as possible):
Create a simple Linkerd service profile and set the timeout on any route to a small value, say 500ms. Then call this endpoint from a meshed pod. Whenever the call takes more than 500ms to complete, the pod sees a 504 Gateway Timeout, and no retry is attempted.

Anything else we need to know?:
Imagine you have a service endpoint /database/products. A GET /database/products returns a list of products from a database. Sometimes this call takes 50ms to complete, sometimes it takes 3 sec. I would like to be able to tell Linkerd: if this request takes longer than X, cancel it AND retry it (according to the retry budget). This would be analogous to the case where the endpoint returns 500, in which case Linkerd (if configured) retries according to the budget.
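To make the requested behavior concrete, here is a minimal Python sketch of "treat a route timeout like a 5xx and retry it". It is only an illustration, not Linkerd's proxy code: `call_with_retry`, `Outcome`, and the `max_retries` counter are hypothetical names, and the fixed counter is a stand-in for Linkerd's retry budget, which works differently. Upstream calls are simulated as `(status, latency_seconds)` pairs, so nothing actually sleeps.

```python
from dataclasses import dataclass
from typing import Callable, Tuple

GATEWAY_TIMEOUT = 504


@dataclass
class Outcome:
    status: int    # final HTTP status seen by the caller
    attempts: int  # number of attempts made, including the first


def call_with_retry(attempt: Callable[[], Tuple[int, float]],
                    timeout_s: float,
                    max_retries: int) -> Outcome:
    """Run simulated attempts until one succeeds or the retry allowance is spent.

    attempt() returns (status, latency_seconds) for one upstream call.
    max_retries is a hypothetical, simplified stand-in for a retry budget.
    """
    attempts = 0
    while True:
        attempts += 1
        status, latency = attempt()
        if latency > timeout_s:
            # The route timeout cancels the attempt; the caller would see a 504.
            status = GATEWAY_TIMEOUT
        # Requested behavior: a timeout is retryable, just like any other 5xx.
        retryable = status >= 500
        if not retryable or attempts > max_retries:
            return Outcome(status, attempts)


# First attempt is slow (3 s), the retry is fast (100 ms).
responses = iter([(200, 3.0), (200, 0.1)])
print(call_with_retry(lambda: next(responses), timeout_s=0.5, max_retries=1))
# Outcome(status=200, attempts=2)

# With no retries allowed, the caller just gets the 504 (the behavior reported here).
responses = iter([(200, 3.0)])
print(call_with_retry(lambda: next(responses), timeout_s=0.5, max_retries=0))
# Outcome(status=504, attempts=1)
```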

Why is this important? Say I configured the above route to time out after 500ms, and a meshed pod does GET /database/products. Say the first request takes more than 500ms, so it is canceled. If Linkerd then retried it, and the retry took, let's say, 100ms, the first request would time out after 500ms and the second would complete in 100ms. From the calling pod's point of view, the request would take ~600ms to complete. If Linkerd didn't interfere, it would take 3 sec. The ~600ms outcome is what should happen.
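The ~600ms figure can be checked with a small worked calculation. This is a sketch under the assumptions of the example above (each timed-out attempt costs the caller exactly the timeout, and retries start immediately with no backoff); `caller_latency_ms` is a hypothetical helper, not part of Linkerd.

```python
from typing import Sequence


def caller_latency_ms(attempt_latencies_ms: Sequence[int], timeout_ms: int) -> int:
    """Total time the caller waits, given how long each attempt would take upstream.

    A timed-out attempt costs timeout_ms; the first attempt that finishes
    within the timeout costs its own latency and ends the sequence.
    """
    total = 0
    for latency in attempt_latencies_ms:
        if latency <= timeout_ms:
            return total + latency
        total += timeout_ms
    return total  # every attempt timed out


print(caller_latency_ms([3000, 100], timeout_ms=500))  # 600: timeout, then a fast retry
print(caller_latency_ms([3000], timeout_ms=500))       # 500: timeout with no retry (ends in a 504)
```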

At this point, the fact that Linkerd cancels the request after X and returns 504 gateway timeout is pretty much useless.

Environment:

  • linkerd version: 2.6
  • Platform, version, and config files (Kubernetes, DC/OS, etc): Kubernetes 1.14
  • Cloud provider or hardware configuration: Docker Desktop with Kubernetes
@zaharidichev
Member

@pwlodek I think you created this in the wrong repo. Do you mind moving it to the linkerd2 one?

@pwlodek
Author

pwlodek commented Nov 20, 2019

Correct, please close this one, as the issue pertains to Linkerd2. It is tracked here: linkerd/linkerd2#3743

@pwlodek
Author

pwlodek commented Nov 20, 2019

Closing this one

pwlodek closed this as completed Nov 20, 2019