Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Figure out success rate classification in telemetry #634

Closed
siggy opened this issue Mar 27, 2018 · 7 comments
Closed

Figure out success rate classification in telemetry #634

siggy opened this issue Mar 27, 2018 · 7 comments

Comments

@siggy
Copy link
Member

siggy commented Mar 27, 2018

In #536 we introduced grpc_status_code and status_code to enable success rate calculations.
A naive success rate calculation would look something like:

(
  sum(irate(response_total{grpc_status_code="0"}[20s])) + 
  sum(irate(response_total{status_code="200"}[20s]))
) / 
sum(irate(response_total{}[20s]))

Note that status_code="200" doesn't count non-200 successes. Assuming we really want to count status_code<400, a couple options come to mind:

  1. add a success=true|false label to the proxy's response_total metric
  2. add a prometheus relabel_config to relabel all metrics with status_code's < 400 (this may require enumerating every integer [0...400)

Related to #627

@siggy siggy added this to the 0.4.0 milestone Mar 27, 2018
@olix0r
Copy link
Member

olix0r commented Mar 27, 2018

I think you need to do something like:

  sum(irate(response_total{status_code="200"}[20s])) -
  sum(irate(response_total{grpc_status_code!="0"}[20s]))

because all grpc messages have a 200 response code

@olix0r
Copy link
Member

olix0r commented Mar 27, 2018

The plan for this more generally is that routes will have some configurable classification logic, so this should be pushed down into the proxy.

@hawkw
Copy link
Member

hawkw commented Mar 27, 2018

Bear in mind also that responses with no status_code label and no grpc_status_code label are likely to be some kind of protocol-level error, which we don't currently make explicit (see #599)

@siggy siggy added area/proxy priority/P1 Planned for Release labels Mar 28, 2018
@olix0r olix0r added priority/P0 Release Blocker and removed priority/P1 Planned for Release labels Mar 28, 2018
@olix0r
Copy link
Member

olix0r commented Mar 28, 2018

I think the proxy should add a classification label to response metrics. The possible values of which are currently success or failure.

@hawkw
Copy link
Member

hawkw commented Mar 28, 2018

Question regarding gRPC statuses: it seems like some gRPC status codes (e.g. "Not Found", "Already Exists", "Permission Denied", etc) indicate errors on the client side rather than the server side. If we count the analogous HTTP 4xx error codes as "successes", it seems like some non-zero gRPC status codes might be counted as successes as well?

hawkw added a commit that referenced this issue Mar 28, 2018
This PR adds a `classification` label to proxy response metrics, as @olix0r described in #634 (comment). The label is either "success" or "failure", depending on the following rules:
+ **if** the response had a gRPC status code, *then*
   - gRPC status code 0 is considered a success
   - all others are considered failures
+ **else if** the response had an HTTP status code, *then*
  - status codes < 500 are considered success,
  - status codes >= 500 are considered failures
+ **else if** the response stream failed **then**
  - the response is a failure.

I've also added end-to-end tests for the classification of HTTP responses (with some work towards classifying gRPC responses as well). Additionally, I've updated `doc/proxy_metrics.md` to reflect the added `classification` label.

Signed-off-by: Eliza Weisman <[email protected]>
@hawkw
Copy link
Member

hawkw commented Mar 28, 2018

@siggy should #639 close this?

@siggy
Copy link
Member Author

siggy commented Mar 28, 2018

@hawkw yup thanks!

fixed in #639.

@siggy siggy closed this as completed Mar 28, 2018
olix0r pushed a commit to linkerd/linkerd2-proxy that referenced this issue Jul 7, 2018
This PR adds a `classification` label to proxy response metrics, as @olix0r described in linkerd/linkerd2#634 (comment). The label is either "success" or "failure", depending on the following rules:
+ **if** the response had a gRPC status code, *then*
   - gRPC status code 0 is considered a success
   - all others are considered failures
+ **else if** the response had an HTTP status code, *then*
  - status codes < 500 are considered success,
  - status codes >= 500 are considered failures
+ **else if** the response stream failed **then**
  - the response is a failure.

I've also added end-to-end tests for the classification of HTTP responses (with some work towards classifying gRPC responses as well). Additionally, I've updated `doc/proxy_metrics.md` to reflect the added `classification` label.

Signed-off-by: Eliza Weisman <[email protected]>
@github-actions github-actions bot locked as resolved and limited conversation to collaborators Jul 18, 2021
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Projects
None yet
Development

No branches or pull requests

3 participants