Add request metric to RestController to track success/failure (by status code) #109957
Add this suggestion to a batch that can be applied as a single commit.
This suggestion is invalid because no changes were made to the code.
Suggestions cannot be applied while the pull request is closed.
Suggestions cannot be applied while viewing a subset of changes.
Only one suggestion per line can be applied in a batch.
Add this suggestion to a batch that can be applied as a single commit.
Applying suggestions on deleted lines is not supported.
You must change the existing code in this line in order to create a valid suggestion.
Outdated suggestions cannot be applied.
This suggestion has been applied or marked resolved.
Suggestions cannot be applied from pending reviews.
Suggestions cannot be applied on multi-line comments.
Suggestions cannot be applied while the pull request is queued to merge.
Suggestion cannot be applied right now. Please check back later.
We would like to track the status code of each request through a new metric. We would like it to be internal to ES, so we have full ownership of it, and we can make changes and build the needed dashboards and alerts.
The main question we want to answer is: "What is the percentage of failed requests compared to successful one for a certain project, per API or group of API?"
This PR address this issue by adding a synchronous counter,
es.rest.requests.total
, with 3 attributes,es_rest_handler_name
,es_rest_status_code
andes_rest_request_method
. This way it should be possible to create custom dashboards and alarms based on API, group of APIs (filtering by e.g.labels.es_rest_handler_name: cat_*
), success/failure (filtering by e.g.numeric_labels.es_rest_status_code
)This is currently a WIP draft, several unit tests need to be added, but existing unit tests and IT tests pass, and I have done some rally runs to validate performances and there was no issue.
Which brings me to the open questions:
incrementBy
call on every rest request could cause performance problems; it seems there is no problem, all runs have been close to the starting case (current main), within variance (between +3% and -2%). It seems the Java agent handles aggregation with the same attributes quite well.BUT if we are worried this might be a problem we can do aggregation ourselves, and be a bit more aggressive. Note that I don't think we can use a async counter here: they are designed to produce one single metric per call, and if we use attributes (and I think we need to, especially for the "API" part), we cannot do that.After discussion, given the Rally result, and double checking that the APM Java agent aggregates increments with the same attributes (it does), and reducing the attribute space by using RestControllers names, we are happy with this for now. We can revisit this if we run into problems or limitations in the future.
I am saving the request path as a way to identify an API or groups of them, but I'm wonder if instead I should use the RestHandler routes? Or some sort of "RestHandler name/id"? (I feel this might be related to the capabilities work somehow)After discussion with the team, this has been replaced by
RestController#getName
("pulling up" the existingBaseRestHandler#getName
, defaulting to classname)Related JIRA: ES-7655