
[telemetry] provide simple settings at the mesh-level to define the metric dimensions reported by the envoy sidecars #51560

diranged (Contributor) opened this issue on Jun 13, 2024 · 0 comments

Quick note - I want a "simple" way to do this

I realize the Telemetry API likely already allows for what I want ... but what I am asking for is a first-class, simple, and well-defined/documented path.
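
For context, the Telemetry API route I'm referring to looks roughly like this - a sketch only, and the tags I drop below are just examples of dimensions someone might want to remove:

```yaml
# Sketch of the existing Telemetry API approach; the tag names are examples only.
apiVersion: telemetry.istio.io/v1alpha1
kind: Telemetry
metadata:
  name: drop-high-cardinality-tags
  namespace: istio-system        # root namespace, so it applies mesh-wide
spec:
  metrics:
    - providers:
        - name: prometheus
      overrides:
        - match:
            metric: ALL_METRICS
          tagOverrides:
            destination_canonical_revision:
              operation: REMOVE
            request_protocol:
              operation: REMOVE
```

This works, but it is tag-by-tag removal scattered across overrides rather than a single, well-documented place to say "these are the only dimensions my mesh reports."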

Describe the feature request

We run a reasonably large Istio environment: upwards of ~5,000-10,000 pods in the mesh at any given time, spread across ~50 different services, each of which is released at least 10 times a day. Added up, this creates enormous metric cardinality across the Istio "standard" labels tracked by the Envoy sidecars. The currently suggested solution is metric federation to aggregate the metrics, and we do run that today, but it has so many drawbacks that our developers have little faith in the metrics that are eventually exposed to them.

I'd really like the Istio control plane to have a first-class setting where we (cluster operators) define the list of dimensions by which the Envoy sidecars store metrics. This would let cluster operators control their own metric-cardinality behavior without building complex Prometheus->Prometheus pipelines, multiple scrape configurations, etc.
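
Purely to illustrate what I mean by "first-class", I'm imagining something roughly along these lines in MeshConfig. To be clear, this field does not exist today; the name and shape are entirely made up:

```yaml
# HYPOTHETICAL - this MeshConfig field does not exist; the field names are
# made up purely to illustrate the request. The label names are the existing
# Istio standard metric labels.
meshConfig:
  telemetryDefaults:
    metricDimensions:
      - reporter
      - source_workload
      - source_workload_namespace
      - destination_workload
      - destination_workload_namespace
      - destination_service
      - response_code
```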

Describe alternatives you've considered

For the last ~2 years we've run the federated metric model, but operating it at scale is difficult to say the least. There is no simple way to scale scraping thousands and thousands of high-churn pods, and it puts a significant operational burden on our small team.
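
For reference, the aggregation pipeline looks roughly like this today: a recording rule on an intermediate Prometheus that sums away the pod-level labels, plus a /federate scrape of the aggregated series from the downstream Prometheus. This is a simplified sketch; the label sets, job names, and targets are examples:

```yaml
# Simplified sketch of the federation workaround; labels and targets are examples.
#
# (1) On the intermediate Prometheus: aggregate away the pod-level labels.
groups:
  - name: istio-aggregation
    rules:
      - record: workload:istio_requests_total:rate5m
        expr: |
          sum by (source_workload, destination_workload, response_code) (
            rate(istio_requests_total[5m])
          )
---
# (2) On the downstream Prometheus: federate only the aggregated series.
scrape_configs:
  - job_name: istio-federation
    honor_labels: true
    metrics_path: /federate
    params:
      'match[]':
        - '{__name__=~"workload:.*"}'
    static_configs:
      - targets: ['prometheus-intermediate.example.com:9090']  # example target
```

This is also where the counters->gauges shape change mentioned in point 4 below comes from: by the time the data reaches developers, the series are no longer raw counters attributable to individual pods.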

Reasons why Envoy should handle the aggregation

Fundamentally, metric aggregation can of course happen in many places... but here are some of my motivations for doing it inside the Envoy process, where the metrics are already generated.

  1. It means we aren't bound to Prometheus to collect these metrics. If we want to use another provider (Datadog, for example), we can collect the metrics directly from the pods without having to build an intermediate storage system for aggregation.
  2. Following on from point 1, it means we can use the OTel Collector rather than Prometheus for scraping and passing metrics downstream to our metric storage systems (see the sketch after this list).
  3. Reduced memory footprint in each Envoy, because we get to define the number of unique time series the container needs to store.
  4. It allows us to reduce cardinality without fundamentally changing the shape of the metrics from counters to gauges. We can keep the metrics in their raw counter form and keep them attributed to the individual pods.
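
To make point 2 concrete, here is a rough sketch of an OTel Collector scraping the (then already low-cardinality) sidecar metrics directly; the backend endpoint is an example and pod/port discovery is simplified:

```yaml
# Sketch only: OTel Collector scraping Envoy sidecars directly (point 2).
# The exporter endpoint is an example; port selection is omitted for brevity.
receivers:
  prometheus:
    config:
      scrape_configs:
        - job_name: istio-sidecars
          metrics_path: /stats/prometheus
          kubernetes_sd_configs:
            - role: pod
          relabel_configs:
            # keep only the istio-proxy containers
            - source_labels: [__meta_kubernetes_pod_container_name]
              regex: istio-proxy
              action: keep
exporters:
  otlp:
    endpoint: metrics-backend.example.com:4317   # example backend
service:
  pipelines:
    metrics:
      receivers: [prometheus]
      exporters: [otlp]
```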

Affected product area (please put an X in all that apply)

[ ] Ambient
[ ] Docs
[ ] Dual Stack
[ ] Installation
[ ] Networking
[ ] Performance and Scalability
[x] Extensions and Telemetry
[ ] Security
[ ] Test and Release
[x] User Experience
[ ] Developer Infrastructure
