Describe the feature request
We run a reasonably large Istio environment: upwards of ~5,000-10,000 pods in the mesh at any given time, spread across ~50 different services, each of which releases at least 10 times a day. Added up, this creates enormous metric cardinality from the Istio "Standard" labels tracked by the Envoy sidecars. The currently suggested solution is to use metric federation to aggregate the metrics, but that system has many drawbacks. We do this today, yet we find so many pitfalls in aggregating at that layer that our developers have little faith in the metrics that are eventually exposed to them.
I'd really like to see the Istio control plane offer a first-class setting where we (cluster operators) define the list of dimensions the Envoy sidecar records for its metrics. This would let cluster operators control their own metric cardinality without implementing complex Prometheus->Prometheus pipelines, multiple scrape configurations, etc.
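For reference, the closest existing mechanism is the Telemetry API's tag overrides, which can drop individual dimensions per metric. A minimal sketch (the specific metric and label chosen here are illustrative; a real config would enumerate every high-cardinality label to drop):

```yaml
apiVersion: telemetry.istio.io/v1alpha1
kind: Telemetry
metadata:
  name: mesh-default
  namespace: istio-system       # mesh-wide when placed in the root namespace
spec:
  metrics:
    - providers:
        - name: prometheus
      overrides:
        - match:
            metric: REQUEST_COUNT           # istio_requests_total
          tagOverrides:
            destination_canonical_revision: # example high-cardinality label
              operation: REMOVE
```

What I am asking for is a simpler inverse of this: declare the short allow-list of dimensions to keep, rather than enumerating everything to remove.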
Describe alternatives you've considered
For the last ~2 years we've run the federated metric model, but operating it at scale is difficult to say the least. There is no simple way to scale the scraping of thousands and thousands of high-churn pods, and it puts a significant operational burden on our small team.
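For context, the federated model looks roughly like the standard Prometheus federation setup: an aggregating Prometheus scrapes the `/federate` endpoint of a mesh-local Prometheus that does the per-pod scraping (target and job names below are illustrative):

```yaml
# Aggregating Prometheus: pull pre-scraped Istio series from the mesh-local instance.
scrape_configs:
  - job_name: "federate-istio"
    honor_labels: true          # keep the labels set by the mesh-local Prometheus
    metrics_path: "/federate"
    params:
      "match[]":
        - '{__name__=~"istio_.*"}'   # only forward Istio standard metrics
    static_configs:
      - targets: ["prometheus-mesh.monitoring.svc:9090"]
```

The mesh-local instance still has to discover and scrape every sidecar, which is where the scaling and churn pain lives.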
Reasons why Envoy should handle the aggregation
Fundamentally, metric aggregation can of course happen in many places, but here are some of my motivations for doing it inside the Envoy process, where the metrics are already generated:
It means we aren't bound to Prometheus to collect these metrics. If we want to use another provider (Datadog, for example), we can collect the metrics directly from the pods without having to build any intermediate storage system for aggregation.
Following on from the previous point, it means we can use the OTEL Collector rather than Prometheus for scraping and passing metrics downstream to our metric storage systems.
Reduced memory footprint on each Envoy, because we get to define the number of unique time series the container needs to store.
It lets us reduce cardinality without fundamentally changing the shape of the metrics from counters to gauges. We can keep the metrics in their raw counter form, and keep them attributed to the individual pods.
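As a sketch of what the first two points enable: if Envoy itself only ever exposes the reduced label set, an OTel Collector can scrape the pods directly and ship raw counters to any backend, with no intermediate aggregation tier (endpoint and job names are illustrative):

```yaml
# Minimal OTel Collector pipeline: scrape sidecars, forward counters as-is.
receivers:
  prometheus:
    config:
      scrape_configs:
        - job_name: istio-sidecars
          kubernetes_sd_configs:
            - role: pod           # discover mesh pods directly
exporters:
  otlp:
    endpoint: metrics-backend.example.com:4317   # any OTLP-capable backend
service:
  pipelines:
    metrics:
      receivers: [prometheus]
      exporters: [otlp]
```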
Affected product area (please put an X in all that apply)
[ ] Ambient
[ ] Docs
[ ] Dual Stack
[ ] Installation
[ ] Networking
[ ] Performance and Scalability
[x] Extensions and Telemetry
[ ] Security
[ ] Test and Release
[x] User Experience
[ ] Developer Infrastructure
Quick note - I want a "simple" way to do this
I realize the Telemetry API likely allows for what I want, but what I am asking for is a first-class, simple, well-defined and well-documented path.