
Should prometheus metric name normalization be exposed as a configuration option? #21743

Closed
bryan-aguilar opened this issue May 10, 2023 · 16 comments · Fixed by #24260

@bryan-aguilar
Contributor

Component(s)

exporter/prometheus, exporter/prometheusremotewrite, receiver/prometheus

Describe the issue you're reporting

In #20519 the pkg.translator.prometheus.NormalizeName feature gate was enabled by default, causing full Prometheus metric name normalization to be used instead of simple normalization. The intent of feature gates is that they will eventually be removed, and thus the option to revert to simple name normalization will be lost.

I believe the choice between simple and full name normalization should belong to the operator of the collector. It is not clear to me that full name normalization makes sense in all use cases.

Specifically, I am wondering whether permanently enabling this flag moves the OpenTelemetry Prometheus receiver further away from being a drop-in replacement for the Prometheus server. Does the Prometheus server perform this same type of full metric name normalization by default?
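For illustration, assume an OTel monotonic sum named system.network.io with unit By being exported to Prometheus; the two modes would produce roughly:

```
simple normalization: system_network_io
full normalization:   system_network_io_bytes_total   (unit and _total suffixes appended)
```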

@github-actions
Contributor

Pinging code owners:

See Adding Labels via Comments if you do not have permissions to add labels yourself.

@hughesjj
Contributor

We have a working group meeting in two weeks if you would like to discuss it there. If not, I'll do so.

@bryan-aguilar
Contributor Author

Which working group meeting are you referring to specifically?

@bryan-aguilar
Contributor Author

bryan-aguilar commented May 10, 2023

Collector SIG discussion on 10 May 2023
Discussion Participants: @hughesjj @bryan-aguilar @jpkrohling

  • Take this to the Prometheus interoperability working group and have the discussion there.
  • We should also have this discussion with the language SDKs. OTel SDKs with Prometheus exporters should have the same capabilities as the collector exporters. If we need to codify the export behavior between the SDKs and the Collector, then we should consider taking this to the spec level.
  • We are also open to leaving this feature gate in place for a longer time until this discussion is resolved.

Please add anything that I may have missed.

@ekarlso

ekarlso commented May 16, 2023

This change effectively breaks any existing Grafana dashboard you get from various projects and vendors, ultimately giving you two options:

  • Create and maintain specific dashboards yourself.
  • Have the project or community adapt and maintain dashboards specifically for OTel-based setups.

The same will probably apply to any alerting rules you get from upstream.

@swiatekm
Contributor

From the perspective of a user wanting to use the prometheus receiver as a drop-in replacement for Prometheus, this is a very disruptive change.

The Prometheus -> Otel normalization is very simple, as it only trims unit and _total suffixes from metrics. Unfortunately, there are a lot of very widely used metrics affected by this. For example, Kubernetes has the container_cpu_usage_seconds_total cadvisor metric, which is probably the single most used container metric in the ecosystem.

I support making this a configuration option. And if we actually want users to migrate, we should consider providing a more gradual migration path as well.

@damemi
Contributor

damemi commented May 24, 2023

This also affects us in the googlemanagedprometheus exporter, both on the read side from the receiver and on export because we rely on the normalization functions in the Collector.

@bryan-aguilar
Contributor Author

bryan-aguilar commented May 24, 2023

Summary from prometheus wg this morning:

  • Normalization is hard because the desired behavior can differ depending on the use case. For example, a metrics pipeline with an OTLP receiver and Prometheus remote write exporter may tolerate/desire a different normalization mode than a pipeline with a prometheus receiver and prometheus remote write exporter.
  • We should move to make metric name normalization a configuration option (see the sketch at the end of this comment).
    • The configuration should allow us to achieve the same behavior as before the breaking change was made.
    • It is not clear what the default behavior should be (full vs. simple normalization). This will be part of the proposal and can be discussed there.
    • The proposal should investigate the ability for a prometheus receiver to add an attribute to the instrumentation scope of the metric. That attribute will signal what type of normalization was used in the receiver. A compatible exporter will then first look for that attribute rather than configuration to determine the normalization it should use.
  • There will most likely be some spec changes required, especially if an attribute relating to metric name normalization is introduced. Metric name normalization should be clearly defined in the spec so that language SDKs and Collector components can be compatible. See the relevant spec doc here.

I will break down this summary into a list of action items. After doing that we can determine action item owners and start moving forward.
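As a purely hypothetical sketch of such a per-component option (the option name and values below are invented for illustration and do not exist in any component):

```yaml
# Hypothetical example only: neither the option name nor its values are real.
receivers:
  prometheus:
    metric_name_normalization: simple   # keep scraped names as-is
exporters:
  prometheusremotewrite:
    metric_name_normalization: full     # append unit and _total suffixes on export
```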

@bryan-aguilar
Contributor Author

The Prometheus -> Otel normalization is very simple, as it only trims unit and _total suffixes from metrics. Unfortunately, there are a lot of very widely used metrics affected by this. For example, Kubernetes has the container_cpu_usage_seconds_total cadvisor metric, which is probably the single most used container metric in the ecosystem.

@swiatekm-sumo are you saying that Prometheus -> OTel translation converts the container_cpu_usage_seconds_total metric name to container_cpu_usage_seconds? Is that behavior documented somewhere? I can't seem to find the relevant part in the normalization docs here.

@swiatekm
Contributor

The Prometheus -> Otel normalization is very simple, as it only trims unit and _total suffixes from metrics. Unfortunately, there are a lot of very widely used metrics affected by this. For example, Kubernetes has the container_cpu_usage_seconds_total cadvisor metric, which is probably the single most used container metric in the ecosystem.

@swiatekm-sumo are you saying that Prometheus -> OTel translation converts the container_cpu_usage_seconds_total metric name to container_cpu_usage_seconds? Is that behavior documented somewhere? I can't seem to find the relevant part in the normalization docs here.

To be exact, the Prometheus receiver calls TrimPromSuffixes here. The implementation is here, and it does what I mentioned. This isn't explicitly documented anywhere, though, so maybe it's actually not intended?

@bryan-aguilar
Contributor Author

To be exact, the Prometheus receiver calls TrimPromSuffixes here. The implementation is here, and it does what I mentioned. This isn't explicitly documented anywhere, though, so maybe it's actually not intended?

Thanks for the detailed info. This may be worthy of an issue in itself. I'll take a deeper look; maybe we are missing some context. I'll add it to the list of action items that I will build around metric name normalization.

@dashpole
Contributor

Sorry I haven't been part of relevant discussions, as I'm still on leave. The intent of the feature gate was to be transparent to users when using a prometheus receiver and exporter. Translation from prometheus naming to otel naming is done in the receiver, and the reverse is done in prometheus exporters. It should not be breaking for anyone using a prometheus receiver and exporter as a drop-in replacement for the prometheus server.

The transition is intended to be gradual, and the feature gate was in alpha (disabled by default) for 11 months. It can still be disabled with --feature-gates=-pkg.translator.prometheus.NormalizeName, and I think we should leave plenty of time for people to transition before we remove it.

@bryan-aguilar
Contributor Author

Translation from prometheus naming to otel naming is done in the receiver, and the reverse is done in prometheus exporters. It should not be breaking for anyone using a prometheus receiver and exporter as a drop-in replacement for the prometheus server.

Does this mean there is a guarantee that metric pipelines with prom receiver -> promrw exporter and prom receiver -> otlp exporter will produce different metric names because of normalization? Is that the intent? I don't know if the component documentation clearly states that.

The prometheus receiver readme says:

The Prometheus receiver is meant to minimally be a drop-in replacement for Prometheus

What does minimally a drop-in replacement mean here? Does performing normalization by default in the receiver push it further from that goal? I believe @swiatekm-sumo brought up a good use case where this would not be desired. Dropping the _total suffix from container_cpu_usage_seconds_total would be surprising to a user whose metrics pipeline does not have an exporter that re-normalizes, such as the PRW exporter.
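To make the comparison concrete, here is a minimal sketch of the two pipelines in question (component settings and endpoints are illustrative):

```yaml
# Illustrative only: with receiver-side normalization enabled, the OTLP path
# exports the trimmed names as-is, while the PRW path re-appends suffixes on export.
receivers:
  prometheus:
    config:
      scrape_configs:
        - job_name: example                      # illustrative job
          static_configs:
            - targets: ["localhost:8888"]        # illustrative target
exporters:
  prometheusremotewrite:
    endpoint: "https://example.com/api/v1/write"
  otlp:
    endpoint: "collector-gateway:4317"
service:
  pipelines:
    metrics/prw:
      receivers: [prometheus]
      exporters: [prometheusremotewrite]
    metrics/otlp:
      receivers: [prometheus]
      exporters: [otlp]
```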

@dmitryax
Member

dmitryax commented Jun 7, 2023

I agree that this should be a configuration option. There are many scenarios (not prom->otel->prom) when the full normalization can be confusing.

Another concern is that some standard Prometheus metrics can potentially be mapped 1:1 to standard OTel metrics defined in the spec. Would covering those cases be complete normalization? I'm a bit skeptical about going further down that route and would vote for making simple normalization the default option.

To be exact, the Prometheus receiver calls TrimPromSuffixes here. The implementation is here, and it does what I mentioned. This isn't explicitly documented anywhere, though, so maybe it's actually not intended?

Thanks for the detailed info. This may be worthy of an issue in itself. I'll take a deeper look; maybe we are missing some context. I'll add it to the list of action items that I will build around metric name normalization.

I submitted a PR to document the Prometheus -> OTel transition: #23209. I also filed an issue about the limited translations on the receiver side: #23208.

The proposal should investigate the ability for a prometheus receiver to add an attribute to the instrumentation scope of the metric. That attribute will signal what type of normalization was used in the receiver. A compatible exporter will then first look for that attribute rather than configuration to determine the normalization it should use.

I don't think we should complicate it like this as long as the Prometheus exporter produces the same result for metrics whether or not they were normalized on the receiver side, which is the case AFAIK; e.g., it doesn't add the _total suffix if it's already there.

@gouthamve
Member

Hi, I have a lot more experience with the exporter side of things than the receiver side. And in the exporter, not having full normalization is quite bad IMO.

For example, take the case of a metric being exported via OTLP Recv --> PRW today (full normalisation OFF): http_server_duration. Once it reaches Prometheus, there is no way to know the unit on this metric. And in OTel, both millisecond and second are likely options. Prometheus handles units by putting them in metric names, and handling metrics without units is a bad experience for Prometheus users.

For this reason, I think we should enable full normalisation by default on the exporter as soon as possible. It is an inevitable change, and one that becomes more painful the longer it takes, as it will break more users.


Coming to the normalisation on the receiver end, it is a tricky question. Let us consider the following use cases:

  • Prom Recv --> Prom Exporter

Whether the receiver does full normalisation or not doesn't matter here, as long as the exporter does full normalisation.

  • Prom Recv --> OTLP Exporter

With normalisation it becomes:

http_server_duration_seconds becomes http_server_duration WITH UNIT = s
alertmanager_notifications_total becomes alertmanager_notifications

Some names are unfortunate though:

prometheus_request_message_bytes_total becomes prometheus_request_message WITH UNIT = Byte
alertmanager_build_info becomes alertmanager_build WITH UNIT = ???
process_cpu_seconds_total becomes process_cpu WITH UNIT = s


So I am kinda split on the receiver normalisation. It makes sense for a lot of the metrics, but there is also a set of metrics where it's a bit unfortunate.

@gouthamve
Member

Just found this: prometheus/prometheus#12152

If this is merged, people could see warnings for OTel metrics (PossibleNonCounterWarning) until the _total is appended.

dmitryax pushed a commit that referenced this issue Jul 17, 2023
…4256)

**Description:**

Adds a new configuration option, `trim_metric_suffixes`, to the prometheus receiver. When set to true, suffixes will be trimmed from metrics. This replaces use of the `pkg.translator.prometheus.NormalizeName` feature gate in the prometheus receiver, but leaves it in place for exporters.

The first commit simplifies the usage of the feature-gate. The tests and
implementation were passing around an entire registry when it wasn't
necessary.

**Link to tracking Issue:**

Part of #21743
Part of #8950
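For reference, a minimal sketch of how this receiver option can be used (the scrape job and target are illustrative):

```yaml
receivers:
  prometheus:
    trim_metric_suffixes: true   # strip unit and _total suffixes on ingest
    config:
      scrape_configs:
        - job_name: example                      # illustrative job
          static_configs:
            - targets: ["localhost:8888"]        # illustrative target
```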
dmitryax added a commit that referenced this issue Jul 18, 2023
…24260)

**Description:**
Adds the `add_metric_suffixes` configuration option, which can disable the addition of type and unit suffixes. This is backwards-compatible, since the default is true.

This is recommended by the specification for sum suffixes in
https://github.com/open-telemetry/opentelemetry-specification/blob/main/specification/compatibility/prometheus_and_openmetrics.md#sums
and allowed in metadata

https://github.com/open-telemetry/opentelemetry-specification/blob/main/specification/compatibility/prometheus_and_openmetrics.md#metric-metadata-1:

`Exporters SHOULD provide a configuration option to disable the addition
of _total suffixes.`

`The resulting unit SHOULD be added to the metric as OpenMetrics UNIT
metadata and as a suffix to the metric name unless the metric name
already contains the unit, or the unit MUST be omitted`

This deprecates the BuildPromCompliantName function in favor of BuildCompliantName, which takes an additional argument for configuring suffixes.

**Link to tracking Issue:**

Fixes #21743
Part of #8950

---------

Co-authored-by: Dmitrii Anoshin <[email protected]>
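For reference, a minimal sketch of how the exporter-side option can be used (endpoints are illustrative; the default of true preserves current behavior):

```yaml
exporters:
  prometheus:
    endpoint: "0.0.0.0:8889"                       # illustrative endpoint
    add_metric_suffixes: false                     # do not append type/unit suffixes
  prometheusremotewrite:
    endpoint: "https://example.com/api/v1/write"   # illustrative endpoint
    add_metric_suffixes: false
```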