
Prometheus Receiver - Some counter metrics dropped for unknown reason #4974

Closed
gillg opened this issue May 6, 2021 · 29 comments
Labels
comp:aws AWS components comp:prometheus Prometheus related issues

Comments

@gillg
Contributor

gillg commented May 6, 2021

Describe the bug
While analyzing an OpenTelemetry Collector setup and trying to build dashboards from the collected metrics, I discovered that some metrics get dropped for an unknown reason.
I have a bunch of "dropped" metrics without any corresponding logs or traces.
Eventually I found log entries like:

info internal/metrics_adjuster.go:357 Adjust - skipping unexpected point {"kind": "receiver", "name": "prometheus", "type": "UNSPECIFIED"}

So they seem to be dropped because of an unspecified type. I added some debug logging in metricFamily.go to inspect their metadata, and it is empty.

As an example, this entire list disappears between the receiver and the exporter:

otel-collector    | cortex_deprecated_flags_inuse_total {Metric: Type: Help: Unit:}
otel-collector    | cortex_experimental_features_in_use_total {Metric: Type: Help: Unit:}
otel-collector    | go_memstats_alloc_bytes_total {Metric: Type: Help: Unit:}
otel-collector    | go_memstats_frees_total {Metric: Type: Help: Unit:}
otel-collector    | go_memstats_lookups_total {Metric: Type: Help: Unit:}
otel-collector    | go_memstats_mallocs_total {Metric: Type: Help: Unit:}
otel-collector    | grafana_api_admin_user_created_total {Metric: Type: Help: Unit:}
otel-collector    | grafana_api_dashboard_snapshot_create_total {Metric: Type: Help: Unit:}
otel-collector    | grafana_api_dashboard_snapshot_external_total {Metric: Type: Help: Unit:}
otel-collector    | grafana_api_dashboard_snapshot_get_total {Metric: Type: Help: Unit:}
otel-collector    | grafana_api_login_oauth_total {Metric: Type: Help: Unit:}
otel-collector    | grafana_api_login_post_total {Metric: Type: Help: Unit:}
otel-collector    | grafana_api_login_saml_total {Metric: Type: Help: Unit:}
otel-collector    | grafana_api_models_dashboard_insert_total {Metric: Type: Help: Unit:}
otel-collector    | grafana_api_org_create_total {Metric: Type: Help: Unit:}
otel-collector    | grafana_api_response_status_total {Metric: Type: Help: Unit:}
otel-collector    | grafana_api_user_signup_completed_total {Metric: Type: Help: Unit:}
otel-collector    | grafana_api_user_signup_invite_total {Metric: Type: Help: Unit:}
otel-collector    | grafana_api_user_signup_started_total {Metric: Type: Help: Unit:}
otel-collector    | grafana_aws_cloudwatch_get_metric_data_total {Metric: Type: Help: Unit:}
otel-collector    | grafana_aws_cloudwatch_get_metric_statistics_total {Metric: Type: Help: Unit:}
otel-collector    | grafana_aws_cloudwatch_list_metrics_total {Metric: Type: Help: Unit:}
otel-collector    | grafana_datasource_request_total {Metric: Type: Help: Unit:}
otel-collector    | grafana_db_datasource_query_by_id_total {Metric: Type: Help: Unit:}
otel-collector    | grafana_emails_sent_total {Metric: Type: Help: Unit:}
otel-collector    | grafana_instance_start_total {Metric: Type: Help: Unit:}
otel-collector    | grafana_page_response_status_total {Metric: Type: Help: Unit:}
otel-collector    | grafana_proxy_response_status_total {Metric: Type: Help: Unit:}
otel-collector    | http_request_total {Metric: Type: Help: Unit:}
otel-collector    | loki_logql_querystats_duplicates_total {Metric: Type: Help: Unit:}
otel-collector    | loki_logql_querystats_ingester_sent_lines_total {Metric: Type: Help: Unit:}
otel-collector    | process_cpu_seconds_total {Metric: Type: Help: Unit:}

I had to add custom logging to understand what happens: the metrics seem to be dropped because the Type is unspecified and they have no metadata from metricFamily.go.

For an unknown reason, some other metrics ending with _total work fine, for example:

node_cpu_seconds_total{cpu="0",mode="idle"}

Steps to reproduce
I don't know exactly... Try scraping the Grafana metrics endpoint at https://grafana:3000/metrics.
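
For reference, here is a minimal collector configuration along the lines of this setup (a sketch only: the job name, target address, exporter port, and scrape interval are assumptions, not the exact config used):

receivers:
  prometheus:
    config:
      scrape_configs:
        # Hypothetical scrape job hitting Grafana's internal metrics endpoint
        - job_name: grafana
          scrape_interval: 15s
          static_configs:
            - targets: ["grafana:3000"]

exporters:
  prometheus:
    endpoint: "0.0.0.0:8889"

service:
  pipelines:
    metrics:
      receivers: [prometheus]
      exporters: [prometheus]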

What did you expect to see?
Metrics should be kept internally and then be visible on the exporter side.

What did you see instead?
No metrics on the exporter side; they are probably dropped in metrics_adjuster.go (a log line including the metric name is definitely missing here).

What version did you use?
0.25

@gillg
Contributor Author

gillg commented May 6, 2021

All these dropped metrics also have negative side effects at the exporter level.

Each of them produces the following error in the Prometheus exporter. The metric seems to be only "partially" dropped: a reference must remain somewhere, but without a valid name or type...

otel-collector    | 2021-05-06T09:56:33.720Z    error   prometheusexporter/accumulator.go:103   failed to translate metric      {"kind": "exporter", "name": "prometheus", "data_type": "\u0000", "metric_name": ""}
otel-collector    | go.opentelemetry.io/collector/exporter/prometheusexporter.(*lastValueAccumulator).addMetric
otel-collector    |     go.opentelemetry.io/collector/exporter/prometheusexporter/accumulator.go:103
otel-collector    | go.opentelemetry.io/collector/exporter/prometheusexporter.(*lastValueAccumulator).Accumulate
otel-collector    |     go.opentelemetry.io/collector/exporter/prometheusexporter/accumulator.go:74
otel-collector    | go.opentelemetry.io/collector/exporter/prometheusexporter.(*collector).processMetrics
otel-collector    |     go.opentelemetry.io/collector/exporter/prometheusexporter/collector.go:54
otel-collector    | go.opentelemetry.io/collector/exporter/prometheusexporter.(*prometheusExporter).ConsumeMetrics
otel-collector    |     go.opentelemetry.io/collector/exporter/prometheusexporter/prometheus.go:100
otel-collector    | go.opentelemetry.io/collector/exporter/exporterhelper.(*metricsRequest).export
otel-collector    |     go.opentelemetry.io/collector/exporter/exporterhelper/metricshelper.go:57
otel-collector    | go.opentelemetry.io/collector/exporter/exporterhelper.(*timeoutSender).send
otel-collector    |     go.opentelemetry.io/collector/exporter/exporterhelper/common.go:215
otel-collector    | go.opentelemetry.io/collector/exporter/exporterhelper.(*retrySender).send
otel-collector    |     go.opentelemetry.io/collector/exporter/exporterhelper/queued_retry.go:241
otel-collector    | go.opentelemetry.io/collector/exporter/exporterhelper.(*metricsSenderWithObservability).send
otel-collector    |     go.opentelemetry.io/collector/exporter/exporterhelper/metricshelper.go:120
otel-collector    | go.opentelemetry.io/collector/exporter/exporterhelper.(*queuedRetrySender).send
otel-collector    |     go.opentelemetry.io/collector/exporter/exporterhelper/queued_retry.go:171
otel-collector    | go.opentelemetry.io/collector/exporter/exporterhelper.(*metricsExporter).ConsumeMetrics
otel-collector    |     go.opentelemetry.io/collector/exporter/exporterhelper/metricshelper.go:74
otel-collector    | go.opentelemetry.io/collector/processor/batchprocessor.(*batchMetrics).export
otel-collector    |     go.opentelemetry.io/collector/processor/batchprocessor/batch_processor.go:285
otel-collector    | go.opentelemetry.io/collector/processor/batchprocessor.(*batchProcessor).sendItems
otel-collector    |     go.opentelemetry.io/collector/processor/batchprocessor/batch_processor.go:183
otel-collector    | go.opentelemetry.io/collector/processor/batchprocessor.(*batchProcessor).processItem
otel-collector    |     go.opentelemetry.io/collector/processor/batchprocessor/batch_processor.go:156
otel-collector    | go.opentelemetry.io/collector/processor/batchprocessor.(*batchProcessor).startProcessingCycle
otel-collector    |     go.opentelemetry.io/collector/processor/batchprocessor/batch_processor.go:141

@rakyll
Contributor

rakyll commented May 6, 2021

cc @Aneurysm9 @alolita

@bogdandrutu
Member

Most likely they are the new types "gauge histogram, enum or state" which we don't support

@gburton1

We are seeing the same. Even a counter as simple as this one, which is present on the endpoint that the Prometheus receiver scrapes, does not appear on the exporter side.

# HELP process_cpu_seconds_total Total user and system CPU time spent in seconds.
# TYPE process_cpu_seconds_total counter
process_cpu_seconds_total 7368.08

@gillg
Contributor Author

gillg commented May 12, 2021

Most likely they are the new types "gauge histogram, enum or state" which we don't support

I don't understand: I compared metrics with exactly the same name on a node-exporter and on the Grafana internal exporter, and the metric is scraped correctly from node-exporter but not from Grafana... It makes no sense.

I suspected an invalid byte somewhere in the Grafana output, gzip compression, or transfer encoding, but I have found nothing so far...

@Aneurysm9
Member

I wonder if this is related to the adjustment logic being removed in open-telemetry/opentelemetry-collector#3047 that was dropping some samples. Can you try with a build from that branch, or try once it lands on the trunk?

@gillg
Contributor Author

gillg commented May 14, 2021

Don't you think there is a link with https://github.com/open-telemetry/opentelemetry-collector/issues/2852 ? It looks like a "regression" or something similar, since this was working before March.
The exact value of a counter may not matter much as long as its variation is consistent with the original exporter, but here I'm starting to have doubts: on some graphs based on cpu_time_total I get negative values.

@Aneurysm9 I haven't had time to try the branch from your PR yet, I will try it ASAP.

@gillg
Contributor Author

gillg commented May 17, 2021

@Aneurysm9 since that PR was merged, I tried directly on master, but it's the same thing. 😞
What puzzles me is that this only happens on the Grafana metrics page.

As an example, from node-exporter the otel-collector successfully records

# HELP go_memstats_frees_total Total number of frees.
# TYPE go_memstats_frees_total counter
go_memstats_frees_total{otel_job="node-exporter"} 6.332249818e+09 1621240036929

For the Grafana job this metric is dropped. The original exposed value is:

# HELP go_memstats_frees_total Total number of frees.
# TYPE go_memstats_frees_total counter
go_memstats_frees_total 2.46306915e+08

Even if I keep only the Grafana scrape job, the result is the same.

@gillg
Contributor Author

gillg commented May 26, 2021

Sadly, I got excited too soon: this does not fix the issue.
That fix changes the exporter accumulator, but here the issue is at the receiver level.
The error in the exporter accumulator stays the same with this new fix, since the data is already "corrupted" at the receiver level.

@gillg
Contributor Author

gillg commented May 26, 2021

@bogdandrutu can you reopen the issue please ?

@bruuuuuuuce

@gillg I am facing a similar issue with the prometheus receiver, did you find a workaround that worked for you?

@gillg
Contributor Author

gillg commented Jun 26, 2021

@gillg I am facing a similar issue with the prometheus receiver, did you find a workaround that worked for you?

Hello, unfortunately nothing so far. It's definitely not systematic, and it's not a majority of metrics, but it shows up a lot in some contexts, such as Grafana's own metrics.

@bruuuuuuuce

@gillg I am facing a similar issue with the prometheus receiver, did you find a workaround that worked for you?

Hello, unfortunately nothing so far. It's definitely not systematic, and it's not a majority of metrics, but it shows up a lot in some contexts, such as Grafana's own metrics.

Thanks for the response! I ended up changing my metrics to add a # TYPE metric_name metric_type line above each metric, as described in the Prometheus exposition format https://prometheus.io/docs/instrumenting/exposition_formats/#line-format. This fixed my issue with OpenTelemetry. It seems some code does not handle the case where the TYPE line is missing, even though Prometheus says it is optional.
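
To illustrate the workaround (the metric name here is hypothetical), the exposed text goes from an untyped sample to one with explicit metadata:

# Before: no metadata, so the receiver treats the type as UNSPECIFIED
my_app_requests_total 42

# After: explicit HELP/TYPE lines, as allowed by the exposition format
# HELP my_app_requests_total Total number of handled requests.
# TYPE my_app_requests_total counter
my_app_requests_total 42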

@gillg
Contributor Author

gillg commented Jun 29, 2021

@bruuuuuuuce what is wrong here?

# HELP go_memstats_frees_total Total number of frees.
# TYPE go_memstats_frees_total counter
go_memstats_frees_total 2.46306915e+08

It's an example of a metric that is considered "untyped" even though TYPE counter is defined.

@bruuuuuuuce

@gillg I am not sure; from my understanding of Prometheus metrics, that formatting looks right to me. Why do you say that it is untyped?

@gillg
Contributor Author

gillg commented Jun 29, 2021

Because it produces the traces shown in the first comment.
The metrics seem to be correctly exposed by the original exporter, but something goes wrong during processing and we lose all the metric metadata.
It happens most frequently on the https://grafana:3000/metrics page.

@gillg
Contributor Author

gillg commented Jun 29, 2021

Can anyone reopen this issue, please?

@Aneurysm9
Member

Can anyone reopen this issue, please?

@bogdandrutu @alolita

@gillg
Contributor Author

gillg commented Jul 28, 2021

Could someone re-open this? 🙏

I discovered a lot of new cases where my _total metrics are dropped somewhere between the receiver and the exporter.

As a new example, here is a set of metrics:

# HELP thanos_compact_garbage_collection_failures_total Total number of failed garbage collection operations.
# TYPE thanos_compact_garbage_collection_failures_total counter
thanos_compact_garbage_collection_failures_total 0
# HELP thanos_compact_garbage_collection_total Total number of garbage collection operations.
# TYPE thanos_compact_garbage_collection_total counter
thanos_compact_garbage_collection_total 4
# HELP thanos_compact_group_compaction_runs_completed_total Total number of group completed compaction runs. This also includes compactor group runs that resulted with no compaction.
# TYPE thanos_compact_group_compaction_runs_completed_total counter
thanos_compact_group_compaction_runs_completed_total{group="0@15509005380717446393"} 4
thanos_compact_group_compaction_runs_completed_total{group="300000@15509005380717446393"} 4
thanos_compact_group_compaction_runs_completed_total{group="3600000@15509005380717446393"} 4
# HELP thanos_compact_group_compaction_runs_started_total Total number of group compaction attempts.
# TYPE thanos_compact_group_compaction_runs_started_total counter
thanos_compact_group_compaction_runs_started_total{group="0@15509005380717446393"} 4
thanos_compact_group_compaction_runs_started_total{group="300000@15509005380717446393"} 4
thanos_compact_group_compaction_runs_started_total{group="3600000@15509005380717446393"} 4
# HELP thanos_compact_group_compactions_failures_total Total number of failed group compactions.
# TYPE thanos_compact_group_compactions_failures_total counter
thanos_compact_group_compactions_failures_total{group="0@15509005380717446393"} 0
thanos_compact_group_compactions_failures_total{group="300000@15509005380717446393"} 0
thanos_compact_group_compactions_failures_total{group="3600000@15509005380717446393"} 0
# HELP thanos_compact_group_compactions_total Total number of group compaction attempts that resulted in a new block.
# TYPE thanos_compact_group_compactions_total counter
thanos_compact_group_compactions_total{group="0@15509005380717446393"} 0
thanos_compact_group_compactions_total{group="300000@15509005380717446393"} 0
thanos_compact_group_compactions_total{group="3600000@15509005380717446393"} 0
# HELP thanos_compact_group_vertical_compactions_total Total number of group compaction attempts that resulted in a new block based on overlapping blocks.
# TYPE thanos_compact_group_vertical_compactions_total counter
thanos_compact_group_vertical_compactions_total{group="0@15509005380717446393"} 0
thanos_compact_group_vertical_compactions_total{group="300000@15509005380717446393"} 0
thanos_compact_group_vertical_compactions_total{group="3600000@15509005380717446393"} 0

@clmssz

clmssz commented Aug 24, 2021

Hi,
Having this exact same issue on main, with seemingly random but properly typed metrics. Can this issue be re-opened, please?

@locmai
Contributor

locmai commented Aug 24, 2021

I found out that the prometheusexporter is dropping some metrics on its end because of this:
https://github.com/open-telemetry/opentelemetry-collector/blob/cc41009d95166f0c5ab9a9bddbab9ec903ed163d/exporter/prometheusexporter/accumulator.go#L198

To check whether a metric has been dropped by the prometheusexporter, remove all the processors and add a file exporter with the JSON format to see whether the data is there (see the configuration sketch below). If it is, then we can be sure it was dropped by the exporter.

Otherwise, we should look at the prometheusreceiver next.
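
For example, a minimal debug pipeline could look roughly like this (a sketch only; the scrape target and output path are assumptions):

receivers:
  prometheus:
    config:
      scrape_configs:
        - job_name: grafana
          static_configs:
            - targets: ["grafana:3000"]

exporters:
  # The file exporter writes the received metrics as JSON, so the receiver
  # output can be inspected directly.
  file:
    path: /tmp/otel-metrics.json

service:
  pipelines:
    metrics:
      receivers: [prometheus]
      # intentionally no processors, so nothing can drop data in between
      exporters: [file]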

@Aneurysm9 Aneurysm9 reopened this Aug 27, 2021
@ericmustin
Contributor

👋 Just to +1: we have some end users reporting problems that match either this issue or #4907. I noticed you re-opened this issue, @Aneurysm9; are you currently investigating it?

cc @mx-psi

@bogdandrutu bogdandrutu transferred this issue from open-telemetry/opentelemetry-collector Aug 30, 2021
@alolita alolita added comp:aws AWS components comp:prometheus Prometheus related issues labels Sep 2, 2021
@etiennejournet

Hello, I might also have the same problem, running 0.37.1. @gillg, have you been able to find a solution?

See the missing metric points:

[screenshot: graph showing gaps in the collected series]

In that graph I'm trying to scrape container_cpu_usage_seconds_total (which is a counter).

It might be unrelated, but I'm running the collector in standalone mode (single replica) and I noticed that metrics are always consistent for targets on the same node as the collector. For targets located on another node, the behavior is erratic.

Will update.

@gillg
Contributor Author

gillg commented Nov 1, 2021

Hello @etiennejournet, for now I don't have any solution for the dropped counters. But that scenario does not seem to match your graph: in your case individual points are dropped, not entire time series.
I suspect that in your case the problem is a potential memory leak in the Prometheus receiver. In that situation, after some hours or days the collector starts to drop metrics because its memory limit is reached (see the memory_limiter sketch below).
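
For context, this kind of memory-based dropping usually comes from the memory_limiter processor; a typical configuration looks roughly like this (the limits shown are illustrative assumptions, not a recommendation):

processors:
  memory_limiter:
    # Check memory usage every second; once the soft limit is exceeded the
    # processor starts refusing data, which shows up as dropped points.
    check_interval: 1s
    limit_mib: 400
    spike_limit_mib: 100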

@etiennejournet

Thanks for your answer ;)

I don't think it's the memory leak problem: I had a careful look at your other thread about it, and my OpenTelemetry setup doesn't show any sign of refused/dropped metrics in the receiver or the exporters. I don't even see the memory leak in my own monitoring.

I'm going to dig into this further, thanks for your time ;)

@gillg
Contributor Author

gillg commented Jan 11, 2022

This problem seems to have disappeared with the latest otel collector contrib (v0.42.0 as of today).
I'm not sure how, but my guess is that it was fixed once the scraper was converted to internal OTel metrics instead of OpenCensus.

This can be closed for now :)

@gillg gillg closed this as completed Jan 11, 2022
hex1848 pushed a commit to hex1848/opentelemetry-collector-contrib that referenced this issue Jun 2, 2022
@chzhuo

chzhuo commented Jun 8, 2023

Same problem, running v0.78.0!
I am using the prometheusreceiver to scrape node-exporter, and all metrics ending with _total are dropped.

@prskr

prskr commented Jun 15, 2023

Same here, although not all _total metrics are dropped; for instance, container_cpu_usage_seconds_total (from cAdvisor) is dropped although container_cpu_usage_seconds is present.

@gillg
Contributor Author

gillg commented Jun 15, 2023

@chzhuo @baez90 your problem is different.
It's related to this: #20518

Related discussion here: #21743

Workaround in 0.78.0: --feature-gates=-pkg.translator.prometheus.NormalizeName (an example of passing this flag follows below)
This will be reverted by default in 0.80.0.
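
For example, with a docker-compose setup like the one used earlier in this thread, the gate can be disabled roughly like this (the image tag and config path are assumptions):

services:
  otel-collector:
    image: otel/opentelemetry-collector-contrib:0.78.0
    command:
      - "--config=/etc/otelcol-contrib/config.yaml"
      # Disable the Prometheus metric name normalization introduced in 0.78.0
      - "--feature-gates=-pkg.translator.prometheus.NormalizeName"
    volumes:
      - ./config.yaml:/etc/otelcol-contrib/config.yaml:ro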
