
Prometheus Receiver - Some counter metrics dropped for unknown reason #4974

Closed
gillg opened this issue May 6, 2021 · 29 comments
Labels
comp:aws AWS components comp:prometheus Prometheus related issues

Comments

@gillg
Contributor

gillg commented May 6, 2021

Describe the bug
While analyzing an OpenTelemetry Collector setup and trying to build dashboards from the collected metrics, I discovered that some metrics get dropped for an unknown reason.
I have a bunch of "dropped" metrics without any corresponding logs or traces.
Eventually I found log entries like:

info internal/metrics_adjuster.go:357 Adjust - skipping unexpected point {"kind": "receiver", "name": "prometheus", "type": "UNSPECIFIED"}

So they seem to be dropped because of an unspecified type. I added some debug logging in metricFamily.go to inspect their metadata, and it is empty.

As an example, this entire list disappears between the receiver and the exporter:

otel-collector    | cortex_deprecated_flags_inuse_total {Metric: Type: Help: Unit:}
otel-collector    | cortex_experimental_features_in_use_total {Metric: Type: Help: Unit:}
otel-collector    | go_memstats_alloc_bytes_total {Metric: Type: Help: Unit:}
otel-collector    | go_memstats_frees_total {Metric: Type: Help: Unit:}
otel-collector    | go_memstats_lookups_total {Metric: Type: Help: Unit:}
otel-collector    | go_memstats_mallocs_total {Metric: Type: Help: Unit:}
otel-collector    | grafana_api_admin_user_created_total {Metric: Type: Help: Unit:}
otel-collector    | grafana_api_dashboard_snapshot_create_total {Metric: Type: Help: Unit:}
otel-collector    | grafana_api_dashboard_snapshot_external_total {Metric: Type: Help: Unit:}
otel-collector    | grafana_api_dashboard_snapshot_get_total {Metric: Type: Help: Unit:}
otel-collector    | grafana_api_login_oauth_total {Metric: Type: Help: Unit:}
otel-collector    | grafana_api_login_post_total {Metric: Type: Help: Unit:}
otel-collector    | grafana_api_login_saml_total {Metric: Type: Help: Unit:}
otel-collector    | grafana_api_models_dashboard_insert_total {Metric: Type: Help: Unit:}
otel-collector    | grafana_api_org_create_total {Metric: Type: Help: Unit:}
otel-collector    | grafana_api_response_status_total {Metric: Type: Help: Unit:}
otel-collector    | grafana_api_user_signup_completed_total {Metric: Type: Help: Unit:}
otel-collector    | grafana_api_user_signup_invite_total {Metric: Type: Help: Unit:}
otel-collector    | grafana_api_user_signup_started_total {Metric: Type: Help: Unit:}
otel-collector    | grafana_aws_cloudwatch_get_metric_data_total {Metric: Type: Help: Unit:}
otel-collector    | grafana_aws_cloudwatch_get_metric_statistics_total {Metric: Type: Help: Unit:}
otel-collector    | grafana_aws_cloudwatch_list_metrics_total {Metric: Type: Help: Unit:}
otel-collector    | grafana_datasource_request_total {Metric: Type: Help: Unit:}
otel-collector    | grafana_db_datasource_query_by_id_total {Metric: Type: Help: Unit:}
otel-collector    | grafana_emails_sent_total {Metric: Type: Help: Unit:}
otel-collector    | grafana_instance_start_total {Metric: Type: Help: Unit:}
otel-collector    | grafana_page_response_status_total {Metric: Type: Help: Unit:}
otel-collector    | grafana_proxy_response_status_total {Metric: Type: Help: Unit:}
otel-collector    | http_request_total {Metric: Type: Help: Unit:}
otel-collector    | loki_logql_querystats_duplicates_total {Metric: Type: Help: Unit:}
otel-collector    | loki_logql_querystats_ingester_sent_lines_total {Metric: Type: Help: Unit:}
otel-collector    | process_cpu_seconds_total {Metric: Type: Help: Unit:}

I had to add custom logging to understand what happens: the metrics seem to be dropped because the Type is unspecified and they have no metadata from metricFamily.go.

For an unknown reason, some other metrics ending with _total work fine, for example:

node_cpu_seconds_total{cpu="0",mode="idle"}

Steps to reproduce
I don't know exactly... Try scraping the Grafana metrics endpoint at https://grafana:3000/metrics.
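
For reference, here is a minimal collector configuration along the lines of this setup (a sketch only: the job name, target address, exporter port, and scrape interval are assumptions, not the exact config used):

receivers:
  prometheus:
    config:
      scrape_configs:
        # Hypothetical scrape job hitting Grafana's internal metrics endpoint
        - job_name: grafana
          scrape_interval: 15s
          static_configs:
            - targets: ["grafana:3000"]

exporters:
  prometheus:
    endpoint: "0.0.0.0:8889"

service:
  pipelines:
    metrics:
      receivers: [prometheus]
      exporters: [prometheus]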

What did you expect to see?
Metrics should be kept internally and then be visible on the exporter side.

What did you see instead?
No metrics on the exporter side; they are probably dropped in metrics_adjuster.go (a log line including the metric name is definitely missing here).

What version did you use?
0.25

@gillg
Contributor Author

gillg commented May 6, 2021

All these dropped metrics also have negative side effects at the exporter level.

Each of them produces the following error in the Prometheus exporter. The metric seems to be only "partially" dropped: a reference must remain somewhere, but without a valid name or type...

otel-collector    | 2021-05-06T09:56:33.720Z    error   prometheusexporter/accumulator.go:103   failed to translate metric      {"kind": "exporter", "name": "prometheus", "data_type": "\u0000", "metric_name": ""}
otel-collector    | go.opentelemetry.io/collector/exporter/prometheusexporter.(*lastValueAccumulator).addMetric
otel-collector    |     go.opentelemetry.io/collector/exporter/prometheusexporter/accumulator.go:103
otel-collector    | go.opentelemetry.io/collector/exporter/prometheusexporter.(*lastValueAccumulator).Accumulate
otel-collector    |     go.opentelemetry.io/collector/exporter/prometheusexporter/accumulator.go:74
otel-collector    | go.opentelemetry.io/collector/exporter/prometheusexporter.(*collector).processMetrics
otel-collector    |     go.opentelemetry.io/collector/exporter/prometheusexporter/collector.go:54
otel-collector    | go.opentelemetry.io/collector/exporter/prometheusexporter.(*prometheusExporter).ConsumeMetrics
otel-collector    |     go.opentelemetry.io/collector/exporter/prometheusexporter/prometheus.go:100
otel-collector    | go.opentelemetry.io/collector/exporter/exporterhelper.(*metricsRequest).export
otel-collector    |     go.opentelemetry.io/collector/exporter/exporterhelper/metricshelper.go:57
otel-collector    | go.opentelemetry.io/collector/exporter/exporterhelper.(*timeoutSender).send
otel-collector    |     go.opentelemetry.io/collector/exporter/exporterhelper/common.go:215
otel-collector    | go.opentelemetry.io/collector/exporter/exporterhelper.(*retrySender).send
otel-collector    |     go.opentelemetry.io/collector/exporter/exporterhelper/queued_retry.go:241
otel-collector    | go.opentelemetry.io/collector/exporter/exporterhelper.(*metricsSenderWithObservability).send
otel-collector    |     go.opentelemetry.io/collector/exporter/exporterhelper/metricshelper.go:120
otel-collector    | go.opentelemetry.io/collector/exporter/exporterhelper.(*queuedRetrySender).send
otel-collector    |     go.opentelemetry.io/collector/exporter/exporterhelper/queued_retry.go:171
otel-collector    | go.opentelemetry.io/collector/exporter/exporterhelper.(*metricsExporter).ConsumeMetrics
otel-collector    |     go.opentelemetry.io/collector/exporter/exporterhelper/metricshelper.go:74
otel-collector    | go.opentelemetry.io/collector/processor/batchprocessor.(*batchMetrics).export
otel-collector    |     go.opentelemetry.io/collector/processor/batchprocessor/batch_processor.go:285
otel-collector    | go.opentelemetry.io/collector/processor/batchprocessor.(*batchProcessor).sendItems
otel-collector    |     go.opentelemetry.io/collector/processor/batchprocessor/batch_processor.go:183
otel-collector    | go.opentelemetry.io/collector/processor/batchprocessor.(*batchProcessor).processItem
otel-collector    |     go.opentelemetry.io/collector/processor/batchprocessor/batch_processor.go:156
otel-collector    | go.opentelemetry.io/collector/processor/batchprocessor.(*batchProcessor).startProcessingCycle
otel-collector    |     go.opentelemetry.io/collector/processor/batchprocessor/batch_processor.go:141

@rakyll
Contributor

rakyll commented May 6, 2021

cc @Aneurysm9 @alolita

@bogdandrutu
Member

Most likely they are the new types "gauge histogram, enum or state" which we don't support

@gburton1

We are seeing the same. Even a counter as simple as this one, which is present on the endpoint that the Prometheus receiver scrapes, does not appear on the exporter side.

# HELP process_cpu_seconds_total Total user and system CPU time spent in seconds.
# TYPE process_cpu_seconds_total counter
process_cpu_seconds_total 7368.08

@gillg
Contributor Author

gillg commented May 12, 2021

Most likely they are the new types "gauge histogram, enum or state" which we don't support

I don't understand: I compared metrics with exactly the same name on a node-exporter and on the Grafana internal exporter, and the metric is scraped correctly from node-exporter but not from Grafana... It makes no sense.

I suspected an invalid byte somewhere in the Grafana output, gzip compression, or transfer encoding, but I have found nothing so far...

@Aneurysm9
Member

I wonder if this is related to the adjustment logic being removed in open-telemetry/opentelemetry-collector#3047 that was dropping some samples. Can you try with a build from that branch, or try once it lands on the trunk?

@gillg
Contributor Author

gillg commented May 14, 2021

Don't you think there is a link with https://github.com/open-telemetry/opentelemetry-collector/issues/2852 ? It looks like a "regression" or something similar, since this was working before March.
The exact value of a counter may not matter much as long as its variation is consistent with the original exporter, but here I'm starting to have doubts: on some graphs based on cpu_time_total I get negative values.

@Aneurysm9 I haven't had time to try the branch from your PR yet, I will try it ASAP.

@gillg
Contributor Author

gillg commented May 17, 2021

@Aneurysm9 since that PR was merged, I tried directly on master, but it's the same thing. 😞
What puzzles me is that this only happens on the Grafana metrics page.

As an example, from node-exporter the otel-collector successfully records

# HELP go_memstats_frees_total Total number of frees.
# TYPE go_memstats_frees_total counter
go_memstats_frees_total{otel_job="node-exporter"} 6.332249818e+09 1621240036929

For the Grafana job this metric is dropped. The original exposed value is:

# HELP go_memstats_frees_total Total number of frees.
# TYPE go_memstats_frees_total counter
go_memstats_frees_total 2.46306915e+08

Even if I keep only the Grafana scrape job, the result is the same.

@gillg
Contributor Author

gillg commented May 26, 2021

Sadly, I got excited too soon: this does not fix the issue.
That fix changes the exporter accumulator, but here the issue is at the receiver level.
The error in the exporter accumulator stays the same with this new fix, since the data is already "corrupted" at the receiver level.

@gillg
Contributor Author

gillg commented May 26, 2021

@bogdandrutu can you reopen the issue please ?

@bruuuuuuuce

@gillg I am facing a similar issue with the prometheus receiver, did you find a workaround that worked for you?

@gillg
Contributor Author

gillg commented Jun 26, 2021

@gillg I am facing a similar issue with the prometheus receiver, did you find a workaround that worked for you?

Hello, unfortunately nothing so far. It's definitely not systematic, and it's not a majority of metrics, but it shows up a lot in some contexts, such as Grafana's own metrics.

@bruuuuuuuce

@gillg I am facing a similar issue with the prometheus receiver, did you find a workaround that worked for you?

Hello, unfortunately nothing so far. It's definitely not systematic, and it's not a majority of metrics, but it shows up a lot in some contexts, such as Grafana's own metrics.

Thanks for the response! I ended up changing my metrics to add a # TYPE metric_name metric_type line above each metric, as described in the Prometheus exposition format https://prometheus.io/docs/instrumenting/exposition_formats/#line-format. This fixed my issue with OpenTelemetry. It seems some code does not handle the case where the TYPE line is missing, even though Prometheus says it is optional.
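
To illustrate the workaround (the metric name here is hypothetical), the exposed text goes from an untyped sample to one with explicit metadata:

# Before: no metadata, so the receiver treats the type as UNSPECIFIED
my_app_requests_total 42

# After: explicit HELP/TYPE lines, as allowed by the exposition format
# HELP my_app_requests_total Total number of handled requests.
# TYPE my_app_requests_total counter
my_app_requests_total 42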

@gillg
Contributor Author

gillg commented Jun 29, 2021

@bruuuuuuuce what is wrong here?

# HELP go_memstats_frees_total Total number of frees.
# TYPE go_memstats_frees_total counter
go_memstats_frees_total 2.46306915e+08

It's an example of a metric that is considered "untyped" even though TYPE counter is defined.

@bruuuuuuuce

@gillg I am not sure; from my understanding of Prometheus metrics, that formatting looks right to me. Why do you say that it is untyped?

@gillg
Contributor Author

gillg commented Jun 29, 2021

Because it produces the traces shown in the first comment.
The metrics seem to be correctly exposed by the original exporter, but something goes wrong during processing and we lose all the metric metadata.
It happens most frequently on the https://grafana:3000/metrics page.

@gillg
Contributor Author

gillg commented Jun 29, 2021

Can anyone reopen this issue, please?

@Aneurysm9
Member

Can anyone reopen this issue, please?

@bogdandrutu @alolita

@gillg
Contributor Author

gillg commented Jul 28, 2021

Could someone re-open this? 🙏

I discovered a lot of new cases where my _total metrics are dropped somewhere between the receiver and the exporter.

As a new example, here is a set of metrics:

# HELP thanos_compact_garbage_collection_failures_total Total number of failed garbage collection operations.
# TYPE thanos_compact_garbage_collection_failures_total counter
thanos_compact_garbage_collection_failures_total 0
# HELP thanos_compact_garbage_collection_total Total number of garbage collection operations.
# TYPE thanos_compact_garbage_collection_total counter
thanos_compact_garbage_collection_total 4
# HELP thanos_compact_group_compaction_runs_completed_total Total number of group completed compaction runs. This also includes compactor group runs that resulted with no compaction.
# TYPE thanos_compact_group_compaction_runs_completed_total counter
thanos_compact_group_compaction_runs_completed_total{group="0@15509005380717446393"} 4
thanos_compact_group_compaction_runs_completed_total{group="300000@15509005380717446393"} 4
thanos_compact_group_compaction_runs_completed_total{group="3600000@15509005380717446393"} 4
# HELP thanos_compact_group_compaction_runs_started_total Total number of group compaction attempts.
# TYPE thanos_compact_group_compaction_runs_started_total counter
thanos_compact_group_compaction_runs_started_total{group="0@15509005380717446393"} 4
thanos_compact_group_compaction_runs_started_total{group="300000@15509005380717446393"} 4
thanos_compact_group_compaction_runs_started_total{group="3600000@15509005380717446393"} 4
# HELP thanos_compact_group_compactions_failures_total Total number of failed group compactions.
# TYPE thanos_compact_group_compactions_failures_total counter
thanos_compact_group_compactions_failures_total{group="0@15509005380717446393"} 0
thanos_compact_group_compactions_failures_total{group="300000@15509005380717446393"} 0
thanos_compact_group_compactions_failures_total{group="3600000@15509005380717446393"} 0
# HELP thanos_compact_group_compactions_total Total number of group compaction attempts that resulted in a new block.
# TYPE thanos_compact_group_compactions_total counter
thanos_compact_group_compactions_total{group="0@15509005380717446393"} 0
thanos_compact_group_compactions_total{group="300000@15509005380717446393"} 0
thanos_compact_group_compactions_total{group="3600000@15509005380717446393"} 0
# HELP thanos_compact_group_vertical_compactions_total Total number of group compaction attempts that resulted in a new block based on overlapping blocks.
# TYPE thanos_compact_group_vertical_compactions_total counter
thanos_compact_group_vertical_compactions_total{group="0@15509005380717446393"} 0
thanos_compact_group_vertical_compactions_total{group="300000@15509005380717446393"} 0
thanos_compact_group_vertical_compactions_total{group="3600000@15509005380717446393"} 0

@clmssz

clmssz commented Aug 24, 2021

Hi,
Having this exact same issue on main, with seemingly random but properly typed metrics. Can this issue be re-opened, please?

@locmai
Contributor

locmai commented Aug 24, 2021

I found out that the prometheusexporter is dropping some metrics on its end because of this:
https://github.com/open-telemetry/opentelemetry-collector/blob/cc41009d95166f0c5ab9a9bddbab9ec903ed163d/exporter/prometheusexporter/accumulator.go#L198

To check whether a metric has been dropped by the prometheusexporter, remove all the processors and add a file exporter with the JSON format to see whether the data is there (see the configuration sketch below). If it is, then we can be sure it was dropped by the exporter.

Otherwise, we should look at the prometheusreceiver next.
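
For example, a minimal debug pipeline could look roughly like this (a sketch only; the scrape target and output path are assumptions):

receivers:
  prometheus:
    config:
      scrape_configs:
        - job_name: grafana
          static_configs:
            - targets: ["grafana:3000"]

exporters:
  # The file exporter writes the received metrics as JSON, so the receiver
  # output can be inspected directly.
  file:
    path: /tmp/otel-metrics.json

service:
  pipelines:
    metrics:
      receivers: [prometheus]
      # intentionally no processors, so nothing can drop data in between
      exporters: [file]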

@Aneurysm9 Aneurysm9 reopened this Aug 27, 2021
@ericmustin
Contributor

👋 Just to +1: we have some end users reporting problems that match either this issue or #4907. I noticed you re-opened this issue, @Aneurysm9; are you currently investigating it?

cc @mx-psi

@bogdandrutu bogdandrutu transferred this issue from open-telemetry/opentelemetry-collector Aug 30, 2021
@alolita alolita added comp:aws AWS components comp:prometheus Prometheus related issues labels Sep 2, 2021
@etiennejournet

Hello, I might also have the same problem, running 0.37.1. @gillg, have you been able to find a solution?

See the missing metric points:

[screenshot: graph showing gaps in the collected series]

In that graph I'm trying to scrape container_cpu_usage_seconds_total (which is a counter).

It might be unrelated, but I'm running the collector in standalone mode (single replica) and I noticed that metrics are always consistent for targets on the same node as the collector. For targets located on another node, the behavior is erratic.

Will update.

@gillg
Contributor Author

gillg commented Nov 1, 2021

Hello @etiennejournet, for now I don't have any solution for the dropped counters. But that scenario does not seem to match your graph: in your case individual points are dropped, not entire time series.
I suspect that in your case the problem is a potential memory leak in the Prometheus receiver. In that situation, after some hours or days the collector starts to drop metrics because its memory limit is reached (see the memory_limiter sketch below).
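
For context, this kind of memory-based dropping usually comes from the memory_limiter processor; a typical configuration looks roughly like this (the limits shown are illustrative assumptions, not a recommendation):

processors:
  memory_limiter:
    # Check memory usage every second; once the soft limit is exceeded the
    # processor starts refusing data, which shows up as dropped points.
    check_interval: 1s
    limit_mib: 400
    spike_limit_mib: 100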

@etiennejournet

Thanks for your answer ;)

I don't think it's the memory leak problem: I had a careful look at your other thread about it, and my OpenTelemetry setup doesn't show any sign of refused/dropped metrics in the receiver or the exporters. I don't even see the memory leak in my own monitoring.

I'm going to dig into this further, thanks for your time ;)

@gillg
Contributor Author

gillg commented Jan 11, 2022

This problem seems to have disappeared with the latest otel collector contrib (v0.42.0 as of today).
I'm not sure how, but my guess is that it was fixed once the scraper was converted to internal OTel metrics instead of OpenCensus.

This can be closed for now :)

@gillg gillg closed this as completed Jan 11, 2022
hex1848 pushed a commit to hex1848/opentelemetry-collector-contrib that referenced this issue Jun 2, 2022
@chzhuo

chzhuo commented Jun 8, 2023

Same problem, running v0.78.0!
I am using the prometheusreceiver to scrape node-exporter, and all metrics ending with _total are dropped.

@prskr

prskr commented Jun 15, 2023

Same here, although not all _total metrics are dropped; for instance, container_cpu_usage_seconds_total (from cAdvisor) is dropped although container_cpu_usage_seconds is present.

@gillg
Contributor Author

gillg commented Jun 15, 2023

@chzhuo @baez90 your problem is different.
It's related to this: #20518

Related discussion here: #21743

Workaround in 0.78.0: --feature-gates=-pkg.translator.prometheus.NormalizeName (an example of passing this flag follows below)
This will be reverted by default in 0.80.0.
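
For example, with a docker-compose setup like the one used earlier in this thread, the gate can be disabled roughly like this (the image tag and config path are assumptions):

services:
  otel-collector:
    image: otel/opentelemetry-collector-contrib:0.78.0
    command:
      - "--config=/etc/otelcol-contrib/config.yaml"
      # Disable the Prometheus metric name normalization introduced in 0.78.0
      - "--feature-gates=-pkg.translator.prometheus.NormalizeName"
    volumes:
      - ./config.yaml:/etc/otelcol-contrib/config.yaml:ro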
