
prometheusremotewrite exporter with histogram is causing metrics export failure due to high memory (90%) #30675

Open
bhupeshpadiyar opened this issue Jan 19, 2024 · 5 comments
Labels
bug · exporter/prometheusremotewrite · needs triage · Stale

Comments

@bhupeshpadiyar

Component(s)

exporter/prometheusremotewrite

What happened?

Description

Collector memory and CPU usage spike while exporting histogram metrics with the prometheusremotewrite exporter, causing metrics export to fail with the following error logs.

2024-01-17T10:33:50.902Z info [email protected]/memorylimiter.go:287 Memory usage is above soft limit. Forcing a GC. {"kind": "processor", "name": "memory_limiter", "pipeline": "metrics", "cur_mem_mib": 16085}

2024-01-17T10:50:08.919Z error scrape/scrape.go:1351 Scrape commit failed {"kind": "receiver", "name": "prometheus", "data_type": "metrics", "scrape_pool": "otel-collector", "target": "http://0.0.0.0:8888/metrics", "error": "data refused due to high memory usage"}

2024-01-17T15:07:12.464Z warn [email protected]/memorylimiter.go:294 Memory usage is above soft limit. Refusing data. {"kind": "processor", "name": "memory_limiter", "pipeline": "metrics", "cur_mem_mib": 16081}

Note: This issue occurs only when exporting histogram metrics. The exporter works fine with counter and gauge type metrics.

Steps to Reproduce

  • Instrument histogram metrics using the OTEL-Java SDK (see the sketch after this list)
  • Monitor the collector health and other metrics
  • Initially, all histogram metrics are exported seamlessly
  • After a few minutes, collector memory, CPU usage, and exporter queue size spike, and the memory usage errors shown in the sample logs above start appearing
  • Metrics export then fails
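
For reference, a minimal sketch of the instrumentation used in the first step, assuming the standard OpenTelemetry Java API (the instrument and attribute names below are illustrative, not taken from our application):

// Record values into an explicit-bucket histogram via the OpenTelemetry Java API.
import io.opentelemetry.api.GlobalOpenTelemetry;
import io.opentelemetry.api.common.AttributeKey;
import io.opentelemetry.api.common.Attributes;
import io.opentelemetry.api.metrics.DoubleHistogram;
import io.opentelemetry.api.metrics.Meter;

public class HistogramExample {
    public static void main(String[] args) {
        // The Meter comes from the globally configured SDK, which exports OTLP to the collector.
        Meter meter = GlobalOpenTelemetry.getMeter("example-app");

        // Each recorded value contributes to the histogram's bucket counts, sum, and count.
        DoubleHistogram requestDuration = meter
            .histogramBuilder("http.server.request.duration")
            .setUnit("ms")
            .setDescription("Server-side request latency")
            .build();

        // Every distinct attribute set produces its own histogram series, so high-cardinality
        // attributes multiply the number of data points the exporter has to convert.
        requestDuration.record(123.4,
            Attributes.of(AttributeKey.stringKey("http.route"), "/orders"));
    }
}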

Expected Result

All metric types (counter, gauge, histogram) should be exported seamlessly, without errors.

Actual Result

Exporting histogram metrics causes high memory usage errors, and the export fails.

(Same memory_limiter and scrape error logs as shown in the description above.)
[Attached screenshots: collector memory/CPU, exporter queue size, batch metrics, metric point rate]

Collector version

0.92.0 (Confirmed with older collector versions as well)

Environment information

# Environment
OS: Linux/ARM64 - AWS ECS (Fargate) Cluster

OpenTelemetry Collector configuration

receivers:
  otlp:
    protocols:
      grpc:
        endpoint: 0.0.0.0:4318

  # Collect own metrics
  prometheus:
    config:
      scrape_configs:
        - job_name: 'otel-collector'
          scrape_interval: 10s
          static_configs:
            - targets: [ '0.0.0.0:8888' ]
exporters:
  logging:
    verbosity: "basic"
    sampling_initial: 5
  prometheusremotewrite:
    endpoint: "http://<victoria-metrics-instance>:8428/prometheus/api/v1/write"
    tls:
      insecure: true
processors:
  batch:
    timeout:
    send_batch_size:
    send_batch_max_size:
  memory_limiter:
    check_interval: 200ms
    limit_mib: 20000
    spike_limit_mib: 4000
extensions:
  health_check:
  pprof:
    endpoint: 0.0.0.0:1888
    block_profile_fraction: 3
    mutex_profile_fraction: 5
  zpages:
    endpoint: 0.0.0.0:55679
service:
  extensions: [health_check, pprof, zpages]
  pipelines:
    metrics:
      receivers: [otlp, prometheus]
      processors: [memory_limiter, batch]
      exporters: [logging, prometheusremotewrite]

Log output

See the memory_limiter and scrape error logs in the description above.


Additional context

This issue occurs only when exporting histogram metrics; the exporter works fine with counter and gauge type metrics.
@bhupeshpadiyar added the bug and needs triage labels on Jan 19, 2024
Contributor

Pinging code owners:

See Adding Labels via Comments if you do not have permissions to add labels yourself.

@crobert-1
Member

Note: This is possibly a duplicate of #24405

@bhupeshpadiyar
Author

Hi @crobert-1,

Just to clarify: we are seeing this issue with plain Histogram (explicit-bucket) metrics, whereas the linked issue appears to be about Exponential Histogram metrics.
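
For reference on the distinction, a minimal sketch assuming the standard opentelemetry-sdk-metrics View API (class names here are only illustrative): which of the two data types is produced depends on the aggregation configured for histogram instruments.

// Pin histogram instruments to classic explicit-bucket aggregation (Histogram data points),
// rather than base-2 exponential aggregation (ExponentialHistogram data points, as in #24405).
import io.opentelemetry.sdk.metrics.Aggregation;
import io.opentelemetry.sdk.metrics.InstrumentSelector;
import io.opentelemetry.sdk.metrics.InstrumentType;
import io.opentelemetry.sdk.metrics.SdkMeterProvider;
import io.opentelemetry.sdk.metrics.View;

public class HistogramAggregationExample {
    public static void main(String[] args) {
        SdkMeterProvider meterProvider = SdkMeterProvider.builder()
            .registerView(
                InstrumentSelector.builder().setType(InstrumentType.HISTOGRAM).build(),
                View.builder()
                    // Classic Histogram data points (the case reported in this issue).
                    .setAggregation(Aggregation.explicitBucketHistogram())
                    // ExponentialHistogram data points (the linked issue) would instead use:
                    // .setAggregation(Aggregation.base2ExponentialBucketHistogram())
                    .build())
            .build();

        // In a real setup the builder would also register an OTLP metric reader/exporter.
    }
}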

Contributor

This issue has been inactive for 60 days. It will be closed in 60 days if there is no activity. To ping code owners by adding a component label, see Adding Labels via Comments, or if you are unsure of which component this issue relates to, please ping @open-telemetry/collector-contrib-triagers. If this issue is still relevant, please ping the code owners or leave a comment explaining why it is still relevant. Otherwise, please close it.

Pinging code owners:

See Adding Labels via Comments if you do not have permissions to add labels yourself.

Contributor

This issue has been inactive for 60 days. It will be closed in 60 days if there is no activity. To ping code owners by adding a component label, see Adding Labels via Comments, or if you are unsure of which component this issue relates to, please ping @open-telemetry/collector-contrib-triagers. If this issue is still relevant, please ping the code owners or leave a comment explaining why it is still relevant. Otherwise, please close it.

Pinging code owners:

See Adding Labels via Comments if you do not have permissions to add labels yourself.
