OOM caused by servicegraph connector #30634

Open
wjh0914 opened this issue Jan 17, 2024 · 9 comments
Assignees
Labels
bug (Something isn't working), connector/servicegraph

Comments

@wjh0914

wjh0914 commented Jan 17, 2024

Component(s)

connector/servicegraph

What happened?

Description

We are trying to use the servicegraph connector to generate a service topology, and we ran into an OOM issue.

When the OTel Collector starts, the memory keeps growing:
[screenshot: collector memory usage graph]

The profile shows that pmap takes a lot of memory for the servicegraph connector:

[screenshots: pprof memory profile and top output]

Collector version

0.89.0

Environment information

Environment

OS: (e.g., "Ubuntu 20.04")
Compiler(if manually compiled): (e.g., "go 14.2")

OpenTelemetry Collector configuration

    receivers:
      otlp:
        protocols:
          grpc:
          http:
      jaeger:
        protocols:
          grpc:
          thrift_binary:
          thrift_compact:
          thrift_http:
      otlp/spanmetrics:
        protocols:
          grpc:
            endpoint: "0.0.0.0:12345"
      otlp/servicegraph:
        protocols:
          grpc:
            endpoint: "0.0.0.0:23456"
    exporters:
      logging:
        loglevel: info
      prometheus:
        endpoint: "0.0.0.0:8869"
        metric_expiration: 8760h
      prometheus/servicegraph:
        endpoint: "0.0.0.0:9090"
        metric_expiration: 8760h
      prometheusremotewrite:
        endpoint: 'http://vminsert-sample-vmcluster.svc.cluster.local:8480/insert/0/prometheus/api/v1/write'
        remote_write_queue:
          queue_size: 10000
          num_consumers: 5
        target_info:
          enabled: false
      otlp:
        endpoint: ats-sample-jaeger-collector.ranoss:4317
        tls:
          insecure: true
        sending_queue:
          enabled: true
          num_consumers: 20
          queue_size: 10000
    processors:
      transform:
        trace_statements:
          - context: resource
            statements:
            - replace_match(attributes["namespace"], "","unknownnamespace")
            - replace_match(attributes["apptenantname"], "","unknownapptenant")
            - replace_match(attributes["appname"], "","unknownapp")
            - replace_match(attributes["componentname"], "","unknowncomponent")
            - replace_match(attributes["podname"], "","unknownpod")
            - limit(attributes, 100, [])
            - truncate_all(attributes, 4096)
      resource:
        attributes:
          - key: apptenantname
            action: insert
            value: unknownapptenant
          - key: apptenantname
            action: update
            from_attribute: namespace
          - key: namespace
            action: insert
            value: unknownnamespace
          - key: componentname
            action: insert
            value: unknowncomponent
          - key: appname
            action: insert
            value: unknownapplication
          - key: podname
            action: insert
            value: unknownpod
      batch:
        send_batch_size: 200
        send_batch_max_size: 200
      filter/spans:
        traces:
          span:
            - 'kind != 2'
      filter/servicegraph:
        traces:
          span:
            - 'kind != 2 and kind != 3'
      spanmetrics:
        metrics_exporter: prometheus
        latency_histogram_buckets: [10ms, 50ms, 100ms, 200ms, 400ms, 800ms, 1s, 1400ms,  2s, 3s, 5s, 8s, 10s,12s,20s,32s,1m, 2m, 5m,10m, 30m]
        dimensions:
          - name: namespace
          - name: http.method
          - name: http.status_code
          - name: appname
          - name: componentname
          - name: podname
        dimensions_cache_size: 20000000
        metrics_flush_interval: 29s
    extensions:
      pprof:
        endpoint: '0.0.0.0:1777'
    connectors: 
      servicegraph:
        latency_histogram_buckets: [10ms, 50ms, 100ms, 200ms, 400ms, 800ms, 1s, 1400ms,  2s, 3s, 5s, 8s, 10s,12s,20s,32s,1m, 2m, 5m,10m, 30m]
        dimensions: [namespace,appname,componentname,podname]
        store:
          ttl: 120s
          max_items: 200000
        metrics_flush_interval: 59s
    service:
      pipelines:
        traces:
          receivers: [otlp, jaeger]
          processors: [resource, transform, batch]
          exporters: [otlp]
        metrics:
          receivers: [otlp]
          exporters: [prometheusremotewrite]
        metrics/spanmetrics:
          receivers: [otlp/spanmetrics]
          exporters: [prometheus]
        traces/spanmetrics:
          receivers: [otlp, jaeger]
          processors: [filter/spans,spanmetrics]
          exporters: [logging]
        metrics/servicegraph:
          receivers: [servicegraph]
          exporters: [prometheus/servicegraph]
        traces/servicegraph:
          receivers: [otlp, jaeger]
          processors: [filter/servicegraph]
          exporters: [servicegraph]
      extensions: [pprof]

Log output

No response

Additional context

No response

@wjh0914 added the bug (Something isn't working) and needs triage (New item requiring triage) labels on Jan 17, 2024
github-actions bot
Contributor

Pinging code owners:

See Adding Labels via Comments if you do not have permissions to add labels yourself.

@Frapschen
Contributor

Related: #29762

@jpkrohling removed the needs triage (New item requiring triage) label on Jan 25, 2024
@jpkrohling self-assigned this on Jan 25, 2024
@luistilingue

I'm having the same issue with the servicegraph connector :(

Contributor

github-actions bot commented May 6, 2024

This issue has been inactive for 60 days. It will be closed in 60 days if there is no activity. To ping code owners by adding a component label, see Adding Labels via Comments, or if you are unsure of which component this issue relates to, please ping @open-telemetry/collector-contrib-triagers. If this issue is still relevant, please ping the code owners or leave a comment explaining why it is still relevant. Otherwise, please close it.

Pinging code owners:

See Adding Labels via Comments if you do not have permissions to add labels yourself.

@github-actions bot added the Stale label on May 6, 2024
@jpkrohling removed the Stale label on May 7, 2024
@t00mas
Contributor

t00mas commented May 7, 2024

Assign to me please, I'll have a look at this.

@jpkrohling assigned t00mas and unassigned jpkrohling on May 7, 2024
@t00mas
Contributor

t00mas commented May 8, 2024

Without deep diving into your detailed use case, there are a couple of things you can try:

  • servicegraph.store.max_items seems very high - do you really need 200k edges in your servicegraph awaiting completion? (a sketch of a smaller store is shown after this list)
  • have you tried setting GOMEMLIMIT?
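
For reference, a minimal sketch of a smaller edge store (the value is an illustrative assumption, not a tested recommendation):

connectors:
  servicegraph:
    store:
      ttl: 120s
      max_items: 20000   # illustrative: an order of magnitude below the 200000 in the config above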

@rlankfo
Contributor

rlankfo commented May 29, 2024

@wjh0914 does this continue happening if you remove podname from dimensions?
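
For reference, that experiment would only touch the connector's dimensions list from the config above (a sketch, with podname dropped to reduce label cardinality):

connectors:
  servicegraph:
    dimensions: [namespace, appname, componentname]   # podname removed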

@Frapschen
Contributor

You can check these metrics:

rate(otelcol_connector_servicegraph_total_edges[1m])
rate(otelcol_connector_servicegraph_expired_edges[1m])

They can help you understand what is happening with the edges in the store.
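
If it helps, these queries could also be captured as Prometheus recording rules so the edge churn is easy to graph over time (a sketch; the rule names are made up):

groups:
  - name: servicegraph-edge-churn
    rules:
      - record: servicegraph:total_edges:rate1m
        expr: rate(otelcol_connector_servicegraph_total_edges[1m])
      - record: servicegraph:expired_edges:rate1m
        expr: rate(otelcol_connector_servicegraph_expired_edges[1m])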

@t00mas
Contributor

t00mas commented Jun 13, 2024

I've been testing this, with some mixed results.

With the same config, and also with pared-down variants of it, there seems to be a slow memory creep over time, so I was able to reproduce this in a limited way.

What's more interesting is that I think it's due to the GC not running as early as it could, or waiting too long between runs. I was able to make the memory consumption stable using the GOGC and GOMEMLIMIT env vars, so I advise anyone to try that too.

This is probably also a case where giving the collector instance more memory is counterproductive: the default GOGC is 100, so the heap may fill up considerably before GC runs are triggered.

tl;dr: Didn't find a clear mem leak, but a combination of env vars GOGC << 100 and GOMEMLIMIT as a soft-limit can trigger earlier GC runs and make the mem usage stable.
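
For anyone who wants to try this, a minimal sketch of the two env vars on a Kubernetes collector container (the values are illustrative assumptions; GOMEMLIMIT is typically set a bit below the container memory limit):

containers:
  - name: otel-collector
    resources:
      limits:
        memory: 2Gi
    env:
      - name: GOGC
        value: "50"        # illustrative: collect more aggressively than the default of 100
      - name: GOMEMLIMIT
        value: "1700MiB"   # illustrative: soft limit kept below the 2Gi container limit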
