OOM caused by servicegraph connector #30634

Open
wjh0914 opened this issue Jan 17, 2024 · 9 comments
Assignees
Labels
bug (Something isn't working), connector/servicegraph

Comments

@wjh0914

wjh0914 commented Jan 17, 2024

Component(s)

connector/servicegraph

What happened?

Description

We are trying to use the servicegraph connector to generate a service topology, and we ran into an OOM issue.

When the OTel Collector starts, the memory keeps growing:
[screenshot: collector memory usage graph]

The profile shows that pmap takes a lot of memory for the servicegraph connector:

[screenshots: pprof memory profile and top output]

Collector version

0.89.0

Environment information

Environment

OS: (e.g., "Ubuntu 20.04")
Compiler(if manually compiled): (e.g., "go 14.2")

OpenTelemetry Collector configuration

    receivers:
      otlp:
        protocols:
          grpc:
          http:
      jaeger:
        protocols:
          grpc:
          thrift_binary:
          thrift_compact:
          thrift_http:
      otlp/spanmetrics:
        protocols:
          grpc:
            endpoint: "0.0.0.0:12345"
      otlp/servicegraph:
        protocols:
          grpc:
            endpoint: "0.0.0.0:23456"
    exporters:
      logging:
        loglevel: info
      prometheus:
        endpoint: "0.0.0.0:8869"
        metric_expiration: 8760h
      prometheus/servicegraph:
        endpoint: "0.0.0.0:9090"
        metric_expiration: 8760h
      prometheusremotewrite:
        endpoint: 'http://vminsert-sample-vmcluster.svc.cluster.local:8480/insert/0/prometheus/api/v1/write'
        remote_write_queue:
          queue_size: 10000
          num_consumers: 5
        target_info:
          enabled: false
      otlp:
        endpoint: ats-sample-jaeger-collector.ranoss:4317
        tls:
          insecure: true
        sending_queue:
          enabled: true
          num_consumers: 20
          queue_size: 10000
    processors:
      transform:
        trace_statements:
          - context: resource
            statements:
            - replace_match(attributes["namespace"], "","unknownnamespace")
            - replace_match(attributes["apptenantname"], "","unknownapptenant")
            - replace_match(attributes["appname"], "","unknownapp")
            - replace_match(attributes["componentname"], "","unknowncomponent")
            - replace_match(attributes["podname"], "","unknownpod")
            - limit(attributes, 100, [])
            - truncate_all(attributes, 4096)
      resource:
        attributes:
          - key: apptenantname
            action: insert
            value: unknownapptenant
          - key: apptenantname
            action: update
            from_attribute: namespace
          - key: namespace
            action: insert
            value: unknownnamespace
          - key: componentname
            action: insert
            value: unknowncomponent
          - key: appname
            action: insert
            value: unknownapplication
          - key: podname
            action: insert
            value: unknownpod
      batch:
        send_batch_size: 200
        send_batch_max_size: 200
      filter/spans:
        traces:
          span:
            - 'kind != 2'
      filter/servicegraph:
        traces:
          span:
            - 'kind != 2 and kind != 3'
      spanmetrics:
        metrics_exporter: prometheus
        latency_histogram_buckets: [10ms, 50ms, 100ms, 200ms, 400ms, 800ms, 1s, 1400ms,  2s, 3s, 5s, 8s, 10s,12s,20s,32s,1m, 2m, 5m,10m, 30m]
        dimensions:
          - name: namespace
          - name: http.method
          - name: http.status_code
          - name: appname
          - name: componentname
          - name: podname
        dimensions_cache_size: 20000000
        metrics_flush_interval: 29s
    extensions:
      pprof:
        endpoint: '0.0.0.0:1777'
    connectors: 
      servicegraph:
        latency_histogram_buckets: [10ms, 50ms, 100ms, 200ms, 400ms, 800ms, 1s, 1400ms,  2s, 3s, 5s, 8s, 10s,12s,20s,32s,1m, 2m, 5m,10m, 30m]
        dimensions: [namespace,appname,componentname,podname]
        store:
          ttl: 120s
          max_items: 200000
        metrics_flush_interval: 59s
    service:
      pipelines:
        traces:
          receivers: [otlp, jaeger]
          processors: [resource, transform, batch]
          exporters: [otlp]
        metrics:
          receivers: [otlp]
          exporters: [prometheusremotewrite]
        metrics/spanmetrics:
          receivers: [otlp/spanmetrics]
          exporters: [prometheus]
        traces/spanmetrics:
          receivers: [otlp, jaeger]
          processors: [filter/spans,spanmetrics]
          exporters: [logging]
        metrics/servicegraph:
          receivers: [servicegraph]
          exporters: [prometheus/servicegraph]
        traces/servicegraph:
          receivers: [otlp, jaeger]
          processors: [filter/servicegraph]
          exporters: [servicegraph]
      extensions: [pprof]

Log output

No response

Additional context

No response

@wjh0914 added the bug (Something isn't working) and needs triage (New item requiring triage) labels on Jan 17, 2024
github-actions bot
Contributor

Pinging code owners:

See Adding Labels via Comments if you do not have permissions to add labels yourself.

@Frapschen
Contributor

Related: #29762

@jpkrohling removed the needs triage (New item requiring triage) label on Jan 25, 2024
@jpkrohling self-assigned this on Jan 25, 2024
@luistilingue

I'm having the same issue with the servicegraph connector :(

Contributor

github-actions bot commented May 6, 2024

This issue has been inactive for 60 days. It will be closed in 60 days if there is no activity. To ping code owners by adding a component label, see Adding Labels via Comments, or if you are unsure of which component this issue relates to, please ping @open-telemetry/collector-contrib-triagers. If this issue is still relevant, please ping the code owners or leave a comment explaining why it is still relevant. Otherwise, please close it.

Pinging code owners:

See Adding Labels via Comments if you do not have permissions to add labels yourself.

@github-actions bot added the Stale label on May 6, 2024
@jpkrohling removed the Stale label on May 7, 2024
@t00mas
Contributor

t00mas commented May 7, 2024

Assign to me please, I'll have a look at this.

@jpkrohling assigned t00mas and unassigned jpkrohling on May 7, 2024
@t00mas
Contributor

t00mas commented May 8, 2024

Without deep diving into your detailed use case, there are a couple of things you can try:

  • servicegraph.store.max_items seems very high - do you really need 200k edges in your servicegraph awaiting completion? (a sketch of a smaller store is shown after this list)
  • have you tried setting GOMEMLIMIT?
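
For reference, a minimal sketch of a smaller edge store (the value is an illustrative assumption, not a tested recommendation):

connectors:
  servicegraph:
    store:
      ttl: 120s
      max_items: 20000   # illustrative: an order of magnitude below the 200000 in the config above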

@rlankfo
Contributor

rlankfo commented May 29, 2024

@wjh0914 does this continue happening if you remove podname from dimensions?
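
For reference, that experiment would only touch the connector's dimensions list from the config above (a sketch, with podname dropped to reduce label cardinality):

connectors:
  servicegraph:
    dimensions: [namespace, appname, componentname]   # podname removed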

@Frapschen
Contributor

You can check these metrics:

rate(otelcol_connector_servicegraph_total_edges[1m])
rate(otelcol_connector_servicegraph_expired_edges[1m])

They can help you understand what is happening with the edges in the store.
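
If it helps, these queries could also be captured as Prometheus recording rules so the edge churn is easy to graph over time (a sketch; the rule names are made up):

groups:
  - name: servicegraph-edge-churn
    rules:
      - record: servicegraph:total_edges:rate1m
        expr: rate(otelcol_connector_servicegraph_total_edges[1m])
      - record: servicegraph:expired_edges:rate1m
        expr: rate(otelcol_connector_servicegraph_expired_edges[1m])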

@t00mas
Contributor

t00mas commented Jun 13, 2024

I've been testing this, with some mixed results.

With the same config, and also with pared-down variants of it, there seems to be a slow memory creep over time, so I was able to reproduce this in a limited way.

What's more interesting is that I think it's due to the GC not running as early as it could, or waiting too long between runs. I was able to make the memory consumption stable using the GOGC and GOMEMLIMIT env vars, so I advise anyone to try that too.

This is probably also a case where giving the collector instance more memory is counterproductive: the default GOGC is 100, so the heap may fill up considerably before GC runs are triggered.

tl;dr: Didn't find a clear mem leak, but a combination of env vars GOGC << 100 and GOMEMLIMIT as a soft-limit can trigger earlier GC runs and make the mem usage stable.
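
For anyone who wants to try this, a minimal sketch of the two env vars on a Kubernetes collector container (the values are illustrative assumptions; GOMEMLIMIT is typically set a bit below the container memory limit):

containers:
  - name: otel-collector
    resources:
      limits:
        memory: 2Gi
    env:
      - name: GOGC
        value: "50"        # illustrative: collect more aggressively than the default of 100
      - name: GOMEMLIMIT
        value: "1700MiB"   # illustrative: soft limit kept below the 2Gi container limit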
