
[exporter/datadogexporter] Potential memory leak in Datadog exporter #15720

Closed
indrekj opened this issue Oct 31, 2022 · 19 comments
Labels
bug (Something isn't working) · exporter/datadog (Datadog components) · priority:p1 (High) · Stale

Comments

indrekj (Contributor) commented Oct 31, 2022

What happened?

We have an issue where our OpenTelemetry Collector (the agent) shows high memory usage.

See memory usage:
[screenshot: memory usage, 2022-10-31 18-28-03]

When the memory usage gets high, the collector starts rejecting spans:
[screenshot: rejected spans, 2022-10-31 18-29-03]

The traffic did not change much over the period shown in the graphs. I've also increased the memory limits ~3x without success (the screenshots use the latest, largest values). When the pod eventually gets OOM-killed, it works fine again until memory climbs back up.

I also ran a pprof heap report:
[pprof heap graph]

The pprof report seems to indicate that something is up with the datadog-agent Concentrator: it appears to grow without bound. I tried looking into the otel contrib and datadog-agent codebases, but I couldn't figure out what exactly calls it.

Collector version

v0.63.0

Environment information

Environment

Kubernetes, running in a pod (2 replicas atm).

  resources:
    requests:
      cpu: 400m
      memory: 2048Mi
    limits:
      cpu: 800m
      memory: 2304Mi

OpenTelemetry Collector configuration

config:
    receivers:
      otlp/sidecar:
        protocols:
          grpc:
            endpoint: 0.0.0.0:4317
            tls:
              cert_file: /etc/certs/opentelemetry/tls.crt
              key_file: /etc/certs/opentelemetry/tls.key
              client_ca_file: /etc/certs/opentelemetry/ca.crt
    processors:
      attributes/upsert:
        actions:
          - key: deployment.environment
            value: 'prod-us'
            action: upsert
      memory_limiter:
        check_interval: 1s
        limit_mib: 2000
        spike_limit_mib: 400
      resourcedetection:
        detectors:
          - env
        override: false
      k8sattributes: {}
      batch:
        # Datadog APM Intake limit is 3.2MB. Let's make sure the batches do not
        # go over that.
        send_batch_max_size: 1000
        send_batch_size: 100
        # This has to be exactly 10 seconds in case DataDog exporter is used:
        # https://docs.datadoghq.com/tracing/setup_overview/open_standards/#ingesting-opentelemetry-traces-with-the-collector
        timeout: 10s
    exporters:
      datadog:
        api:
          key: "${DATADOG_API_KEY}"
          site: datadoghq.com.
        traces:
          span_name_remappings:
            # From ruby services:
            opentelemetry_instrumentation_faraday.client: faraday.client
            opentelemetry_instrumentation_pg.client: pg.client
            opentelemetry_instrumentation_que.consumer: que.consumer
            opentelemetry_instrumentation_que.producer: que.producer
            opentelemetry_instrumentation_rack.server: rack.server
            opentelemetry_instrumentation_redis.client: redis.client
            opentelemetry_instrumentation_sidekiq.consumer: sidekiq.consumer
            opentelemetry_instrumentation_sidekiq.producer: sidekiq.producer
            opentelemetry_instrumentation_sinatra.server: sinatra.server

            # From elixir services:
            opentelemetry_oban.consumer: oban.consumer
            opentelemetry_oban.internal: oban.internal
            opentelemetry_oban.producer: oban.producer
            opentelemetry_tesla.client: tesla.client
            opentelemetry_phoenix.server: phoenix.server
            opentelemetry_ecto.client: ecto.client
    # Extensions are available primarily for tasks that do not involve processing telemetry data.
    # Examples of extensions include health monitoring, service discovery, and data forwarding.
    extensions:
      pprof:
        endpoint: "0.0.0.0:1777"
        block_profile_fraction: 3
        mutex_profile_fraction: 5
      memory_ballast:
        size_mib: 600
    # The service section is used to configure what components are enabled in the Collector
    # based on the configuration found in the receivers, processors, exporters, and extensions sections.
    service:
      # Extensions consist of a list of all extensions to enable.
      extensions:
        - pprof
        - memory_ballast
      # A pipeline consists of a set of receivers, processors and exporters.
      # Pipelines can be of the following types: traces, metrics and logs
      pipelines:
        traces:
          receivers:
            - otlp/sidecar
          processors:
            # The memory_limiter processor should be the first processor
            # defined in the pipeline (immediately after the receivers). This
            # is to ensure that backpressure can be sent to applicable
            # receivers and minimize the likelihood of dropped data when the
            # memory_limiter gets triggered.
            - memory_limiter
            # resourcedetection must come before k8sattributes
            - resourcedetection
            - k8sattributes
            - attributes/upsert
            - batch
          exporters:
            - datadog

Log output

No response

Additional context

No response

indrekj added the bug (Something isn't working) and needs triage (New item requiring triage) labels Oct 31, 2022
indrekj changed the title from "Potential memory leak in Datadog exporter" to "[exporter/datadogexporter] Potential memory leak in Datadog exporter" Nov 1, 2022
fatsheep9146 added the exporter/datadog (Datadog components) label and removed the needs triage (New item requiring triage) label Nov 1, 2022
github-actions bot commented Nov 1, 2022

Pinging code owners: @KSerrania @mx-psi @gbbr @knusbaum @amenasria @dineshg13. See Adding Labels via Comments if you do not have permissions to add labels yourself.

gbbr (Member) commented Nov 1, 2022

Hi Indrek!

Thanks for reporting this. It's quite an interesting edge case that I've never seen before. Some additional information would help in trying to reproduce this:

  • A tcpdump
  • The output of using the fileexporter
  • Memory & CPU profile files

I realise it's asking for a lot, but otherwise it will be hard to reproduce.

gbbr (Member) commented Nov 1, 2022

Additionally, for no apparent reason other than a hunch - can you try removing the span_name_remappings and see if that changes anything?

Lastly, is something missing from the yaml config you posted? I see an attributes/upsert processor being used, but its definition is not there.

indrekj (Contributor, Author) commented Nov 1, 2022

I will start going through things one by one.

Lastly, is something missing from the yaml config you posted?

You're correct. We're building the config file using Helm and I overlooked a shared file. I've edited the original post. Only the attributes/upsert was missing.

indrekj (Contributor, Author) commented Nov 1, 2022

Additionally, for not apparent reason other than a hunch - can you try removing the span_name_remappings and see if that changes anything?

No change.

indrekj (Contributor, Author) commented Nov 1, 2022

Memory & CPU profile files

CPU (generated with: go tool pprof -png "localhost:1777/debug/pprof/profile?seconds=60" > ./cpu2.png)
pprof.otelcol-contrib.samples.cpu.002.pb.gz
cpu2

Memory (generated with go tool pprof -png localhost:1777/debug/pprof/heap > ./heap9.png):
pprof.otelcol-contrib.alloc_objects.alloc_space.inuse_objects.inuse_space.012.pb.gz
heap9

I hope I generated the profiles correctly. I've also attached the images generated from them.

indrekj (Contributor, Author) commented Nov 1, 2022

A tcpdump

Anything specific I could look into there myself? I can currently only reproduce this easily in the production environment, and the tcpdump would contain sensitive information, so I'd prefer not to share it.

gbbr (Member) commented Nov 1, 2022

I see. Thanks for checking. What about the output of the fileexporter? Would that also be sensitive? Would it help at all if you sent it privately instead of posting it here?


indrekj (Contributor, Author) commented Nov 2, 2022

That might be sensitive as well, yes. I haven't tried it out to see what exactly it contains.

For the tcpdump, would tcpdump port 443 be useful? That would only contain the requests between the collector and the Datadog VPC endpoint.

Any tips on what you would look for in those files? I could try to analyze them myself, but I'm not sure exactly what to look for.

EDIT: FYI I see traffic to api.datadoghq.com and trace.agent.datadoghq.com

indrekj (Contributor, Author) commented Nov 2, 2022

What's really weird to me is that I can't find what starts the Concentrator, or when. The Concentrator is part of the datadog-agent package, but I don't see anything starting it or using it in the datadogexporter. Do you know what starts it and what it is being used for?

gbbr (Member) commented Nov 3, 2022

@indrekj it's started by agnt.Run in the traces_exporter.go file. Thanks for looking into this yourself.

To share potentially sensitive data with us, I recommend reaching out via our support. Feel free to mention my name for a faster turnaround.

mx-psi added the priority:p1 (High) label Nov 3, 2022
indrekj (Contributor, Author) commented Nov 3, 2022

@indrekj it's started by agnt.Run in the traces_exporter.go file

I'm dumb. I did a git pull against my old forked repo and that's why it didn't include that part of the code. Thanks.

I think I made some progress though. I was investigating the Concentrator's flushNow and addNow functions. Old buckets always seem to get flushed eventually, but there is no upper bound: theoretically you can add spans that end in the far future, and their buckets won't be flushed until that time arrives.
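
To illustrate the mechanism, here is a minimal toy sketch of a time-bucketed aggregator in the spirit of the Concentrator (the bucket width, names, and types are assumptions for illustration only, not the datadog-agent's actual code): spans are keyed by the bucket containing their end time, and flushing only removes buckets whose window has already passed, so a far-future end time pins its bucket in memory indefinitely.

```go
package main

import (
	"fmt"
	"time"
)

// bucketSize is an illustrative bucket width; the real Concentrator's
// bucket duration may differ.
const bucketSize = 10 * time.Second

// concentrator is a toy model: spans are grouped into buckets keyed by
// their end timestamp, and only buckets whose window has passed get flushed.
type concentrator struct {
	buckets map[int64]int // bucket start (ns since epoch) -> span count
}

// add places a span into the bucket containing its end timestamp.
// There is no upper bound, so a far-future end time creates a bucket
// that flush will not touch for decades.
func (c *concentrator) add(spanEndNanos int64) {
	key := spanEndNanos - spanEndNanos%int64(bucketSize)
	c.buckets[key]++
}

// flush removes only buckets that are already in the past.
func (c *concentrator) flush(now time.Time) {
	cutoff := now.UnixNano() - int64(bucketSize)
	for key := range c.buckets {
		if key < cutoff {
			delete(c.buckets, key)
		}
	}
}

func main() {
	c := &concentrator{buckets: map[int64]int{}}
	now := time.Now()

	c.add(now.UnixNano())                    // normal span: flushed once its bucket has passed
	c.add(now.AddDate(178, 0, 0).UnixNano()) // span "ending" around the year 2200: never flushed

	c.flush(now.Add(2 * bucketSize))                    // pretend two bucket windows have passed
	fmt.Println("buckets still held:", len(c.buckets)) // prints 1: the far-future bucket remains
}
```

Because spans arrive with the producer's own timestamps, a single misbehaving producer can keep creating such buckets and memory grows steadily, which would match the pprof profile above.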

So I looked more into the fileexporter suggestion you gave. And indeed, there are many odd spans from the ingress-nginx controller with startTimeUnixNano values over 7278037323140354560. That's the year 2200.

Currently, it seems this is not a Datadog issue, though it would be nice if the addNow function logged a warning when a span's timestamp is far in the future.

I'll keep the issue open until I verify that the timestamp is indeed the actual problem.

dineshg13 (Member) commented

Thanks @indrekj. Please keep us posted; we'd be happy to help debug this further.

indrekj (Contributor, Author) commented Nov 3, 2022

I think I figured it out: #16062

indrekj added a commit to indrekj/opentelemetry-collector-contrib that referenced this issue Nov 3, 2022
I think this issue was introduced in d31a5c3.

I [noticed a memory leak][1] in the DataDog exporter, but after a lot of
debugging it turned out that the OpenTelemetry gateway was receiving
invalid timestamps:

```
ScopeSpans #0
ScopeSpans SchemaURL:
InstrumentationScope
Span #0
    Trace ID       : 0000000000000000c20a2b82c179228a
    Parent ID      :
    ID             : 7c3415ed370f1777
    Name           : [redacted]
    Kind           : SPAN_KIND_SERVER
    Start time     : 2200-11-16 22:32:41.14035456 +0000 UTC
    End time       : 2200-11-16 22:33:02.68735456 +0000 UTC
```

The year is almost 200 years in the future.

Before we send a span to the gateway, it goes through an OpenTelemetry
collector sidecar. We use Zipkin V1 endpoint to send the spans to the
sidecar.

In the same pod, we're also using Zipkin V2 which did not have any
issues.

The problem itself: Zipkin uses microseconds. These have to be multiplied
by 1e3 to get nanoseconds, not by 1e6.

[1]: open-telemetry#15720
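
For illustration, a minimal sketch of the unit conversion involved (not the receiver's actual code; the uint64 nanosecond representation mirrors how OTLP span timestamps are stored): converting Zipkin's microseconds to nanoseconds means multiplying by 1e3, while multiplying by 1e6 inflates the value by a factor of 1000 and, with unsigned 64-bit arithmetic, wraps around into a far-future timestamp like the year-2200 values shown above.

```go
package main

import (
	"fmt"
	"time"
)

// microsToNanos converts a Zipkin v1 timestamp (microseconds since the
// Unix epoch) into the uint64 nanosecond form used for OTLP span
// timestamps. Microseconds -> nanoseconds is a factor of 1e3.
func microsToNanos(micros int64) uint64 {
	return uint64(micros) * 1e3
}

func main() {
	// Hypothetical Zipkin v1 timestamp: 2022-11-01 12:00:00 UTC, in microseconds.
	micros := time.Date(2022, 11, 1, 12, 0, 0, 0, time.UTC).UnixMicro()

	correct := microsToNanos(micros) // *1e3: stays in 2022
	wrong := uint64(micros) * 1e6    // the bug: 1000x too large; wraps around uint64 into the far future

	fmt.Println("correct:", time.Unix(0, int64(correct)).UTC())
	fmt.Println("wrong:  ", time.Unix(0, int64(wrong)).UTC())
}
```

Running this prints a correct 2022 timestamp for the first conversion and a date late in the twenty-second century for the second, the same kind of far-future timestamp observed in the span dump above.
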
bogdandrutu pushed a commit that referenced this issue Nov 7, 2022 (#16062)

mx-psi (Member) commented Nov 8, 2022

Thanks for the PR and the investigation @indrekj 🙇 Is there anything else to be done on this issue? I guess

it would be nice if the addNow function logged a warning when a span's timestamp is far in the future.

is something we could do on the exporter; I will defer to @gbbr's opinion on how we should handle it.
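
As a rough illustration of what such a warning could look like on the exporter side (hypothetical code, not part of the datadogexporter; the threshold is an arbitrary assumption):

```go
package main

import (
	"log"
	"time"
)

// maxFutureSkew is an arbitrary illustrative threshold; a real check
// would need to tolerate ordinary clock drift between hosts.
const maxFutureSkew = 10 * time.Minute

// warnIfFarFuture is a hypothetical guard that could run before a span
// is handed to the stats concentrator: it logs when a span's end time
// is implausibly far in the future, which would otherwise keep a stats
// bucket in memory until that time arrives.
func warnIfFarFuture(spanName string, endNanos uint64) {
	end := time.Unix(0, int64(endNanos))
	if end.After(time.Now().Add(maxFutureSkew)) {
		log.Printf("span %q has end time %s, far in the future; check the producer's timestamp units", spanName, end.UTC())
	}
}

func main() {
	// A span "ending" in the year 2200, like the ones observed in this issue.
	farFuture := uint64(time.Date(2200, 11, 16, 22, 33, 2, 0, time.UTC).UnixNano())
	warnIfFarFuture("example.span", farFuture)
}
```

Whether such spans should be dropped, clamped, or only logged is exactly the kind of decision deferred to the code owners above.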

dineshg13 pushed a commit to DataDog/opentelemetry-collector-contrib that referenced this issue Nov 21, 2022 (open-telemetry#16062)

shalper2 pushed a commit to shalper2/opentelemetry-collector-contrib that referenced this issue Dec 6, 2022 (open-telemetry#16062)

github-actions bot commented Jan 9, 2023

This issue has been inactive for 60 days. It will be closed in 60 days if there is no activity. To ping code owners by adding a component label, see Adding Labels via Comments, or if you are unsure of which component this issue relates to, please ping @open-telemetry/collector-contrib-triagers. If this issue is still relevant, please ping the code owners or leave a comment explaining why it is still relevant. Otherwise, please close it.

Pinging code owners:

See Adding Labels via Comments if you do not have permissions to add labels yourself.

github-actions bot added the Stale label Jan 9, 2023
cforce commented Jan 25, 2023

Is this leak fixed?

mx-psi (Member) commented Jan 25, 2023

Is this leak fixed?

Yes, this issue should be closed, since the underlying issue in the Zipkin receiver was fixed. If you are experiencing something similar, please open a new issue.
