Servicegraphconnector cleanup leads to "failed to find dimensions for key" errors #31701

Closed
samwright opened this issue Mar 12, 2024 · 4 comments
Labels: bug, connector/servicegraph

Comments

@samwright
Contributor

Component(s)

connector/servicegraph

What happened?

Description

I'm seeing about 10 log lines per hour like "failed to find dimensions for key" when using servicegraph. Looking around the code a bit, I think I found an issue that might be the cause:

The cleanCache function cleans up series that haven't been used in 15 minutes. It does this by:

  1. Deleting it from p.keyToMetric (which holds the dimensions for the metric), then
  2. Deleting it from all the metric series maps, e.g. p.reqTotal

In parallel, the metrics are collected by:

  1. Looping through all items in the metric series maps, e.g. p.reqTotal
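  2. For each, getting the metric's dimensions from p.keyToMetric

Because cleanup deletes the dimensions before the series, while collection reads the series before the dimensions, a collection that runs between the two cleanup steps can find a series whose dimensions are already gone. The collector function then errors out when getting the metric's dimensions.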
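The interleaving can be reproduced deterministically in a minimal sketch. Everything here is a simplified assumption for illustration: seriesStore, record, dropDimensions, dropSeries and collect are hypothetical names, not the connector's actual code. Only the two map names (keyToMetric, reqTotal) and the error text come from the report above. The race is simulated by calling the steps in a fixed order on one goroutine.

```go
package main

import "fmt"

// seriesStore is a hypothetical, simplified stand-in for the connector's state.
type seriesStore struct {
	keyToMetric map[string]map[string]string // series key -> dimensions
	reqTotal    map[string]int64             // series key -> request count
}

func newSeriesStore() *seriesStore {
	return &seriesStore{
		keyToMetric: map[string]map[string]string{},
		reqTotal:    map[string]int64{},
	}
}

// record registers one request for a series and stores its dimensions.
func (s *seriesStore) record(key string, dims map[string]string) {
	s.keyToMetric[key] = dims
	s.reqTotal[key]++
}

// dropDimensions models cleanup step 1: remove the dimensions.
func (s *seriesStore) dropDimensions(key string) { delete(s.keyToMetric, key) }

// dropSeries models cleanup step 2: remove the metric series.
func (s *seriesStore) dropSeries(key string) { delete(s.reqTotal, key) }

// collect walks the series map first and looks up dimensions second,
// returning on the first series whose dimensions are missing.
func (s *seriesStore) collect() (int, error) {
	collected := 0
	for key := range s.reqTotal {
		if _, ok := s.keyToMetric[key]; !ok {
			return collected, fmt.Errorf("failed to find dimensions for key %s", key)
		}
		collected++
	}
	return collected, nil
}

func main() {
	s := newSeriesStore()
	key := "client|server"
	s.record(key, map[string]string{"client": "client", "server": "server"})

	s.dropDimensions(key) // cleanup step 1
	_, err := s.collect() // a collection lands between the two cleanup steps
	fmt.Println("mid-cleanup:", err)

	s.dropSeries(key) // cleanup step 2
	_, err = s.collect()
	fmt.Println("after cleanup:", err)
}
```

The error only appears in the window between the two cleanup steps, which would be consistent with it showing up a handful of times per hour rather than on every flush.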

I can see a few ways around this:

  1. Reverse the order of the operations in the cleanup function, so the series are deleted before their dimensions
  2. Make the collection functions skip series whose dimensions have already been cleaned up
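Both approaches can be sketched against a simplified model. This is an illustration under assumptions, not the actual patch: store, newStore, cleanSeriesFirst, collectStrict and collectSkipping are hypothetical names, and the `between` hook exists only to simulate a collection running in the middle of a cleanup.

```go
package main

import "fmt"

// store is a hypothetical, simplified stand-in for the connector's state.
type store struct {
	keyToMetric map[string]map[string]string // series key -> dimensions
	reqTotal    map[string]int64             // series key -> request count
}

// newStore creates a store with one series (and its dimensions) per key.
func newStore(keys ...string) *store {
	s := &store{
		keyToMetric: map[string]map[string]string{},
		reqTotal:    map[string]int64{},
	}
	for _, k := range keys {
		s.keyToMetric[k] = map[string]string{"key": k}
		s.reqTotal[k] = 1
	}
	return s
}

// cleanSeriesFirst sketches approach 1: delete the series before its
// dimensions. The between hook simulates a concurrent collection.
func (s *store) cleanSeriesFirst(key string, between func()) {
	delete(s.reqTotal, key)
	if between != nil {
		between()
	}
	delete(s.keyToMetric, key)
}

// collectStrict fails on the first series without dimensions.
func (s *store) collectStrict() (int, error) {
	collected := 0
	for key := range s.reqTotal {
		if _, ok := s.keyToMetric[key]; !ok {
			return collected, fmt.Errorf("failed to find dimensions for key %s", key)
		}
		collected++
	}
	return collected, nil
}

// collectSkipping sketches approach 2: skip series whose dimensions are
// already gone and keep collecting the rest.
func (s *store) collectSkipping() (collected, skipped int) {
	for key := range s.reqTotal {
		if _, ok := s.keyToMetric[key]; !ok {
			skipped++
			continue
		}
		collected++
	}
	return collected, skipped
}

func main() {
	// Approach 1: a strict collection in the middle of cleanup no longer fails.
	a := newStore("a|b", "c|d")
	var midErr error
	a.cleanSeriesFirst("a|b", func() { _, midErr = a.collectStrict() })
	fmt.Println("approach 1, error mid-cleanup:", midErr)

	// Approach 2: dimensions are removed first, but collection tolerates it.
	b := newStore("a|b", "c|d")
	delete(b.keyToMetric, "a|b")
	collected, skipped := b.collectSkipping()
	fmt.Println("approach 2, collected:", collected, "skipped:", skipped)
}
```

With approach 1, a half-cleaned series is never visible to the collector, at the cost of dimensions briefly outliving their series. With approach 2, collection tolerates a half-cleaned series, so the remaining series are still exported instead of being dropped by an early return.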

Steps to Reproduce

Kinda tricky, since it's a race condition...

Expected Result

No errors logged, all metrics collected and exported.

Actual Result

Errors logged like:

failed to build metrics: failed to find dimensions for key ...

and presumably some metrics not being exported (since the collection function will return when it encounters this error, skipping subsequent metric series).

Collector version

v0.96.0

Environment information

Environment

Using the opentelemetry-collector-contrib image.

OpenTelemetry Collector configuration

processors:
  groupbytrace:
    wait_duration: 30s

connectors:
  servicegraph:
    store:
      ttl: 0s
      max_items: 10000
    metrics_flush_interval: 30s
    dimensions:
      - namespace
      - app
      - team
      - http.route
      - http.method

service:
  pipelines:
    traces/generate_servicegraph:
      receivers:
        - otlp/by-traceid
      processors:
        - memory_limiter
        - groupbytrace
        - transform/copy_span_name_to_attribute
      exporters:
        - servicegraph

    metrics/from_servicegraph:
      receivers:
        - servicegraph
      processors:
        - batch
      exporters:
        - otlphttp/mimir

Log output

[email protected]/connector.go:195	failed to flush metrics	{"kind": "connector", "name": "servicegraph", "exporter_in_pipeline": "traces", "receiver_in_pipeline": "metrics", "error": "failed to build metrics: failed to find dimensions for key <client_name>\u0000<server_name>\u0000"}

Additional context

No response

@samwright added the bug and needs triage labels on Mar 12, 2024
@samwright
Contributor Author

I've had a quick go at approach 1 here: #31700


@crobert-1
Member

Removing needs triage, PR posted has been approved by code owner.

Thanks for your help here @samwright!

@samwright
Contributor Author

Merged and released 👍
