[processor/spanmetrics] Fix concurrency bug causing premature key eviction #11149

Merged: 14 commits merged into open-telemetry:main from 9018-fix-concurrency-bug on Jul 12, 2022

Conversation

@albertteoh (Contributor) commented Jun 18, 2022

Description: The processor creates and caches keys used to look up the metrics dimensions; it then attempts to fetch that key, expecting it to be present in the cache. This occasionally fails with the error: value not found in metricKeyToDimensions cache by key.

The error occurs because the locks do not cover the entire flow above, leading to race conditions in which the cache key is evicted before it is accessed. Moreover, the key is cached with ContainsOrAdd, which does not update the key's recent-ness if it already exists, exacerbating the problem and increasing the frequency of these occurrences. From the ContainsOrAdd documentation:

```go
// ContainsOrAdd checks if a key is in the cache without updating the
// recent-ness or deleting it for being stale, and if not, adds the value.
// Returns whether found and whether an eviction occurred.
```

This results in two problems:

  1. A large volume of error logs.
  2. The more serious problem where the trace is not propagated down the trace pipeline, leading to lost spans.

To solve the above problems, there are three key themes of change in this PR (a brief sketch of the resulting caching pattern follows the list):

  1. Change the single ContainsOrAdd call into two separate calls:
    1. Get to check if the key exists; if it does, its recent-ness is updated.
    2. If the key does not exist, Add the key to the cache.
  2. Simplify the locking model:
    1. Remove locks from the internal cache.
    2. Isolate all locking/unlocking to a single function and only do it once.
  3. Avoid breaking, or slowing down, the flow of trace data:
    1. Execute the trace-to-metrics aggregation and metrics emission within a goroutine.
    2. Log errors instead of returning them upstream.
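
As a rough sketch of themes 1 and 2, the lookup and insertion can run as a single critical section over the same LRU cache. The type and field names below are simplified illustrations using hashicorp/golang-lru directly, not the exact code in this PR:

```go
package main

import (
	"fmt"
	"sync"

	lru "github.com/hashicorp/golang-lru"
)

// cache is a simplified stand-in for the processor's metricKeyToDimensions state.
type cache struct {
	lock            sync.Mutex
	keyToDimensions *lru.Cache
}

// getOrAdd replaces the previous ContainsOrAdd call: Get refreshes the key's
// recent-ness when it is already cached, and Add inserts it otherwise. Both
// steps run under one lock, so the key cannot be evicted between them.
func (c *cache) getOrAdd(key string, dims map[string]string) {
	c.lock.Lock()
	defer c.lock.Unlock()
	if _, ok := c.keyToDimensions.Get(key); !ok {
		c.keyToDimensions.Add(key, dims)
	}
}

func main() {
	inner, err := lru.New(2) // deliberately tiny cache to make eviction visible
	if err != nil {
		panic(err)
	}
	c := &cache{keyToDimensions: inner}
	c.getOrAdd("service-a", map[string]string{"span.kind": "SPAN_KIND_SERVER"})
	c.getOrAdd("service-b", map[string]string{"span.kind": "SPAN_KIND_SERVER"})
	c.getOrAdd("service-a", nil) // refreshes recent-ness; ContainsOrAdd would not
	c.getOrAdd("service-c", nil) // evicts service-b rather than service-a
	fmt.Println(c.keyToDimensions.Keys())
}
```

With ContainsOrAdd, the third call would leave "service-a" as the least recently used entry, so adding "service-c" would evict it even though it had just been seen.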

How this bug was reproduced (a sketch of the test shape follows these steps):

  • Place a 1 second sleep in aggregateMetrics
  • Write a test that spins up 2 goroutines:
    • The first goroutine that sends a trace with 2 spans with the following names: "service-a" and "service-b"
    • The second goroutine that sends a trace with 2 spans with the following names: "service-a" and "service-c"
  • This causes "service-a"'s key to be evicted by the introduction of "service-c", resulting in the error:
    • value not found in metricKeyToDimensions cache by key
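
For reference, a test with that shape might look like the sketch below; newTestProcessor and the trace-building helper are hypothetical stand-ins for the processor's actual test setup, so treat this as the shape of the repro rather than the test added in this PR:

```go
package spanmetricsprocessor

import (
	"context"
	"sync"
	"testing"

	"github.com/stretchr/testify/assert"
	"go.opentelemetry.io/collector/pdata/ptrace"
)

// buildTrace creates a trace containing one span per given span name.
func buildTrace(spanNames ...string) ptrace.Traces {
	td := ptrace.NewTraces()
	spans := td.ResourceSpans().AppendEmpty().ScopeSpans().AppendEmpty().Spans()
	for _, name := range spanNames {
		spans.AppendEmpty().SetName(name)
	}
	return td
}

func TestConcurrentCacheKeyEviction(t *testing.T) {
	// Hypothetical helper: constructs the processor with a small dimensions
	// cache and the artificial sleep in aggregateMetrics described above.
	p := newTestProcessor(t)

	var wg sync.WaitGroup
	wg.Add(2)
	go func() {
		defer wg.Done()
		assert.NoError(t, p.ConsumeTraces(context.Background(), buildTrace("service-a", "service-b")))
	}()
	go func() {
		defer wg.Done()
		assert.NoError(t, p.ConsumeTraces(context.Background(), buildTrace("service-a", "service-c")))
	}()
	wg.Wait()
}
```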

Link to tracking Issue: #9018

Testing:

  • Update unit tests to pass.
  • Build the OTel contrib docker image and run the entire Jaeger SPM stack locally to confirm metrics are appearing in Jaeger UI's Monitor page.

[Screenshot: Jaeger UI Monitor page showing span metrics, 2022-06-18]

Documentation: No new documentation required for this.

@albertteoh albertteoh requested a review from a team as a code owner June 18, 2022 10:46
@albertteoh (Contributor, Author)

All build checks have been addressed and this is ready for review.

@albertteoh (Contributor, Author)

@open-telemetry/collector-contrib-approvers PTAL

@greut left a comment

would it make sense to go with a sync.Map rather than manual mutex management?

@albertteoh (Contributor, Author)

would it make sense to go with a sync.Map rather than manual mutex management?

Thanks for the suggestion, @greut 😄

Which map instance were you thinking of replacing with a sync.Map?

The primary motivation for this PR is to ensure serial execution when aggregating metrics and then building the final set of metrics to emit, which depends on these keys being present in the cache. This guarantee is needed because keys could otherwise be evicted by races, even though the underlying LRU cache is thread-safe, and even if we were to use sync.Map for all the maps.

@greut commented Jun 26, 2022

@albertteoh sorry. I'm realizing that I read it way too quickly and see now that the sync part is removed around the evicted keys.

@albertteoh albertteoh changed the title Fix concurrency bug causing premature key eviction [processor/spanmetrics] Fix concurrency bug causing premature key eviction Jul 3, 2022

@TylerHelmuth (Member) left a comment

Small nit, otherwise LGTM

```go
if err != nil {
	p.logger.Error(err.Error())
} else if err = p.metricsExporter.ConsumeMetrics(ctx, *m); err != nil {
	// Export metrics first before forwarding trace to avoid being impacted by downstream trace processor errors/latency.
```

@TylerHelmuth (Member)

Is this comment still accurate? Aren't traces being forwarded continuously now that this is in a goroutine?

@albertteoh (Contributor, Author)

Thanks, good pickup, @TylerHelmuth. Addressed in this PR: #12427

@TylerHelmuth (Member)

@codeboten @dmitryax please review.

@amoscatelli

Any luck in merging this?
Rebooting the otel collector every day is really a problem...

@dmitryax dmitryax merged commit 2d59782 into open-telemetry:main Jul 12, 2022

@amoscatelli

Thank you!
Hoping v0.56 is out soon.

@albertteoh albertteoh deleted the 9018-fix-concurrency-bug branch July 14, 2022 12:33
atoulme pushed a commit to atoulme/opentelemetry-collector-contrib that referenced this pull request on Jul 16, 2022: …ction (open-telemetry#11149)

* Fix premature key eviction
* Cleanup
* Cleanup
* Add changelog entry
* Fix flaky test
* Fix lint error: spelling
* Fix flaky test
* Fix lint spelling error: behaviour -> behavior
* Prefer simpler Mutex
* Remove incorrectly added changelog entry
* Add changelog file as per PR creation instructions

Signed-off-by: albertteoh <[email protected]>

@amoscatelli

spanmetricsprocessor seems to be broken in 0.56; maybe this fix is related?

@dixanms commented Jul 25, 2022

Before v0.56.0, our otel collectors configured with the spanmetrics processor on the traces pipeline stopped exporting traces AND stopped exporting span metrics after encountering the error

value not found in metricKeyToDimensions cache

After upgrading to v0.56.0, when that error occurs, traces seem to continue exporting correctly but it stops exporting span metrics.
Here is a piece of the otel collector log:

```
Jul 25 19:49:36 laas-0afba94e-fbf8-4a13-b906-948f4773e1a1.platform.comcast.net otelcol-contrib[31779]: 2022-07-25T19:49:36.747Z        error        [email protected]/processor.go:243        value not found in metricKeyToDimensions cache by key "analytics-prod-amw2-g2\x00HTTP POST\x00SPAN_KIND_CLIENT\x00STATUS_CODE_ERROR\x00POST"        {"kind": "processor", "name": "spanmetrics", "pipeline": "traces"}
Jul 25 19:49:36 laas-0afba94e-fbf8-4a13-b906-948f4773e1a1.platform.comcast.net otelcol-contrib[31779]: github.com/open-telemetry/opentelemetry-collector-contrib/processor/spanmetricsprocessor.(*processorImp).ConsumeTraces.func1
Jul 25 19:49:36 laas-0afba94e-fbf8-4a13-b906-948f4773e1a1.platform.comcast.net otelcol-contrib[31779]: github.com/open-telemetry/opentelemetry-collector-contrib/processor/[email protected]/processor.go:243
```

@albertteoh (Contributor, Author) commented Jul 26, 2022

Thanks for reporting the bug @dixanms.

traces seem to continue exporting correctly but it stops exporting span metrics.

Yes, that makes sense because metrics and spans are processed in separate threads in the new design. I would not expect this to happen for every span, but just for some spans (let me know if this is not the case for you), which I explain below...

I suspect this is caused by the metric keys being evicted from the LRU cache, particularly when you have a large number of heterogeneous spans in a batch of traces, more than the cache can hold.

You could try increasing the cache size in config as a temporary workaround. Example:
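
(The snippet below is a representative sketch; the value is arbitrary and the rest of the processor's settings are elided.)

```yaml
processors:
  spanmetrics:
    # ... other spanmetrics settings elided ...
    dimensions_cache_size: 5000
```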

A proper fix would be to loop over the cache's keys, which are bounded by the size limit imposed by the above config, rather than over the metric data keys, which form an unbounded map.

I'll do a bit of testing to prove this theory before putting together a fix.

@amoscatelli

dimensions_cache_size

We still have the issue.
The workaround, as you suggested, was to greatly increase the dimensions_cache_size, hoping this is stable enough...

Please let us know about testing your theory.

Thank you.

@amoscatelli

Before v0.56.0, our otel collectors configured with the spanmetrics processor on the traces pipeline stopped exporting traces AND stopped exporting span metrics after encountering the error

value not found in metricKeyToDimensions cache

After upgrading to v0.56.0, when that error occurs, traces seem to continue exporting correctly but it stops exporting span metrics. Here is a piece of the otel collector log:

```
Jul 25 19:49:36 laas-0afba94e-fbf8-4a13-b906-948f4773e1a1.platform.comcast.net otelcol-contrib[31779]: 2022-07-25T19:49:36.747Z        error        [email protected]/processor.go:243        value not found in metricKeyToDimensions cache by key "analytics-prod-amw2-g2\x00HTTP POST\x00SPAN_KIND_CLIENT\x00STATUS_CODE_ERROR\x00POST"        {"kind": "processor", "name": "spanmetrics", "pipeline": "traces"}
Jul 25 19:49:36 laas-0afba94e-fbf8-4a13-b906-948f4773e1a1.platform.comcast.net otelcol-contrib[31779]: github.com/open-telemetry/opentelemetry-collector-contrib/processor/spanmetricsprocessor.(*processorImp).ConsumeTraces.func1
Jul 25 19:49:36 laas-0afba94e-fbf8-4a13-b906-948f4773e1a1.platform.comcast.net otelcol-contrib[31779]: github.com/open-telemetry/opentelemetry-collector-contrib/processor/[email protected]/processor.go:243
```

We have the same behaviour with 0.57.2.

That error is returned by the trace endpoint, and span metrics are no longer sent.

@amoscatelli commented Aug 5, 2022

@albertteoh Should the issue be reopened or should we create a new one?

@albertteoh (Contributor, Author)

We have the same behaviour with 0.57.2.

Thanks for informing, @amoscatelli.

0.57.2 should fix any panics caused by race conditions, but I would not expect the above "value not found in metricKeyToDimensions cache" bug to be fixed yet, if my theory is correct.

@albertteoh Should the issue be reopened or should we create a new one?

I can create the issue and work on the fix over the weekend; but feel free to create it if you like.

I'm curious, did increasing dimensions_cache_size improve the situation (i.e. fewer cases of the "value not found in metricKeyToDimensions cache" error)? This clue would support the theory that the cache has reached capacity and is evicting keys in the same trace batch.

@amoscatelli

I can confirm this only this evening.
I need more time to pass and more keys/metrics to be collected.

I'll inform you soon.

@amoscatelli

Sorry, but the EC2 machine I am running the collector on went unresponsive.

It seemed like the behaviour became more stable; maybe it went down due to a memory shortage, but I cannot be sure.

I'll try to investigate... sadly, this still makes spanmetrics unusable with additional dimensions...

@amoscatelli

I increased the allocated memory for both the docker image and the host machine... let's see if this changes something... 😢

ag-ramachandran referenced this pull request in ag-ramachandran/opentelemetry-collector-contrib on Sep 15, 2022: …ction (#11149)