[processor/spanmetrics] Fix concurrency bug causing premature key eviction #11149

Merged: 14 commits merged into open-telemetry:main from 9018-fix-concurrency-bug on Jul 12, 2022

Conversation

@albertteoh (Contributor) commented Jun 18, 2022

Description: The processor creates and caches keys used to look up the metrics dimensions; it then attempts to fetch that key, expecting it to be present in the cache. This occasionally fails with the error: value not found in metricKeyToDimensions cache by key.

The error occurs because the locks do not cover the entire flow above, leading to race conditions in which the cache key is evicted before it is accessed. Moreover, the key is cached with ContainsOrAdd, which does not update the key's recent-ness if it already exists, exacerbating the problem and increasing the frequency of these occurrences. From the ContainsOrAdd documentation:

```go
// ContainsOrAdd checks if a key is in the cache without updating the
// recent-ness or deleting it for being stale, and if not, adds the value.
// Returns whether found and whether an eviction occurred.
```

This results in two problems:

  1. A large volume of error logs.
  2. The more serious problem where the trace is not propagated down the trace pipeline, leading to lost spans.

To solve the above problems, there are three key themes of change in this PR (a brief sketch of the resulting caching pattern follows the list):

  1. Change the single ContainsOrAdd call into two separate calls:
    1. Get to check if the key exists; if it does, its recent-ness is updated.
    2. If the key does not exist, Add the key to the cache.
  2. Simplify the locking model:
    1. Remove locks from the internal cache.
    2. Isolate all locking/unlocking to a single function and only do it once.
  3. Avoid breaking, or slowing down, the flow of trace data:
    1. Execute the trace-to-metrics aggregation and metrics emission within a goroutine.
    2. Log errors instead of returning them upstream.
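
As a rough sketch of themes 1 and 2, the lookup and insertion can run as a single critical section over the same LRU cache. The type and field names below are simplified illustrations using hashicorp/golang-lru directly, not the exact code in this PR:

```go
package main

import (
	"fmt"
	"sync"

	lru "github.com/hashicorp/golang-lru"
)

// cache is a simplified stand-in for the processor's metricKeyToDimensions state.
type cache struct {
	lock            sync.Mutex
	keyToDimensions *lru.Cache
}

// getOrAdd replaces the previous ContainsOrAdd call: Get refreshes the key's
// recent-ness when it is already cached, and Add inserts it otherwise. Both
// steps run under one lock, so the key cannot be evicted between them.
func (c *cache) getOrAdd(key string, dims map[string]string) {
	c.lock.Lock()
	defer c.lock.Unlock()
	if _, ok := c.keyToDimensions.Get(key); !ok {
		c.keyToDimensions.Add(key, dims)
	}
}

func main() {
	inner, err := lru.New(2) // deliberately tiny cache to make eviction visible
	if err != nil {
		panic(err)
	}
	c := &cache{keyToDimensions: inner}
	c.getOrAdd("service-a", map[string]string{"span.kind": "SPAN_KIND_SERVER"})
	c.getOrAdd("service-b", map[string]string{"span.kind": "SPAN_KIND_SERVER"})
	c.getOrAdd("service-a", nil) // refreshes recent-ness; ContainsOrAdd would not
	c.getOrAdd("service-c", nil) // evicts service-b rather than service-a
	fmt.Println(c.keyToDimensions.Keys())
}
```

With ContainsOrAdd, the third call would leave "service-a" as the least recently used entry, so adding "service-c" would evict it even though it had just been seen.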

How this bug was reproduced (a sketch of the test shape follows these steps):

  • Place a 1 second sleep in aggregateMetrics
  • Write a test that spins up 2 goroutines:
    • The first goroutine that sends a trace with 2 spans with the following names: "service-a" and "service-b"
    • The second goroutine that sends a trace with 2 spans with the following names: "service-a" and "service-c"
  • This causes "service-a"'s key to be evicted by the introduction of "service-c", resulting in the error:
    • value not found in metricKeyToDimensions cache by key
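
For reference, a test with that shape might look like the sketch below; newTestProcessor and the trace-building helper are hypothetical stand-ins for the processor's actual test setup, so treat this as the shape of the repro rather than the test added in this PR:

```go
package spanmetricsprocessor

import (
	"context"
	"sync"
	"testing"

	"github.com/stretchr/testify/assert"
	"go.opentelemetry.io/collector/pdata/ptrace"
)

// buildTrace creates a trace containing one span per given span name.
func buildTrace(spanNames ...string) ptrace.Traces {
	td := ptrace.NewTraces()
	spans := td.ResourceSpans().AppendEmpty().ScopeSpans().AppendEmpty().Spans()
	for _, name := range spanNames {
		spans.AppendEmpty().SetName(name)
	}
	return td
}

func TestConcurrentCacheKeyEviction(t *testing.T) {
	// Hypothetical helper: constructs the processor with a small dimensions
	// cache and the artificial sleep in aggregateMetrics described above.
	p := newTestProcessor(t)

	var wg sync.WaitGroup
	wg.Add(2)
	go func() {
		defer wg.Done()
		assert.NoError(t, p.ConsumeTraces(context.Background(), buildTrace("service-a", "service-b")))
	}()
	go func() {
		defer wg.Done()
		assert.NoError(t, p.ConsumeTraces(context.Background(), buildTrace("service-a", "service-c")))
	}()
	wg.Wait()
}
```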

Link to tracking Issue: #9018

Testing:

  • Update unit tests to pass.
  • Build the OTel contrib docker image and run the entire Jaeger SPM stack locally to confirm metrics are appearing in Jaeger UI's Monitor page.

[Screenshot: Jaeger UI Monitor page showing span metrics, 2022-06-18]

Documentation: No new documentation required for this.

@albertteoh albertteoh requested a review from a team as a code owner June 18, 2022 10:46
@albertteoh (Contributor, Author)

All build checks have been addressed and this is ready for review.

@albertteoh (Contributor, Author)

@open-telemetry/collector-contrib-approvers PTAL

@greut left a comment

would it make sense to go with a sync.Map rather than manual mutex management?

@albertteoh (Contributor, Author)

would it make sense to go with a sync.Map rather than manual mutex management?

Thanks for the suggestion, @greut 😄

Which map instance were you thinking of replacing with a sync.Map?

The primary motivation for this PR is to ensure serial execution when aggregating metrics and then building the final set of metrics to emit, which depends on these keys being present in the cache. This guarantee is needed because keys could otherwise be evicted by races, even though the underlying LRU cache is thread-safe, and even if we were to use sync.Map for all the maps.

@greut commented Jun 26, 2022

@albertteoh sorry. I'm realizing that I read it way too quickly and see now that the sync part is removed around the evicted keys.

@albertteoh albertteoh changed the title Fix concurrency bug causing premature key eviction [processor/spanmetrics] Fix concurrency bug causing premature key eviction Jul 3, 2022

@TylerHelmuth (Member) left a comment

Small nit, otherwise LGTM

```go
if err != nil {
	p.logger.Error(err.Error())
} else if err = p.metricsExporter.ConsumeMetrics(ctx, *m); err != nil {
	// Export metrics first before forwarding trace to avoid being impacted by downstream trace processor errors/latency.
```

@TylerHelmuth (Member)

Is this comment still accurate? Aren't traces being forwarded continuously now that this is in a goroutine?

@albertteoh (Contributor, Author)

Thanks, good pickup, @TylerHelmuth. Addressed in this PR: #12427

@TylerHelmuth (Member)

@codeboten @dmitryax please review.

@amoscatelli

Any luck in merging this?
Rebooting the otel collector every day is really a problem...

@dmitryax dmitryax merged commit 2d59782 into open-telemetry:main Jul 12, 2022

@amoscatelli

Thank you!
Hoping v0.56 is out soon.

@albertteoh albertteoh deleted the 9018-fix-concurrency-bug branch July 14, 2022 12:33
atoulme pushed a commit to atoulme/opentelemetry-collector-contrib that referenced this pull request on Jul 16, 2022: …ction (open-telemetry#11149)

* Fix premature key eviction
* Cleanup
* Cleanup
* Add changelog entry
* Fix flaky test
* Fix lint error: spelling
* Fix flaky test
* Fix lint spelling error: behaviour -> behavior
* Prefer simpler Mutex
* Remove incorrectly added changelog entry
* Add changelog file as per PR creation instructions

Signed-off-by: albertteoh <[email protected]>

@amoscatelli

spanmetricsprocessor seems to be broken in 0.56; maybe this fix is related?

@dixanms commented Jul 25, 2022

Before v0.56.0, our otel collectors configured with the spanmetrics processor on the traces pipeline stopped exporting traces AND stopped exporting span metrics after encountering the error

value not found in metricKeyToDimensions cache

After upgrading to v0.56.0, when that error occurs, traces seem to continue exporting correctly but it stops exporting span metrics.
Here is a piece of the otel collector log:

```
Jul 25 19:49:36 laas-0afba94e-fbf8-4a13-b906-948f4773e1a1.platform.comcast.net otelcol-contrib[31779]: 2022-07-25T19:49:36.747Z        error        [email protected]/processor.go:243        value not found in metricKeyToDimensions cache by key "analytics-prod-amw2-g2\x00HTTP POST\x00SPAN_KIND_CLIENT\x00STATUS_CODE_ERROR\x00POST"        {"kind": "processor", "name": "spanmetrics", "pipeline": "traces"}
Jul 25 19:49:36 laas-0afba94e-fbf8-4a13-b906-948f4773e1a1.platform.comcast.net otelcol-contrib[31779]: github.com/open-telemetry/opentelemetry-collector-contrib/processor/spanmetricsprocessor.(*processorImp).ConsumeTraces.func1
Jul 25 19:49:36 laas-0afba94e-fbf8-4a13-b906-948f4773e1a1.platform.comcast.net otelcol-contrib[31779]: github.com/open-telemetry/opentelemetry-collector-contrib/processor/[email protected]/processor.go:243
```

@albertteoh (Contributor, Author) commented Jul 26, 2022

Thanks for reporting the bug @dixanms.

traces seem to continue exporting correctly but it stops exporting span metrics.

Yes, that makes sense because metrics and spans are processed in separate threads in the new design. I would not expect this to happen for every span, but just for some spans (let me know if this is not the case for you), which I explain below...

I suspect this is caused by the metric keys being evicted from the LRU cache, particularly when you have a large number of heterogeneous spans in a batch of traces, more than the cache can hold.

You could try increasing the cache size in config as a temporary workaround. Example:
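
(The snippet below is a representative sketch; the value is arbitrary and the rest of the processor's settings are elided.)

```yaml
processors:
  spanmetrics:
    # ... other spanmetrics settings elided ...
    dimensions_cache_size: 5000
```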

A proper fix would be to loop over the cache's keys, which are bounded by the size limit imposed by the above config, rather than over the metric data keys, which form an unbounded map.

I'll do a bit of testing to prove this theory before putting together a fix.

@amoscatelli

dimensions_cache_size

We still have the issue.
The workaround, as you suggested, was to greatly increase the dimensions_cache_size, hoping this is stable enough...

Please let us know about testing your theory.

Thank you.

@amoscatelli

Before v0.56.0, our otel collectors configured with the spanmetrics processor on the traces pipeline stopped exporting traces AND stopped exporting span metrics after encountering the error

value not found in metricKeyToDimensions cache

After upgrading to v0.56.0, when that error occurs, traces seem to continue exporting correctly but it stops exporting span metrics. Here is a piece of the otel collector log:

```
Jul 25 19:49:36 laas-0afba94e-fbf8-4a13-b906-948f4773e1a1.platform.comcast.net otelcol-contrib[31779]: 2022-07-25T19:49:36.747Z        error        [email protected]/processor.go:243        value not found in metricKeyToDimensions cache by key "analytics-prod-amw2-g2\x00HTTP POST\x00SPAN_KIND_CLIENT\x00STATUS_CODE_ERROR\x00POST"        {"kind": "processor", "name": "spanmetrics", "pipeline": "traces"}
Jul 25 19:49:36 laas-0afba94e-fbf8-4a13-b906-948f4773e1a1.platform.comcast.net otelcol-contrib[31779]: github.com/open-telemetry/opentelemetry-collector-contrib/processor/spanmetricsprocessor.(*processorImp).ConsumeTraces.func1
Jul 25 19:49:36 laas-0afba94e-fbf8-4a13-b906-948f4773e1a1.platform.comcast.net otelcol-contrib[31779]: github.com/open-telemetry/opentelemetry-collector-contrib/processor/[email protected]/processor.go:243
```

We have the same behaviour with 0.57.2.

That error is returned by the trace endpoint, and span metrics are no longer sent.

@amoscatelli commented Aug 5, 2022

@albertteoh Should the issue be reopened or should we create a new one?

@albertteoh (Contributor, Author)

We have the same behaviour with 0.57.2.

Thanks for informing, @amoscatelli.

0.57.2 should fix any panics caused by race conditions, but I would not expect the above "value not found in metricKeyToDimensions cache" bug to be fixed yet, if my theory is correct.

@albertteoh Should the issue be reopened or should we create a new one?

I can create the issue and work on the fix over the weekend; but feel free to create it if you like.

I'm curious, did increasing dimensions_cache_size improve the situation (i.e. fewer cases of the "value not found in metricKeyToDimensions cache" error)? This clue would support the theory that the cache has reached capacity and is evicting keys in the same trace batch.

@amoscatelli

I can confirm this only this evening.
I need more time to pass and more keys/metrics to be collected.

I'll inform you soon.

@amoscatelli

Sorry, but the EC2 machine I am running the collector on went unresponsive.

It seemed like the behaviour became more stable; maybe it went down due to a memory shortage, but I cannot be sure.

I'll try to investigate... sadly, this still makes spanmetrics unusable with additional dimensions...

@amoscatelli

I increased the allocated memory for both the docker image and the host machine... let's see if this changes something... 😢

ag-ramachandran referenced this pull request in ag-ramachandran/opentelemetry-collector-contrib on Sep 15, 2022: …ction (#11149)