[processor/spanmetrics] Fix concurrency bug causing premature key eviction #11149
Conversation
All build checks have been addressed and this is ready for review.
@open-telemetry/collector-contrib-approvers PTAL
Would it make sense to go with a sync.Map rather than manual mutex management?
Thanks for the suggestion, @greut 😄 Which map instance were you thinking of replacing with a sync.Map? The primary motivation for this PR is to ensure serial execution when aggregating metrics and then building the final set of metrics for emitting, which depends on the existence of these keys in the cache. The reason for needing this guarantee is that keys could be evicted due to races, even though the underlying LRU cache is thread-safe, and even if we were to use sync.Map.
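To illustrate the race described here, below is a minimal, self-contained sketch (not the processor's actual code) assuming the hashicorp/golang-lru cache; the aggregator type and consume method are invented names. Each individual cache call is thread-safe, but only the mutex makes the add-then-read sequence safe:

```go
package main

import (
	"sync"

	lru "github.com/hashicorp/golang-lru"
)

// aggregator and consume are illustrative names, not the processor's real ones.
type aggregator struct {
	lock  sync.Mutex
	cache *lru.Cache // each call is thread-safe, but sequences of calls are not
}

func (a *aggregator) consume(key string, dims map[string]string) {
	// One lock spans the whole aggregate-then-build flow, so a key cached
	// here cannot be evicted by another goroutine before it is read back.
	a.lock.Lock()
	defer a.lock.Unlock()

	a.cache.Add(key, dims)

	// ... aggregate call counts / latency buckets for key ...

	if _, ok := a.cache.Get(key); !ok {
		// Without the surrounding lock, concurrent Adds can evict the key
		// between the two cache calls above, which is what produces the
		// "value not found in metricKeyToDimensions cache by key" error.
		panic("key evicted before metrics were built")
	}
}

func main() {
	cache, _ := lru.New(2) // deliberately tiny so eviction is easy to trigger
	a := &aggregator{cache: cache}

	var wg sync.WaitGroup
	for _, k := range []string{"a", "b", "c", "d"} {
		wg.Add(1)
		go func(k string) {
			defer wg.Done()
			a.consume(k, map[string]string{"service.name": k})
		}(k)
	}
	wg.Wait()
}
```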
@albertteoh sorry. I'm realizing that I read it way too quickly and see now that the sync part is removed around the evicted keys.
Small nit, otherwise LGTM
if err != nil {
	p.logger.Error(err.Error())
} else if err = p.metricsExporter.ConsumeMetrics(ctx, *m); err != nil {
// Export metrics first before forwarding trace to avoid being impacted by downstream trace processor errors/latency.
Is this comment still accurate? Aren't traces being forwarded continuously now that this is in a goroutine?
Thanks, good pickup, @TylerHelmuth. Addressed in this PR: #12427
@codeboten @dmitryax please review.
Any luck in merging this?
Thank you!
spanmetricsprocessor seems to be broken in 0.56; maybe the fix from this issue is related?
Before v0.56.0, our otel collectors configured with the
After upgrading to v0.56.0, when that error occurs, traces seem to continue exporting correctly, but the collector stops exporting span metrics.
Thanks for reporting the bug, @dixanms.
Yes, that makes sense, because metrics and spans are processed in separate threads in the new design. I would not expect this to happen for every span, just for some spans (let me know if this is not the case for you), which I explain below. I suspect this is caused by the metric keys being evicted from the LRU cache, particularly when you have a large number of heterogeneous spans in a batch of traces, more than the cache can hold. You could try increasing the cache size in config as a temporary workaround; an example follows. For reference: opentelemetry-collector-contrib/processor/spanmetricsprocessor/testdata/config-full.yaml, line 39 at 471b2fb.
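A minimal sketch of that workaround, assuming the dimensions_cache_size option shown in config-full.yaml; the exporter name and the value 10000 are illustrative only:

```yaml
processors:
  spanmetrics:
    metrics_exporter: otlp/spanmetrics
    # Enlarge the LRU cache so a single batch of heterogeneous spans cannot
    # evict metric keys before the corresponding metrics are built and emitted.
    dimensions_cache_size: 10000
```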
A proper fix would be to loop over the cache's keys, which are bounded by the size limit imposed by the above config, rather than over the metric data keys, which are simply an unbounded map. I'll do a bit of testing to prove this theory before putting together a fix.
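A rough, self-contained sketch of that idea, assuming the hashicorp/golang-lru API; metricKeyToDimensions and callSums are illustrative stand-ins for the processor's internal structures:

```go
package main

import (
	"fmt"

	lru "github.com/hashicorp/golang-lru"
)

func main() {
	metricKeyToDimensions, _ := lru.New(2) // bounded, like dimensions_cache_size
	callSums := map[string]int{}           // unbounded aggregation map

	for _, key := range []string{"svc-a", "svc-b", "svc-c"} {
		metricKeyToDimensions.Add(key, map[string]string{"service.name": key})
		callSums[key]++
	}

	// Iterating callSums could look up "svc-a", which the size-2 cache has
	// already evicted; iterating the cache's own keys can never miss.
	for _, k := range metricKeyToDimensions.Keys() {
		dims, _ := metricKeyToDimensions.Get(k)
		fmt.Println(k, callSums[k.(string)], dims)
	}
}
```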
We still have the issue. Please let us know how testing your theory goes. Thank you.
We have the same behaviour with 0.57.2. That error is returned by the trace endpoint, and span metrics are no longer sent.
@albertteoh Should the issue be reopened, or should we create a new one?
Thanks for informing, @amoscatelli. 0.57.2 should fix any panics caused by race conditions but I would not expect the above
I can create the issue and work on the fix over the weekend, but feel free to create it if you like. I'm curious, did increasing the cache size help?
I can confirm this only this evening. I'll let you know soon.
Sorry, but the EC2 machine I am running the collector on went unresponsive. It seemed like the behaviour became more stable; maybe it went down because of a memory shortage, but I cannot be sure. I'll try to investigate... sadly, this makes spanmetrics still unusable with additional dimensions.
I increased the memory allocated to both the Docker image and the host machine... let's see if this changes anything... 😢
[processor/spanmetrics] Fix concurrency bug causing premature key eviction (#11149)

* Fix premature key eviction
* Cleanup
* Cleanup
* Add changelog entry
* Fix flaky test
* Fix lint error: spelling
* Fix flaky test
* Fix lint spelling error: behaviour -> behavior
* Prefer simpler Mutex
* Remove incorrectly added changelog entry
* Add changelog file as per PR creation instructions

Signed-off-by: albertteoh <[email protected]>
Description: The processor creates and caches keys used to look up the metric dimensions; then, expecting the key to be present in the cache, it attempts to fetch it. This occasionally results in the error: value not found in metricKeyToDimensions cache by key.

The cause of this error is that locks do not cover the entire flow above, leading to race conditions where the cache key is prematurely evicted before it is accessed. Moreover, caching the key uses ContainsOrAdd, which (per its documentation) does not update the recent-ness of the key if the key already exists, exacerbating the problem and leading to a higher frequency of such occurrences. This results in two problems.

To solve the above problems, there are three key themes of change in this PR, one of which is splitting ContainsOrAdd into two separate calls:

* Get, to check whether the key exists; if it does, its recent-ness is updated.
* Add, to add the key to the cache.

How this bug was reproduced: by exercising aggregateMetrics concurrently until the value not found in metricKeyToDimensions cache by key error is observed.
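A small sketch of the Get/Add split described above, assuming the hashicorp/golang-lru API; cacheKey is a hypothetical helper, not the processor's actual function:

```go
package main

import (
	"fmt"

	lru "github.com/hashicorp/golang-lru"
)

// cacheKey replaces a single ContainsOrAdd with Get followed by Add, so an
// existing key is marked as recently used instead of being left to age out.
func cacheKey(cache *lru.Cache, key string, dims map[string]string) {
	if _, ok := cache.Get(key); ok {
		return // key exists; Get has refreshed its recent-ness
	}
	cache.Add(key, dims)
}

func main() {
	cache, _ := lru.New(2)
	cacheKey(cache, "svc-a|GET /users", map[string]string{"service.name": "svc-a"})
	cacheKey(cache, "svc-a|GET /users", nil) // second call only refreshes recent-ness
	fmt.Println(cache.Keys())
}
```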
Link to tracking Issue: #9018
Testing:
Documentation: No new documentation required for this.