Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[exporter/loadbalacing] Refactor how metrics are split and then re-joined after load-balacing #33293

Merged
merged 17 commits into from
Jun 19, 2024

Conversation

RichieSams
Copy link
Contributor

@RichieSams RichieSams commented May 29, 2024

Description: The previous code splits an incoming pmetric.Metrics into individual pmetric.Metrics instances, at the granularity of pmetric.Metric. Then afterwards, it used the various routingID functions to create a map of booleans, in order to define how the metrics should be routed. Finally, it merged the metrics by routing key, and exported them by concatenating them all together. While this worked, it's somewhat hard to follow, and inefficient for most of the routingIDs. In a future PR, we'd like to add a new routingID, which would require splitting at the datapoint level. This would add a ton of extra work for the other routingIDs, which don't care about specific datapoints.

Therefore, the new code has dedicated splitting functions for each routingID. These functions directly return a map[string]pmetric.Metrics instance. IE, a map of routing keys to its metrics. These functions can be unit tested directly, and makes the logic in ConsumeMetrics() very easy to follow. Lastly, when combining metrics for routing, the new code utilizes the MergeMetrics() helper function from internal/exp/metrics. This merges the metrics and removes duplicate ResourceMetrics / ScopeMetrics instances. Which saves compute and bandwidth for serialization downstream.

Link to tracking Issue: 32513

Testing: I created a full suite of tests for each routingID enum. For both single endpoint, as well an multi-endpoint loadbalancing

Documentation: The code is documented in comments. I added a changelog entry to explain changes for users

@jpkrohling
Copy link
Member

@RichieSams, is this ready for review?

@RichieSams
Copy link
Contributor Author

It's ready for technical review. I just need to add a changelog and fix up the CI tests

So we can rely on the compiler to make sure we don't have typos
@RichieSams RichieSams marked this pull request as ready for review June 3, 2024 18:06
@fatsheep9146
Copy link
Contributor

please fix the conflicts

@RichieSams
Copy link
Contributor Author

please fix the conflicts

Fixed. This PR is ready for final review and merge

Copy link
Contributor

@fatsheep9146 fatsheep9146 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Still have non-resolved failed checks

--- a/exporter/loadbalancingexporter/go.mod
+++ b/exporter/loadbalancingexporter/go.mod
@@ -74,7 +74,6 @@ require (
 	github.com/imdario/mergo v0.3.6 // indirect
 	github.com/inconshreveable/mousetrap v1.1.0 // indirect
 	github.com/josharian/intern v1.0.0 // indirect
-	github.com/json-iterator/go v1.1.12 // indirect
 	github.com/klauspost/compress v1.17.8 // indirect
 	github.com/knadh/koanf/maps v0.1.1 // indirect
 	github.com/knadh/koanf/providers/confmap v0.1.0 // indirect
go.mod/go.sum deps changes detected, please run "make gotidy" and commit the changes in this PR.

@RichieSams
Copy link
Contributor Author

Tests are failing on an unrelated issue:

         	Error Trace:	/home/runner/work/opentelemetry-collector-contrib/opentelemetry-collector-contrib/receiver/kubeletstatsreceiver/e2e_test.go:83
        	            				/home/runner/work/opentelemetry-collector-contrib/opentelemetry-collector-contrib/receiver/kubeletstatsreceiver/e2e_test.go:54
        	Error:      	Condition never satisfied
        	Test:       	TestE2E
        	Messages:   	failed to receive 10 entries,  received 0 metrics in 3 minutes

All other tests are green

@fatsheep9146
Copy link
Contributor

Tests are failing on an unrelated issue:

         	Error Trace:	/home/runner/work/opentelemetry-collector-contrib/opentelemetry-collector-contrib/receiver/kubeletstatsreceiver/e2e_test.go:83
        	            				/home/runner/work/opentelemetry-collector-contrib/opentelemetry-collector-contrib/receiver/kubeletstatsreceiver/e2e_test.go:54
        	Error:      	Condition never satisfied
        	Test:       	TestE2E
        	Messages:   	failed to receive 10 entries,  received 0 metrics in 3 minutes

All other tests are green

yes, this is a known bug

@RichieSams
Copy link
Contributor Author

Woohoo! everything is green again. Are we ok to merge now?

@RichieSams
Copy link
Contributor Author

@jpkrohling Is this ok to merge?

@mx-psi
Copy link
Member

mx-psi commented Jun 18, 2024

Thanks! Since we are in the middle of a release, I will put a reminder to merge this right after this release happens (so this would be available on v0.104.0)

@mx-psi
Copy link
Member

mx-psi commented Jun 19, 2024

@RichieSams Can you fix the merge conflicts so that I can merge this? Thanks!

@RichieSams
Copy link
Contributor Author

@RichieSams Can you fix the merge conflicts so that I can merge this? Thanks!

Updated. Thanks!

@mx-psi mx-psi merged commit d6eaca8 into open-telemetry:main Jun 19, 2024
154 checks passed
@github-actions github-actions bot added this to the next release milestone Jun 19, 2024
@jpkrohling
Copy link
Member

@RichieSams, sorry I wasn't able to review this before it got merged. Would you be willing to add benchmarks, assessing that this change is at least on par with the previous version in terms of performance?

@RichieSams RichieSams deleted the loadbalancingexporter_refactor branch June 20, 2024 12:58
@RichieSams
Copy link
Contributor Author

Sure thing. Before this change:

Running tool: /usr/local/go/bin/go test -benchmem -run=^$ -bench ^(BenchmarkConsumeMetrics_1E100T|BenchmarkConsumeMetrics_1E1000T|BenchmarkConsumeMetrics_5E100T|BenchmarkConsumeMetrics_5E500T|BenchmarkConsumeMetrics_5E1000T|BenchmarkConsumeMetrics_10E100T|BenchmarkConsumeMetrics_10E500T|BenchmarkConsumeMetrics_10E1000T)$ github.com/open-telemetry/opentelemetry-collector-contrib/exporter/loadbalancingexporter

goos: linux
goarch: amd64
pkg: github.com/open-telemetry/opentelemetry-collector-contrib/exporter/loadbalancingexporter
cpu: Intel(R) Xeon(R) CPU E5-2687W v4 @ 3.00GHz
BenchmarkConsumeMetrics_1E100T-24      	    6747	    189668 ns/op	   72946 B/op	    1522 allocs/op
BenchmarkConsumeMetrics_1E1000T-24     	     633	   1715175 ns/op	  713044 B/op	   15028 allocs/op
BenchmarkConsumeMetrics_5E100T-24      	    5209	    198354 ns/op	   74243 B/op	    1557 allocs/op
BenchmarkConsumeMetrics_5E500T-24      	    1438	    762235 ns/op	  360968 B/op	    7567 allocs/op
BenchmarkConsumeMetrics_5E1000T-24     	     504	   1996786 ns/op	  719125 B/op	   15072 allocs/op
BenchmarkConsumeMetrics_10E100T-24     	    6991	    160781 ns/op	   76704 B/op	    1610 allocs/op
BenchmarkConsumeMetrics_10E500T-24     	    1448	    767864 ns/op	  362681 B/op	    7630 allocs/op
BenchmarkConsumeMetrics_10E1000T-24    	     624	   1641427 ns/op	  720533 B/op	   15140 allocs/op

After:

Running tool: /usr/local/go/bin/go test -benchmem -run=^$ -bench ^(BenchmarkConsumeMetrics_1E100T|BenchmarkConsumeMetrics_1E1000T|BenchmarkConsumeMetrics_5E100T|BenchmarkConsumeMetrics_5E500T|BenchmarkConsumeMetrics_5E1000T|BenchmarkConsumeMetrics_10E100T|BenchmarkConsumeMetrics_10E500T|BenchmarkConsumeMetrics_10E1000T)$ github.com/open-telemetry/opentelemetry-collector-contrib/exporter/loadbalancingexporter

goos: linux
goarch: amd64
pkg: github.com/open-telemetry/opentelemetry-collector-contrib/exporter/loadbalancingexporter
cpu: Intel(R) Xeon(R) CPU E5-2687W v4 @ 3.00GHz
BenchmarkConsumeMetrics_1E100T-24      	  336166	      4425 ns/op	    1476 B/op	      29 allocs/op
BenchmarkConsumeMetrics_1E1000T-24     	  257836	      4008 ns/op	    1479 B/op	      29 allocs/op
BenchmarkConsumeMetrics_5E100T-24      	   77956	     15485 ns/op	    5744 B/op	     130 allocs/op
BenchmarkConsumeMetrics_5E500T-24      	   59664	     18696 ns/op	    5759 B/op	     130 allocs/op
BenchmarkConsumeMetrics_5E1000T-24     	   71922	     15106 ns/op	    5694 B/op	     130 allocs/op
BenchmarkConsumeMetrics_10E100T-24     	   27265	     40060 ns/op	   12907 B/op	     268 allocs/op
BenchmarkConsumeMetrics_10E500T-24     	   27732	     37414 ns/op	   12892 B/op	     268 allocs/op
BenchmarkConsumeMetrics_10E1000T-24    	   40455	     29029 ns/op	   12957 B/op	     268 allocs/op

It's significantly faster and uses significantly less memory

@RichieSams
Copy link
Contributor Author

My next change updates the benchmarks to also test different combinations of ResourceMetrics, ScopeMetrics, Metrics, and DataPoints

@jpkrohling
Copy link
Member

Wonderful, thank you!

@RichieSams
Copy link
Contributor Author

#33676

cparkins pushed a commit to AmadeusITGroup/opentelemetry-collector-contrib that referenced this pull request Jul 11, 2024
…ined after load-balacing (open-telemetry#33293)

**Description:** The previous code splits an incoming `pmetric.Metrics`
into individual `pmetric.Metrics` instances, at the granularity of
`pmetric.Metric`. Then afterwards, it used the various routingID
functions to create a map of booleans, in order to define how the
metrics should be routed. Finally, it merged the metrics by routing key,
and exported them by concatenating them all together. While this worked,
it's somewhat hard to follow, and inefficient for most of the
routingIDs. In a future PR, we'd like to add a new routingID, which
would require splitting at the datapoint level. This would add a ton of
extra work for the other routingIDs, which don't care about specific
datapoints.

Therefore, the new code has dedicated splitting functions for each
routingID. These functions directly return a
`map[string]pmetric.Metrics` instance. IE, a map of routing keys to its
metrics. These functions can be unit tested directly, and makes the
logic in `ConsumeMetrics()` very easy to follow. Lastly, when combining
metrics for routing, the new code utilizes the `MergeMetrics()` helper
function from `internal/exp/metrics`. This merges the metrics and
removes duplicate ResourceMetrics / ScopeMetrics instances. Which saves
compute and bandwidth for serialization downstream.


**Link to tracking Issue:** 32513

**Testing:** I created a full suite of tests for each routingID enum.
For both single endpoint, as well an multi-endpoint loadbalancing

**Documentation:** The code is documented in comments. I added a
changelog entry to explain changes for users

---------

Co-authored-by: Pablo Baeyens <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

5 participants