Bug: Missing EXPECTED_CLUSTER_SIZE leads to massive load on brokers #221

Open
pantaoran opened this issue Sep 8, 2023 · 1 comment

@pantaoran

We observed that when EXPECTED_CLUSTER_SIZE is not set (or is explicitly set to -1), the measured produce latencies become drastically worse.
It seems that before (or during?) every request, Canary tries to micro-manage the replicas and their leaders for the canary topic on the Kafka cluster. This takes a lot of time and processing and results in extremely slow responses to the produce requests.

Average latencies as reported when EXPECTED_CLUSTER_SIZE is set correctly: 3-5ms
Average latencies as reported when EXPECTED_CLUSTER_SIZE is NOT set: 1000-2000ms

Whatever Canary does on the cluster slows everything down dramatically.
It also leads to an explosion in broker logs. With the correct setting, my otherwise idle brokers (2-broker cluster, no clients running other than Canary) logged around 8 lines per minute. When the cluster size setting is missing, they logged around 500 lines per minute (the Canary reconcile interval was 10 seconds, the default).

I don't know what Canary does in detail or why, but it feels like a bug to me.

The description in the README says that I should expect more partition reassignments of the topic while the Kafka cluster is starting up and the brokers are coming up one by one. What I actually observe is that partitions are reassigned on every reconciliation (every 10 seconds), leading to redundant work on the brokers, which causes high produce latencies and increased log volume.
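
To make the expected behavior concrete, here is a minimal Go sketch of an idempotent reconcile check. It is not Canary's actual code: the `Assignment` type and the `needsReassignment` helper are hypothetical names chosen for illustration. The idea is simply that a reconcile loop would compare the current replica assignment with the desired one and only send a reassignment request when they differ.

```go
package main

import (
	"fmt"
	"reflect"
)

// Assignment maps a partition number to its ordered list of replica broker IDs.
// Hypothetical type for illustration only; not taken from the Canary source.
type Assignment map[int32][]int32

// needsReassignment reports whether the desired assignment differs from the
// current one. Replica order is compared as well, because the first replica
// in the list is the preferred leader.
func needsReassignment(current, desired Assignment) bool {
	return !reflect.DeepEqual(current, desired)
}

func main() {
	// Example values for a 2-broker cluster; made up for this sketch.
	current := Assignment{0: {1, 2}, 1: {2, 1}}
	desired := Assignment{0: {1, 2}, 1: {2, 1}}

	if needsReassignment(current, desired) {
		fmt.Println("assignment changed: send reassignment request")
	} else {
		fmt.Println("assignment unchanged: skip reassignment")
	}
}
```

With a check like this, a steady-state cluster would see no reassignment traffic between reconciliations, which matches what the README description suggests should happen once all brokers are up.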

@mschurenko

I'm experiencing the same thing. I get the following message on the kafka controller every 10 seconds:

[2024-03-07 00:32:32,957] INFO [Controller id=2] Successfully updated assignment of partition __strimzi_canary-1 to ReplicaAssignment(replicas=2,3,1, addingReplicas=, removingReplicas=, observers=, targetObservers=None) (kafka.controller.KafkaController)
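
One way to confirm how often this happens is to count these messages per minute in the controller log. The following Go program is a rough sketch under a few assumptions: the log lines have the same format as the message quoted above, and `countPerMinute` is a helper name made up for this example. The only sample input is the quoted line itself; in practice you would feed it the real controller log.

```go
package main

import (
	"bufio"
	"fmt"
	"regexp"
	"strings"
	"time"
)

// assignmentLine matches the controller message quoted in this thread for any
// __strimzi_canary partition and captures the timestamp (without milliseconds).
var assignmentLine = regexp.MustCompile(
	`^\[(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}),\d{3}\] INFO \[Controller id=\d+\] ` +
		`Successfully updated assignment of partition __strimzi_canary-\d+ `)

// countPerMinute returns, keyed by minute, how many canary reassignment
// messages appear in the given log text. Non-matching lines are ignored.
func countPerMinute(log string) (map[string]int, error) {
	counts := make(map[string]int)
	sc := bufio.NewScanner(strings.NewReader(log))
	for sc.Scan() {
		m := assignmentLine.FindStringSubmatch(sc.Text())
		if m == nil {
			continue
		}
		ts, err := time.Parse("2006-01-02 15:04:05", m[1])
		if err != nil {
			return nil, err
		}
		counts[ts.Format("2006-01-02 15:04")]++
	}
	return counts, sc.Err()
}

func main() {
	// The single controller line quoted above; replace with real log content.
	sample := "[2024-03-07 00:32:32,957] INFO [Controller id=2] Successfully updated assignment of partition __strimzi_canary-1 to " +
		"ReplicaAssignment(replicas=2,3,1, addingReplicas=, removingReplicas=, observers=, targetObservers=None) (kafka.controller.KafkaController)\n"

	counts, err := countPerMinute(sample)
	if err != nil {
		panic(err)
	}
	for minute, n := range counts {
		fmt.Printf("%s: %d canary reassignment message(s)\n", minute, n)
	}
}
```

If reassignment really runs on every 10-second reconcile, each minute should show several of these messages per canary partition, even on a cluster whose broker set has not changed.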
