Better load balancing of Envoys across Pilot instances #11181
The solution needs to consider:
Following a Slack conversation with @Stono, we suggest the following (a rough sketch is shown below).
The above can be run periodically or on an event (a change in the pilot count). Feedback welcome!
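For concreteness, here is a minimal Go sketch of the periodic rebalancing idea. All of the names here (conn, rebalance, totalEnvoys, replicas) are illustrative, not real Istio APIs; a real implementation would read the inputs from metrics and the pilot Deployment.

```go
package main

import "fmt"

// conn stands in for one ADS stream; Close sends a GoAway so the sidecar
// reconnects through the Service and may land on a less-loaded pilot.
type conn struct{ id int }

func (c conn) Close() { fmt.Println("draining envoy", c.id) }

// rebalance sheds connections beyond this replica's fair share.
func rebalance(totalEnvoys, replicas int, local []conn) {
	fair := totalEnvoys/replicas + 1 // allow a little slack above the mean
	for i := fair; i < len(local); i++ {
		local[i].Close()
	}
}

func main() {
	local := make([]conn, 10)
	for i := range local {
		local[i] = conn{id: i}
	}
	// Run this periodically, or on an event (a change in the pilot count).
	rebalance(12, 3, local) // fair share is 5, so envoys 5..9 are drained
}
```

Draining only the excess, rather than all connections, keeps churn bounded while still converging toward an even split.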
@elevran
After a few minutes, all sidecars are still connected to the old pilot instance.
Am I missing anything? Is there a recommended value for this parameter?
Seeing as this is enabled by default in 1.1.0 at 30 minutes, I'd really like to understand:
I'm half inclined to edit the injector config and remove it, because the potential for increased churn worries me. Also, can anyone confirm if
I think it has, but not by much: at most MaxServerConnectionAgeGrace, which defaults to 10s. IMO, this option should be disabled by default.
grpc-lb is now considered deprecated.
@morvencao what was the maximum age configured? Rebalancing won't happen before the maximum age expires, and the default (if unspecified) is infinity.
@Stono according to grpc-go/keepalive.go there is a +/-10% jitter on the configured value to avoid connection storms, so a 30 minute maximum age will spread reconnects over a 6 minute window:
// MaxConnectionAge is a duration for the maximum amount of time a
// connection may exist before it will be closed by sending a GoAway. A
// random jitter of +/-10% will be added to MaxConnectionAge to spread out
// connection storms.
// The current default value is infinity.
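For reference, this is roughly how those knobs are set on a gRPC server in Go. A minimal sketch: the 30m/10s values mirror the defaults discussed above, and port 15010 is assumed as pilot's plaintext xDS port.

```go
package main

import (
	"net"
	"time"

	"google.golang.org/grpc"
	"google.golang.org/grpc/keepalive"
)

func main() {
	// gRPC adds +/-10% jitter to MaxConnectionAge, so a 30m age spreads
	// reconnects over a ~6 minute window. MaxConnectionAgeGrace bounds how
	// long an expired connection may linger before it is force-closed.
	srv := grpc.NewServer(grpc.KeepaliveParams(keepalive.ServerParameters{
		MaxConnectionAge:      30 * time.Minute,
		MaxConnectionAgeGrace: 10 * time.Second,
	}))
	lis, err := net.Listen("tcp", ":15010") // assumed port
	if err != nil {
		panic(err)
	}
	_ = srv.Serve(lis)
}
```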
@elevran
Is session affinity set for the istio-pilot service?
@hzxuzhonghu No
This issue has been automatically marked as stale because it has not had activity in the last 90 days. It will be closed in the next 30 days unless it is tagged "help wanted" or other activity occurs. Thank you for your contributions.
I've tested with 3 pilots and tens of pods in total, and
Another related problem: if you are on the border of needing 1 or 2 pilots, you get really bad behavior where the deployment keeps flipping between 1 and 2 replicas, and the pilot takes 30 minutes to fully shed its load.
This is pretty broken now, even with the max connection age.
Setting max_requests_per_connection on the Envoy (client) side seems to fix this in case (2); see the sketch below.
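For illustration, a sketch of what that knob looks like when building the xDS cluster with go-control-plane. The cluster name and the value of 1 are assumptions, and in a real deployment the setting lives in Envoy's bootstrap config rather than in Go code.

```go
package main

import (
	"fmt"

	cluster "github.com/envoyproxy/go-control-plane/envoy/config/cluster/v3"
	"google.golang.org/protobuf/types/known/wrapperspb"
)

func main() {
	// Each gRPC stream counts as one request, so capping requests per
	// connection forces Envoy to open a fresh connection periodically,
	// re-resolving through the Service and possibly landing on a
	// different pilot replica.
	c := &cluster.Cluster{
		Name:                     "xds-grpc", // assumed name
		MaxRequestsPerConnection: wrapperspb.UInt32(1),
	}
	fmt.Println("max requests per connection:", c.MaxRequestsPerConnection.GetValue())
}
```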
One question: what does one request mean? Does one xDS request count as one? Will the connection break whenever a new xDS request comes in?
A request is one gRPC stream; it's an HTTP-level setting, not an xDS one.
Got it.
* Fix load balancing of pilot connections Context: #11181 (comment) * fix repetitive code * Update goldens Co-authored-by: John Howard <[email protected]>
Just to sync up: it seems to work for me:
And hours later:
And no restarts.
istioctl version
@howardjohn I can see the default keepaliveMaxServerConnectionAge value is 30m in the istiod deployment. I have a couple of queries regarding this; can you please check the below? Thank you.
https://istio.io/latest/docs/reference/commands/pilot-discovery/
Describe the feature request
This is a continuation of #7878.
Envoys maintain long-lived connections to Pilot. In HA scenarios, instances become ready at different times, so earlier instances receive a disproportionate number of connections. This imbalance is exacerbated during rolling upgrades.
The request is to create a more balanced split of connections between Envoys and Pilot instances, one with some intelligence to balance load among all Pilot replicas.
Describe alternatives you've considered
#10838, #10870 and #11126 provide a short-term fix by capping the maximum connection age, allowing load to be rebalanced over time. This solution is not ideal:
Additional context
grpc-lb has been suggested as an alternative solution. Client-side LB distributes the balancing logic to all Envoys; in addition, it may not solve this problem, since it relies on name resolution, which would replicate the imbalance as instances come and go (see the discussion here and the sketch below).
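A minimal sketch of what that client-side approach looks like in gRPC-Go, assuming the pilot service resolves via DNS (the address below is illustrative):

```go
package main

import "google.golang.org/grpc"

func main() {
	// The "dns:///" scheme selects the DNS resolver; round_robin spreads
	// new RPCs over all resolved addresses. The caveat from the discussion
	// applies: the address list is only as fresh as name resolution, and
	// long-lived ADS streams stay pinned to the backend they were opened on.
	conn, err := grpc.Dial(
		"dns:///istio-pilot.istio-system.svc.cluster.local:15010", // assumed address
		grpc.WithInsecure(),
		grpc.WithDefaultServiceConfig(`{"loadBalancingConfig": [{"round_robin":{}}]}`),
	)
	if err != nil {
		panic(err)
	}
	defer conn.Close()
}
```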
It may be preferable to encapsulate the implementation server-side, entirely in Pilot.
This comment on the original issue provides some additional context.