
Further improvements to API calls #35

Open
eytan-avisror opened this issue Dec 5, 2019 · 0 comments
eytan-avisror commented Dec 5, 2019

On clusters with a massive number of ALBs/target groups (300-400 ALBs and 400-500 target groups), things start to break down: once the controller starts getting throttled heavily, other components such as alb-ingress-controller also fail to register/deregister targets.

Also, in #27 we introduced a condition so that the life of an instance cannot be extended indefinitely; a lifecycle hook is timed out after 1 hour.

When throttling gets very heavy and many instances are terminating (10+), we basically lock up the AWS API with all the calls we are making. Many instances eventually get abandoned after spending an hour being throttled, which results in 5xx errors from target groups whose targets were never deregistered.

We should consider the following improvements for very large clusters:

  • Instead of starting the deregister waiter after a jitter of 0-180s, start the waiter after a range that includes the deregistration delay. For example, if the deregistration delay is 300s (the default), we know for a fact the deregistration will take at least 300 seconds, so the jitter should be added on top of that, e.g. a range of 300-400 seconds. This means far fewer calls that we already know will report the target as still deregistering, and it reduces the load on other components such as the ALB controller during the initial deregistration window (see the first sketch after this list).

  • Make the backoff range even larger/wider than the currently set range (5s-60s), e.g. 30s-180s.

  • Make the lifecycle hook timeout configurable - for massive clusters it may be acceptable to set a higher value.

  • Have some logic that checks how many target groups exist in the account, calculates how many concurrent terminations we can handle, and queues instances that are over the limit (possibly with a higher timeout for queued instances) - see the second sketch after this list.

  • Fix Abandoned instances don't drop active goroutines #34 - in such scenarios this will make recovery much faster after an instance has been abandoned.
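
For the first bullet, here is a minimal sketch of what delay-aware jitter could look like. This is not the controller's actual code; the function name, the 100s jitter window, and the hard-coded default are assumptions for illustration only.

```go
package main

import (
	"fmt"
	"math/rand"
	"time"
)

// waiterDelay returns how long to wait before the first poll of a target's
// deregistration status. Since deregistration cannot complete before the
// target group's configured deregistration delay, polling earlier only burns
// API quota; jitter is added on top of the delay to spread calls out.
func waiterDelay(deregistrationDelay time.Duration) time.Duration {
	jitter := time.Duration(rand.Int63n(int64(100 * time.Second)))
	return deregistrationDelay + jitter
}

func main() {
	rand.Seed(time.Now().UnixNano())
	// With the ELBv2 default delay of 300s, the first poll lands in the 300-400s range.
	fmt.Println("first poll after:", waiterDelay(300*time.Second))
}
```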
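For the queueing bullet, a rough sketch of how a concurrency cap could be derived from the number of target groups in the account, with excess instances queued. The per-minute API budget and the split into active/queued slices are illustrative assumptions, not the controller's implementation.

```go
package main

import "fmt"

// maxConcurrentTerminations estimates how many instances can be drained at
// once without exhausting an assumed per-minute ELBv2 API budget, given that
// each termination may need to poll every target group.
func maxConcurrentTerminations(targetGroupCount, apiCallsPerMinuteBudget int) int {
	if targetGroupCount == 0 {
		return apiCallsPerMinuteBudget
	}
	limit := apiCallsPerMinuteBudget / targetGroupCount
	if limit < 1 {
		limit = 1
	}
	return limit
}

func main() {
	pending := []string{"i-1", "i-2", "i-3", "i-4", "i-5"}
	// e.g. 450 target groups and an assumed budget of 900 calls/minute.
	limit := maxConcurrentTerminations(450, 900)

	active, queued := pending, []string{}
	if len(pending) > limit {
		active, queued = pending[:limit], pending[limit:]
	}
	fmt.Println("terminating now:", active)
	fmt.Println("queued (may need a longer hook timeout):", queued)
}
```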
