
AWS DescribeAutoScalingGroups requests too aggressive - API limits reached #252

Closed
kylegato opened this issue Aug 23, 2017 · 26 comments
Labels
area/cluster-autoscaler area/provider/aws Issues or PRs related to aws provider help wanted Denotes an issue that needs help from a contributor. Must meet "help wanted" guidelines.

Comments

@kylegato

kylegato commented Aug 23, 2017

Our cluster-autoscaler pods are dying frequently due to the fact that AWS is rate limiting the API calls.

In a 5-minute period, we logged over 1,509 calls in CloudTrail for "DescribeAutoScalingGroups".

Here's the error that causes cluster-autoscaler to crash: https://gist.github.com/kylegato/9e2a183eca549572ce0e0082c4381dab

This also prevents us from loading the ASG page on the console.

      - command:
        - ./cluster-autoscaler
        - --v=4
        - --stderrthreshold=info
        - --cloud-provider=aws
        - --skip-nodes-with-local-storage=false
        - --expander=least-waste
        - --node-group-auto-discovery=asg:tag=k8s.io/cluster-autoscaler/enabled,kubernetes.io/cluster/s2-us-west-2
        env:
        - name: AWS_REGION
          value: us-west-2
        image: gcr.io/google_containers/cluster-autoscaler:v0.6.0

It's also important to note that I run two clusters with this.

Node Count (Cluster 1): 5 nodes
Node Count (Cluster 2): 50 Nodes

@MaciekPytel
Contributor

My initial guess would be that it may be related to using node-group-auto-discovery feature. @kylegato can you try running without using it (configuring your node groups statically with --nodes)?

@mumoshu I suspect the PollingAutoscaler is probably doing a DescribeAutoScalingGroup API call for each node group on each loop (https://github.com/kubernetes/autoscaler/blob/master/cluster-autoscaler/core/polling_autoscaler.go#L81). Does that sound right? How can we limit this?

@kylegato
Author

@MaciekPytel Confirmed that using static node group definitions seems to work; I'm no longer hitting the limits and my container is not crashing.

I'll keep an eye on this issue to determine when/if it's safe to switch back to auto discovery.

Thanks!

@mumoshu
Contributor

mumoshu commented Aug 24, 2017

@MaciekPytel Thanks for your support.

Hmm, this wasn't the expected behavior, but basically I think you're correct.
We'd probably need to retain part of the whole CA state - the cache of ASGs - across CA loops.


Just to start by syncing up with you, CA with automatic discovery is expected to work like this:

  1. Detect target ASGs on each scan interval.

    • This is done by calling DescribeAutoScalingGroups once per maxRecordsReturnedByAPI=100 ASGs; in other words, the API calls are batched. Therefore, if @kylegato had fewer than 100 ASGs in total, the number of additional API calls made by the automatic discovery feature during those 5 minutes should have been 300 seconds / 10-second scan interval * 2 clusters = 60, which is far lower than 1,509. (A rough sketch of this batched call follows the list.)
  2. ...and the rest of the process is the same as for CA without automatic discovery.
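
Just to illustrate step 1, here's a minimal sketch (not the actual CA code; the function names are made up) of what such a batched discovery pass looks like with aws-sdk-go, where one paginated DescribeAutoScalingGroups request covers up to 100 ASGs:

// Sketch only: batched ASG discovery as described in step 1; names are illustrative.
package main

import (
	"fmt"

	"github.com/aws/aws-sdk-go/aws"
	"github.com/aws/aws-sdk-go/aws/session"
	"github.com/aws/aws-sdk-go/service/autoscaling"
)

const maxRecordsReturnedByAPI = 100

func discoverASGs(svc *autoscaling.AutoScaling) ([]*autoscaling.Group, error) {
	var groups []*autoscaling.Group
	input := &autoscaling.DescribeAutoScalingGroupsInput{
		// One call returns up to 100 ASGs, so discovery needs only
		// ceil(totalASGs / 100) calls per scan interval.
		MaxRecords: aws.Int64(maxRecordsReturnedByAPI),
	}
	err := svc.DescribeAutoScalingGroupsPages(input,
		func(page *autoscaling.DescribeAutoScalingGroupsOutput, lastPage bool) bool {
			groups = append(groups, page.AutoScalingGroups...)
			return true // keep paging until all ASGs are listed
		})
	return groups, err
}

func main() {
	svc := autoscaling.New(session.Must(session.NewSession()))
	groups, err := discoverASGs(svc)
	if err != nil {
		panic(err)
	}
	fmt.Printf("discovered %d ASGs\n", len(groups))
}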

So I'd say something wrong is happening in step 2.

AFAIK, CA calls DescribeAutoScalingGroups multiple times per ASG per loop unless the ASGs are cached. On the other hand, PollingAutoscaler recreates the whole CA state, throwing away the ASG cache. Indeed, this would result in a huge number of API calls.

It could be fixed with an implementation of what we discussed in the original PR of the automatic discovery feature.
Basically, we somehow retain the cache of ASGs - awsCloudProvider.awsManager.autoScalingGroups - across iterations of PollingAutoscaler.
We would also need to evict old ASGs from the cache. Otherwise CA could end up unable to scale in/out against an ASG with an outdated cache entry, because the entry contains the min and max sizes of the ASG.
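
To make that concrete, here's a rough sketch of such a retained cache with time-based eviction; the types and field names are mine, not the actual awsManager code:

// Illustrative sketch only: a shared ASG cache that survives PollingAutoscaler
// rebuilds and evicts entries that have not been re-discovered recently.
package asgcache

import (
	"sync"
	"time"
)

// Asg holds the subset of ASG state CA cares about (fields are assumptions).
type Asg struct {
	Name    string
	MinSize int
	MaxSize int
}

type entry struct {
	asg      Asg
	lastSeen time.Time
}

type Cache struct {
	mu      sync.Mutex
	entries map[string]entry
	ttl     time.Duration // evict ASGs not seen for this long
}

func NewCache(ttl time.Duration) *Cache {
	return &Cache{entries: map[string]entry{}, ttl: ttl}
}

// Update is called after each discovery pass with the freshly discovered ASGs.
func (c *Cache) Update(discovered []Asg) {
	c.mu.Lock()
	defer c.mu.Unlock()
	now := time.Now()
	for _, a := range discovered {
		c.entries[a.Name] = entry{asg: a, lastSeen: now}
	}
	// Evict stale entries so min/max sizes never come from an outdated ASG.
	for name, e := range c.entries {
		if now.Sub(e.lastSeen) > c.ttl {
			delete(c.entries, name)
		}
	}
}

// Get returns the cached ASG, if any, without another API call.
func (c *Cache) Get(name string) (Asg, bool) {
	c.mu.Lock()
	defer c.mu.Unlock()
	e, ok := c.entries[name]
	return e.asg, ok
}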

@mumoshu
Contributor

mumoshu commented Aug 24, 2017

Btw, increasing --scan-interval will also alleviate the problem.

@MaciekPytel
Contributor

@mumoshu I may be missing something, but it seems to me that the PollingAutoscaler only replaces the existing autoscaler if the config has changed. So we'd only do all the extra calls due to lost cache if the config actually changed (which should be very rare, so we probably don't care too much). I don't think the autoscaler was actually replaced recently in @kylegato's log, as there are nodes that were unneeded for 2+ minutes and the timers would reset if the autoscaler was replaced.

So my intuition would be that those calls somehow come from building a new autoscaler instance in L81 (obvious culprit would be recreating cloudprovider, but it may well be something else in there). So the followup action would be to determine where those calls are coming from and if we can do something about them. Perhaps we could try to get a new list of node group ids without building an autoscaler object?

Finally, I don't like increasing --scan-interval as a solution to this problem as it hurts autoscaler's reaction time badly. Maybe we can just limit the rate of polling by doing it once every few loops as opposed to every loop? Not sure about that, WDYT?
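
For illustration, a minimal sketch of the "poll once every few loops" idea; the interfaces below are simplified stand-ins, not the real PollingAutoscaler types:

// Illustrative sketch: rate-limit the config-change poll to once every N
// autoscaler loops instead of every loop.
package sketch

import "time"

type Autoscaler interface {
	RunOnce(currentTime time.Time) error
}

type Poller interface {
	Poll() error // re-discovers node groups; may rebuild the autoscaler
}

type throttledPollingAutoscaler struct {
	autoscaler     Autoscaler
	poller         Poller
	pollEveryN     int // e.g. 6 loops, roughly once a minute with --scan-interval=10s
	loopsSincePoll int
}

func (t *throttledPollingAutoscaler) RunOnce(now time.Time) error {
	t.loopsSincePoll++
	if t.loopsSincePoll >= t.pollEveryN {
		t.loopsSincePoll = 0
		if err := t.poller.Poll(); err != nil {
			return err
		}
	}
	return t.autoscaler.RunOnce(now)
}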

@mumoshu
Contributor

mumoshu commented Aug 25, 2017

@kylegato Btw, in case you're using kube-aws 0.9.7 with CA updated to 0.6 by hand while enabling the node draining feature, I'd recommend upgrading to 0.9.8-rc.1.
There is an improvement in the node draining feature which reduces the number of DescribeAutoScalingGroups calls.

Please see kubernetes-retired/kube-aws#782 for more information if that's the case.

@mumoshu
Contributor

mumoshu commented Aug 25, 2017

@MaciekPytel

I may be missing something, but it seems to me that the PollingAutoscaler only replaces the existing autoscaler if the config has changed

Yes, and that's done by step 1 described in my previous comment.
Step 1 happens before a replacement actually occurs, in order to detect changes in ASGs.
So I believe it is still one DescribeAutoScalingGroups call per interval in the normal case.

So the followup action would be to determine where those calls are coming from and if we can do something about them.

Although I'm still not sure where the cause could be, I agree with your suggestion.
How about logging every DescribeAutoScalingGroups call, perhaps with glog.V(5).Infof - a log level one higher than the one used for logging AWS API errors?

Perhaps we could try to get a new list of node group ids without building an autoscaler object?

Sorry if I'm not following you correctly, but AFAICS, rebuilding an autoscaler object won't trigger more DescribeAutoScalingGroups calls than described in step 1 above.

Anyway, your suggestion certainly seems feasible to me: extract the ASG auto-discovery logic from the AWS cloud provider implementation into somewhere else, and then call it from PollingAutoscaler.
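
For the logging idea, something like this would do (a sketch, assuming a hypothetical autoScalingAPI interface around the AWS client; this is not the provider's actual wrapper):

// Illustrative sketch: wrap the AutoScaling client so every
// DescribeAutoScalingGroups call is logged at verbosity 5.
package sketch

import (
	"github.com/aws/aws-sdk-go/service/autoscaling"
	"github.com/golang/glog"
)

// autoScalingAPI is a hypothetical stand-in for the interface the AWS cloud
// provider uses internally; only the call relevant here is shown.
type autoScalingAPI interface {
	DescribeAutoScalingGroups(*autoscaling.DescribeAutoScalingGroupsInput) (*autoscaling.DescribeAutoScalingGroupsOutput, error)
}

type loggingAutoScaling struct {
	inner autoScalingAPI
}

func (l *loggingAutoScaling) DescribeAutoScalingGroups(in *autoscaling.DescribeAutoScalingGroupsInput) (*autoscaling.DescribeAutoScalingGroupsOutput, error) {
	glog.V(5).Infof("calling DescribeAutoScalingGroups: %v", in)
	out, err := l.inner.DescribeAutoScalingGroups(in)
	if err != nil {
		glog.V(4).Infof("DescribeAutoScalingGroups failed: %v", err)
	}
	return out, err
}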

@mumoshu
Contributor

mumoshu commented Aug 25, 2017

@kylegato Btw, are you sure that a considerable portion of the 1,509 calls was made by CA?
I'm still wondering whether most of these calls might have been made by the node draining feature of kube-aws, as in my previous comment.

@mumoshu
Contributor

mumoshu commented Aug 25, 2017

@MaciekPytel Regarding --scan-interval, I agree with you. Increasing it should only be done as a last resort.

@kylegato
Author

I'm running kube-aws GIT (from master) version 65cb8efbf769aa3c5fc9459f4ed37fefa11e354a

I'm assuming the calls are from CA because I don't really have much else going on in this AWS account. I can look into it in a bit and report back; I'll need to pull my logs from S3 and parse them.

@mtcode
Contributor

mtcode commented Aug 25, 2017

I've also seen API call quota limits exceeded in the AWS web console, especially during periods of rapid scaling. This is without auto-discovery enabled, but with the scan interval set to 15 seconds.

@mwielgus mwielgus added area/provider/aws Issues or PRs related to aws provider area/cluster-autoscaler labels Aug 30, 2017
@bismarck

bismarck commented Sep 18, 2017

@mumoshu @MaciekPytel I'm also seeing this issue with automatic discovery enabled.

Here's what I'm seeing in my CloudTrail logs for DescribeAutoScalingGroups requests over a 24-hour period:

[chart: describeautoscalinggroups]

During the same period, here is the number of "Regenerating ASG information for" messages in the CA log:

[chart: regeneratingasg]

I noticed the issue starts about an hour after starting CA, which makes me think the problem is with this cache refresh code.

Here's a snippet from my CA logs right before the hour expires:
CA-Issue.txt

@mtcode
Contributor

mtcode commented Sep 18, 2017

Commit f04113d has definitely helped with this, but there is still some throttling of API calls by AWS.

@MaciekPytel MaciekPytel added the help wanted Denotes an issue that needs help from a contributor. Must meet "help wanted" guidelines. label Oct 18, 2017
@MaciekPytel
Contributor

I've closed #400 after @bismarck pointed out it's a duplicate of this. I'm reasonably confident that cache-refresh goroutines leaked by PollingAutoscaler are the root cause here. It explains the gradual buildup of API calls observed by @bismarck very well, and reading the code seems to confirm it.

@mumoshu Do you want to take a shot at fixing this?

@mumoshu
Contributor

mumoshu commented Oct 20, 2017

@MaciekPytel Thanks for reaching out, but I'm afraid I can't work on it in the short term. However, please assign it to me if you're also short on time; I'll try to manage it.

@MaciekPytel
Contributor

I don't use aws at all, so I can't fix it. I can review and provide any other help if someone else wants to take a shot at this.

@mmerrill3
Contributor

mmerrill3 commented Oct 20, 2017

I looked at this with @bismarck and noted that the autoscaler (including the AWS manager and auto_scaling_groups with its goroutine) is recreated on every poll. This results in a goroutine (that runs forever) being created every 10 seconds, since our scan interval is 10 seconds. The goroutine is there to refresh the cached ASG information every hour.
I think the function of that goroutine (creating a fresh view of the ASGs in AWS) should be exposed as a separate method in auto_scaling_groups.go. Then users of auto_scaling_groups can decide when to refresh the cache, and polling_autoscaler.go wouldn't have to create a new autoscaler every poll interval - it could just refresh the cache.
The static_autoscaler would then have to keep track of when to call the new refresh-cache method, probably in its own long-running goroutine.
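
Here is a rough sketch of what that refactor could look like; the names are illustrative, not the actual auto_scaling_groups.go code:

// Illustrative sketch only; names do not match the real auto_scaling_groups.go.
package sketch

import (
	"sync"
	"time"
)

type autoScalingGroups struct {
	mu     sync.Mutex
	cached map[string]int // ASG name -> cached size (placeholder content)
}

// Leaky pattern: the constructor starts a goroutine that runs forever, so
// recreating this object on every 10s poll leaks one goroutine per poll.
func newAutoScalingGroupsLeaky() *autoScalingGroups {
	a := &autoScalingGroups{cached: map[string]int{}}
	go func() {
		for range time.Tick(time.Hour) {
			a.regenerateCache()
		}
	}()
	return a
}

// Fixed pattern: no background goroutine; the owner decides when to refresh.
func newAutoScalingGroups() *autoScalingGroups {
	return &autoScalingGroups{cached: map[string]int{}}
}

// RegenerateCache exposes the refresh as a method, so polling_autoscaler.go
// (or static_autoscaler) can call it instead of rebuilding the autoscaler.
func (a *autoScalingGroups) RegenerateCache() {
	a.regenerateCache()
}

func (a *autoScalingGroups) regenerateCache() {
	a.mu.Lock()
	defer a.mu.Unlock()
	// ... call DescribeAutoScalingGroups here and rebuild a.cached ...
}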

@MaciekPytel
Contributor

The cache itself is an implementation detail of a given cloudprovider and I'd rather not have static_autoscaler know about it. However, I'm thinking about adding some sort of general-purpose refresh method to CloudProvider that would be called once per loop. The implementation of such a method could maintain an internal counter and refresh the cache every N loops, or after M minutes have elapsed since the last refresh, or via whatever other internal mechanism makes sense.

PollingAutoscaler was implemented a while ago, when StaticAutoscaler couldn't handle adding or removing a NodeGroup. So it recreates the whole StaticAutoscaler when the config changes (unfortunately, it also creates a temporary one to check whether the config has changed). However, we've recently updated StaticAutoscaler and it's almost ready to handle NodeGroups being dynamically added or removed at the cloudprovider level (this is how a GKE equivalent of PollingAutoscaler will work). I think the best long-term solution is to reimplement the PollingAutoscaler functionality using the same approach, but I would wait 2 more weeks or so until we finish all the necessary prerequisites.

Since my long-term idea would require a large-ish refactor, I'm OK to review and merge any reasonable short-term fix for this specific issue.
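
To illustrate the shape of such a refresh hook (a sketch under the assumption of a Refresh method on the provider; not a finalized interface):

// Illustrative sketch: a Refresh() hook on the cloud provider, called once per
// autoscaler loop, with the provider throttling real work internally.
package sketch

import "time"

type CloudProvider interface {
	// Refresh is called once per main loop before anything else.
	Refresh() error
}

type awsProvider struct {
	lastRefresh     time.Time
	refreshInterval time.Duration // e.g. 1 hour, matching the old cache TTL
}

func (p *awsProvider) Refresh() error {
	if time.Since(p.lastRefresh) < p.refreshInterval {
		return nil // nothing to do this loop; no AWS API calls made
	}
	p.lastRefresh = time.Now()
	return p.regenerateASGCache()
}

func (p *awsProvider) regenerateASGCache() error {
	// ... one batched DescribeAutoScalingGroups pass would go here ...
	return nil
}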

@mumoshu
Contributor

mumoshu commented Oct 31, 2017

In case you missed it, the fix is ongoing at #422 by @mmerrill3!

@mwielgus
Contributor

mwielgus commented Nov 3, 2017

Fixed in #422 and cherrypicked to 1.0.x in #445. Closing.

@johanneswuerbach
Contributor

We are running v1.0.2 and are still seeing various autoscaler failures and crashes due to Throttling: Rate exceeded (status code: 400).

@mumoshu
Contributor

mumoshu commented Nov 16, 2017

@johanneswuerbach Would you mind sharing with us:

(1) Number of ASGs you have in your cluster
(2) Number of ASGs in your AWS account
(3) When does CA seem to crash more often - while scaling up, scaling down, etc.?
(4) Does the number of API calls still increase gradually, as we've seen before?

At this point, for me, the problematic part left in the code base seems to be [embedded code reference not captured] only.

In case you're already calling the AWS API a lot from other systems, perhaps CA should be improved to retry API calls with back-off when rate limited.
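
For example, something along these lines (a sketch only; this is not what CA currently does):

// Illustrative sketch: retry an AWS call with exponential back-off when the
// error is a throttling error, instead of crashing the whole process.
package sketch

import (
	"time"

	"github.com/aws/aws-sdk-go/aws/awserr"
)

func isThrottling(err error) bool {
	if aerr, ok := err.(awserr.Error); ok {
		// "Throttling" is the code seen in the "Rate exceeded" errors reported above.
		return aerr.Code() == "Throttling"
	}
	return false
}

// withBackoff retries call up to maxRetries times, doubling the wait each time.
func withBackoff(call func() error, maxRetries int) error {
	wait := time.Second
	var err error
	for i := 0; i <= maxRetries; i++ {
		if err = call(); err == nil || !isThrottling(err) {
			return err
		}
		time.Sleep(wait)
		wait *= 2
	}
	return err
}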

@johanneswuerbach
Contributor

johanneswuerbach commented Nov 16, 2017

We currently have 5 clusters with 6 ASGs each (3 for HA masters, 3 for nodes in each AZ) so 30 Auto Scaling Groups in total.

Each of those clusters is running an autoscaler instance.

This is the smaller of our two accounts (we see crashes in both) and it generally doesn't have any scaling activity (it's our staging env).

Crashes seem to mostly occur at return nil, fmt.Errorf("Failed to get ASGs: %v", err), which fails with Throttling: Rate exceeded.

We currently don't feed AWS API calls into our log aggregation system, but I will try to get those numbers.

@bismarck

I just checked CA 1.0.2 on our cluster (which has a similar setup to @johanneswuerbach's) and we haven't seen a crash yet. Here is the number of "Regenerating ASG information for" messages in the CA log over a 24-hour period:

[chart: regeneratingasg-102]

This looks like what was expected after the fix. I still need to check my CloudTrail logs for DescribeAutoScalingGroups requests.

@mumoshu
Contributor

mumoshu commented Nov 19, 2017

@johanneswuerbach Thanks for sharing the details! It seems I'm unable to figure anything out right now.

Would you mind sharing the full error message and what you are passing to --node-group-auto-discovery?

At least in theory, the describe-tags calls shouldn't be numerous enough to trigger a rate limit, so I suspect some pattern or combination of tags is triggering it.

@johanneswuerbach
Contributor

@mumoshu I'm passing --node-group-auto-discovery=asg:tag=k8s.io/cluster-autoscaler/enabled,kubernetes.io/cluster/<name>; the full command is:

            - ./cluster-autoscaler
            - --stderrthreshold=info
            - --cloud-provider=aws
            - --skip-nodes-with-local-storage=false
            - --expander=least-waste
            - --node-group-auto-discovery=asg:tag=k8s.io/cluster-autoscaler/enabled,kubernetes.io/cluster/<name>
            - --balance-similar-node-groups
            - --scale-down-utilization-threshold=0.7

I'm seeing the following two kinds of errors:

E1120 06:39:32.367506       1 static_autoscaler.go:128] Failed to update node registry: Throttling: Rate exceeded
	status code: 400, request id: 90f99c8a-cdbd-11e7-891a-590162651f15

followed by a crash:

I1120 06:09:02.367838       1 leaderelection.go:174] attempting to acquire leader lease...
I1120 06:09:02.384670       1 leaderelection.go:184] successfully acquired lease kube-system/cluster-autoscaler
F1120 06:33:06.092320       1 cloud_provider_builder.go:111] Failed to create AWS cloud provider: Failed to get ASGs: Throttling: Rate exceeded
	status code: 400, request id: aac62ecf-cdbc-11e7-ab1b-8be915c8bac1
goroutine 77 [running]:
k8s.io/autoscaler/cluster-autoscaler/vendor/github.com/golang/glog.stacks(0xc420ab4300, 0xc420f64780, 0xd4, 0x1e0)
	/gopath/src/k8s.io/autoscaler/cluster-autoscaler/vendor/github.com/golang/glog/glog.go:766 +0xa7
k8s.io/autoscaler/cluster-autoscaler/vendor/github.com/golang/glog.(*loggingT).output(0x518d4e0, 0xc400000003, 0xc420c1e000, 0x4e0ea3f, 0x19, 0x6f, 0x0)
	/gopath/src/k8s.io/autoscaler/cluster-autoscaler/vendor/github.com/golang/glog/glog.go:717 +0x348
k8s.io/autoscaler/cluster-autoscaler/vendor/github.com/golang/glog.(*loggingT).printf(0x518d4e0, 0x3, 0x3434a38, 0x27, 0xc4218a1080, 0x1, 0x1)
	/gopath/src/k8s.io/autoscaler/cluster-autoscaler/vendor/github.com/golang/glog/glog.go:655 +0x14f
k8s.io/autoscaler/cluster-autoscaler/vendor/github.com/golang/glog.Fatalf(0x3434a38, 0x27, 0xc4218a1080, 0x1, 0x1)
	/gopath/src/k8s.io/autoscaler/cluster-autoscaler/vendor/github.com/golang/glog/glog.go:1145 +0x67
k8s.io/autoscaler/cluster-autoscaler/cloudprovider/builder.CloudProviderBuilder.Build(0x7ffc1e022686, 0x3, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, ...)
	/gopath/src/k8s.io/autoscaler/cluster-autoscaler/cloudprovider/builder/cloud_provider_builder.go:111 +0x756
k8s.io/autoscaler/cluster-autoscaler/core.NewAutoscalingContext(0xa, 0x3fe6666666666666, 0x8bb2c97000, 0x1176592e000, 0x0, 0x4e200, 0x0, 0x186a00000, 0x0, 0x7ffc1e0226e3, ...)
	/gopath/src/k8s.io/autoscaler/cluster-autoscaler/core/autoscaling_context.go:147 +0x466
k8s.io/autoscaler/cluster-autoscaler/core.NewStaticAutoscaler(0xa, 0x3fe6666666666666, 0x8bb2c97000, 0x1176592e000, 0x0, 0x4e200, 0x0, 0x186a00000, 0x0, 0x7ffc1e0226e3, ...)
	/gopath/src/k8s.io/autoscaler/cluster-autoscaler/core/static_autoscaler.go:55 +0x14d
k8s.io/autoscaler/cluster-autoscaler/core.(*AutoscalerBuilderImpl).Build(0xc42090cb60, 0x0, 0x0, 0xc4218a1b38, 0x1107e60)
	/gopath/src/k8s.io/autoscaler/cluster-autoscaler/core/autoscaler_builder.go:71 +0x10e
k8s.io/autoscaler/cluster-autoscaler/core.(*PollingAutoscaler).Poll(0xc4219e28a0, 0x4, 0xed1a46e1d)
	/gopath/src/k8s.io/autoscaler/cluster-autoscaler/core/polling_autoscaler.go:81 +0x5a
k8s.io/autoscaler/cluster-autoscaler/core.(*PollingAutoscaler).RunOnce(0xc4219e28a0, 0xed1a46e1d, 0xe2fd62e5e, 0x518d0a0, 0x518d0a0, 0x0)
	/gopath/src/k8s.io/autoscaler/cluster-autoscaler/core/polling_autoscaler.go:68 +0x76
main.run(0xc420a02870)
	/gopath/src/k8s.io/autoscaler/cluster-autoscaler/main.go:262 +0x46f
main.main.func2(0xc420bd4540)
	/gopath/src/k8s.io/autoscaler/cluster-autoscaler/main.go:344 +0x2a
created by k8s.io/autoscaler/cluster-autoscaler/vendor/k8s.io/client-go/tools/leaderelection.(*LeaderElector).Run
	/gopath/src/k8s.io/autoscaler/cluster-autoscaler/vendor/k8s.io/client-go/tools/leaderelection/leaderelection.go:145 +0x97

Could it be the k8s.io/cluster-autoscaler/enabled tag, which is shared by all clusters?
