Cluster autoscaler scales up too much #43709

Closed
mwielgus opened this issue Mar 27, 2017 · 4 comments · Fixed by #43745

@mwielgus (Contributor)

On some occasions Cluster Autoscaler may scale the cluster up twice. This results in the cluster growing larger than needed. The size returns to normal soon after, but the user is forced to pay for the unneeded node for 10 minutes to 1 hour, depending on the cloud provider.

cc: @MaciekPytel @fgrzadkowski @ethernetdan @enisoc

@mwielgus mwielgus added priority/P1 sig/autoscaling Categorizes an issue or PR as relevant to SIG Autoscaling. labels Mar 27, 2017
@mwielgus mwielgus added this to the v1.6 milestone Mar 27, 2017
@mwielgus mwielgus self-assigned this Mar 27, 2017
@mwielgus mwielgus changed the title from "Cluster autoscaler scales to far" to "Cluster autoscaler scales up too much" Mar 27, 2017
@calebamiles calebamiles modified the milestones: v1.6.1, v1.6 Mar 27, 2017
@bgrant0607 (Member)

@mwielgus Please provide more information.

@bgrant0607 bgrant0607 added the priority/important-soon Must be staffed and worked on either currently, or very soon, ideally in time for the next release. label Mar 27, 2017
@mwielgus (Contributor, Author)

@bgrant0607
Imagine that you have just created a pod that cannot be scheduled. CA expands one of the MIGs to accommodate it. Just as the node arrives there is a brief moment when the pod is still pending (the scheduler hasn't retried it yet) even though the node is available. The scale-up is over, yet the pod is still pending, so we might try to scale up again, which would be incorrect. To protect against such situations we run a simulated scheduling pass over all nodes: if some node can accommodate a pending pod, that pod is not treated as truly pending and does not trigger a scale-up.
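
In rough pseudocode the filter looks like this (a simplified sketch; helper names such as `podFitsOnNode` are illustrative, not the actual CA code):

```go
package sketch

import apiv1 "k8s.io/api/core/v1"

// podFitsOnNode stands in for running the scheduler predicates for a pod
// against a node; illustrative stub only.
func podFitsOnNode(pod *apiv1.Pod, node *apiv1.Node) bool {
	// The real check runs the same predicates the scheduler uses.
	return false
}

// filterTrulyPendingPods keeps only the pods that no existing node can host.
// Only these pods are allowed to trigger a scale-up.
func filterTrulyPendingPods(pendingPods []*apiv1.Pod, nodes []*apiv1.Node) []*apiv1.Pod {
	var unschedulable []*apiv1.Pod
	for _, pod := range pendingPods {
		fits := false
		for _, node := range nodes {
			if podFitsOnNode(pod, node) {
				fits = true
				break
			}
		}
		if !fits {
			unschedulable = append(unschedulable, pod)
		}
	}
	return unschedulable
}
```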

After a scale-up we create "virtual" nodes inside CA so that we don't scale up again for the same pods. Once the real node arrives, its virtual placeholder is deleted.
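
Conceptually the bookkeeping looks something like this (an illustrative sketch only, not the real implementation):

```go
package sketch

import (
	"fmt"

	apiv1 "k8s.io/api/core/v1"
)

// addVirtualNodes appends placeholder copies of a node-group template node, one
// per requested-but-not-yet-registered node, so the simulated scheduling above
// already "sees" the capacity we asked for. Illustrative sketch only.
func addVirtualNodes(nodes []*apiv1.Node, template *apiv1.Node, upcoming int) []*apiv1.Node {
	for i := 0; i < upcoming; i++ {
		placeholder := template.DeepCopy()
		placeholder.Name = fmt.Sprintf("%s-virtual-%d", template.Name, i)
		if placeholder.Labels == nil {
			placeholder.Labels = map[string]string{}
		}
		// Marker so the placeholder can be dropped once the real node registers.
		placeholder.Labels["virtual-node"] = "true"
		nodes = append(nodes, placeholder)
	}
	return nodes
}
```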

And, if I'm correct, we have a bug here. A node whose Ready condition is true but whose network is not yet ready is considered fully started but broken. Its virtual placeholder is therefore removed, so we trigger another scale-up. Seconds later the network comes up and the extra scale-up turns out to be unnecessary.

I'm currently testing a 10-line fix for this issue. After the fix is merged I would like to bump the CA version in 1.6, as keeping the old code would impact all current CA users on GCE and GKE (and possibly on other cloud providers too).
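
Roughly, the fixed check treats such a node as still starting (a simplified sketch with an illustrative startup timeout, not the exact patch):

```go
package sketch

import (
	"time"

	apiv1 "k8s.io/api/core/v1"
)

// maxNodeStartupTime is an illustrative threshold, not the value used by CA.
const maxNodeStartupTime = 15 * time.Minute

// isNodeStillStarting reports whether a recently created node should keep its
// virtual placeholder: it is still starting if it is not Ready *or* if its
// network is still reported as unavailable.
func isNodeStillStarting(node *apiv1.Node, now time.Time) bool {
	if now.Sub(node.CreationTimestamp.Time) > maxNodeStartupTime {
		return false
	}
	for _, cond := range node.Status.Conditions {
		if cond.Type == apiv1.NodeReady && cond.Status != apiv1.ConditionTrue {
			return true
		}
		if cond.Type == apiv1.NodeNetworkUnavailable && cond.Status == apiv1.ConditionTrue {
			return true
		}
	}
	return false
}
```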

@mwielgus mwielgus modified the milestones: v1.6, v1.6.1 Mar 27, 2017
@liggitt (Member) commented Mar 27, 2017

Is this a regression in cluster autoscaler for 1.6, or did the same behavior exist in 1.5?

@mwielgus (Contributor, Author)

It is a regression. It doesn't always happen (it depends on network setup timing), but when it does it is quite confusing.

k8s-github-robot pushed a commit to kubernetes-retired/contrib that referenced this issue Mar 28, 2017
Automatic merge from submit-queue

Cluster-autoscaler: Fix isNodeStarting

Fix for: kubernetes/kubernetes#43709

cc: @MaciekPytel @fgrzadkowski
@mwielgus mwielgus removed the priority/important-soon Must be staffed and worked on either currently, or very soon, ideally in time for the next release. label Mar 28, 2017
k8s-github-robot pushed a commit that referenced this issue Mar 28, 2017
Automatic merge from submit-queue

Bump cluster autoscaler to 0.5.1

Fixes: #43709

**Release note**:
```release-note
With Cluster Autoscaler 0.5 the cluster will be autoscaled even if there are some unready or broken nodes. Moreover the status of CA is exposed in kube-system/cluster-autoscaler-status config map.
```
mwielgus pushed a commit to kubernetes/autoscaler that referenced this issue Apr 18, 2017
…tarting-fix

Automatic merge from submit-queue

Cluster-autoscaler: Fix isNodeStarting

Fix for: kubernetes/kubernetes#43709

cc: @MaciekPytel @fgrzadkowski