
Unregistered nodes are not removed. #803

Closed
tcolgate opened this issue Apr 17, 2018 · 22 comments
Labels: area/cluster-autoscaler, lifecycle/rotten

Comments

@tcolgate

The logic for handling unregistered nodes seems off. If we have an unregistered node, we clearly aren't going to be able to schedule workloads on it. If the cluster size is already at minimum capacity, then removing an unregistered node fails:

I0417 10:50:13.797723       1 static_autoscaler.go:162] 1 unregistered nodes present
I0417 10:50:13.797744       1 utils.go:324] Removing unregistered node aws:https:///eu-west-1c/i-XXXXXXX
W0417 10:50:13.835734       1 utils.go:340] Failed to remove node aws:https:///eu-west-1c/i-XXXXXX: node group min size reached, skipping unregistered node removal

Under these conditions we can't remove nodes that have failed to boot. Surely it would make more sense to forcibly terminate the node?

On a related note, a metric for unregistered nodes would be useful, since the CA is the only thing that has a view of the cloud provider's list of nodes, and noticing this is quite hard without it.
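
For illustration, a minimal sketch of what such a metric could look like using the Prometheus client library the CA already exposes metrics through; the metric name and helper function are hypothetical, not an existing part of the codebase:

```go
package metrics

import "github.com/prometheus/client_golang/prometheus"

// unregisteredNodesCount is a hypothetical gauge: how many nodes the cloud
// provider reports that have not registered with the Kubernetes API server.
var unregisteredNodesCount = prometheus.NewGauge(prometheus.GaugeOpts{
	Namespace: "cluster_autoscaler",
	Name:      "unregistered_nodes_count",
	Help:      "Number of nodes known to the cloud provider but not registered in Kubernetes.",
})

func init() {
	prometheus.MustRegister(unregisteredNodesCount)
}

// UpdateUnregisteredNodesCount would be called once per autoscaler loop with
// the current number of unregistered nodes.
func UpdateUnregisteredNodesCount(count int) {
	unregisteredNodesCount.Set(float64(count))
}
```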

Happy to contribute if you agree. Was considering adding a force option to the provider node deletions.

@aleksandra-malinowska
Contributor

This logic was introduced to mitigate #369. Allowing force deletion of nodes was one of the options considered. It would require extending the cloud provider interface and implementing some kind of back-off (to avoid repeated errors on deletion attempts). It would also mean breaking the guarantees about respecting minimum and maximum node group limits. @mwielgus @MaciekPytel should we reopen this discussion?
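
As an aside, the back-off mentioned here could be as simple as the per-node sketch below (purely illustrative; the names and constants are not from the codebase): after each failed deletion attempt, the wait before the next attempt doubles, up to a cap.

```go
package backoff

import "time"

const (
	initialDelay = 1 * time.Minute
	maxDelay     = 30 * time.Minute
)

// deletionBackoff tracks, per node ID, when the next deletion attempt may run.
type deletionBackoff struct {
	nextAttempt map[string]time.Time
	delay       map[string]time.Duration
}

func newDeletionBackoff() *deletionBackoff {
	return &deletionBackoff{nextAttempt: map[string]time.Time{}, delay: map[string]time.Duration{}}
}

// CanAttempt reports whether a deletion of the given node may be tried now.
func (b *deletionBackoff) CanAttempt(id string, now time.Time) bool {
	return now.After(b.nextAttempt[id])
}

// RecordFailure doubles the wait after every failed attempt, capped at maxDelay.
func (b *deletionBackoff) RecordFailure(id string, now time.Time) {
	d := b.delay[id] * 2
	if d == 0 {
		d = initialDelay
	}
	if d > maxDelay {
		d = maxDelay
	}
	b.delay[id] = d
	b.nextAttempt[id] = now.Add(d)
}
```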

@MaciekPytel
Contributor

If I understand correctly the problem happens because the ASG won't allow CA to delete a node if it's at min size. However, CA manages min/max size constraints itself and it expects that nodes will be deleted if a call to the cloud provider is made. So from the perspective of core CA we consider any delete a force delete. I'd rather not differentiate between different kinds of deletion in core CA, as the details are very cloudprovider-specific.

It feels like the simplest solution would be to just leave managing of min/max boundaries to CA and always set ASG minimum size to 0.

@aleksandra-malinowska
Contributor

It feels like the simplest solution would be to just leave managing of min/max boundaries to CA and always set ASG minimum size to 0.

IIRC, we still wouldn't attempt to delete such node in this case: https://github.com/kubernetes/autoscaler/blob/master/cluster-autoscaler/core/utils.go#L346
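
For context, the check behind that link is roughly the following (a simplified paraphrase, not the actual source; the interface methods are those of cloudprovider.NodeGroup): a node that never registered is only deleted if its node group is currently above its configured minimum size.

```go
package core

import (
	apiv1 "k8s.io/api/core/v1"
	"k8s.io/autoscaler/cluster-autoscaler/cloudprovider"
	"k8s.io/klog/v2"
)

// removeUnregistered is a simplified paraphrase of the guard discussed above,
// not the real implementation.
func removeUnregistered(nodeGroup cloudprovider.NodeGroup, node *apiv1.Node) (bool, error) {
	size, err := nodeGroup.TargetSize()
	if err != nil {
		return false, err
	}
	if size <= nodeGroup.MinSize() {
		// This is the branch behind the "node group min size reached, skipping
		// unregistered node removal" warning quoted earlier in the thread.
		klog.Warningf("Failed to remove node %s: node group min size reached, skipping unregistered node removal", node.Name)
		return false, nil
	}
	return true, nodeGroup.DeleteNodes([]*apiv1.Node{node})
}
```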

@MaciekPytel
Contributor

You're right. For some reason (ENOCOFFEE) I thought we would delete the node.

I think it would make sense to remove the node, though it means we'd be explicitly breaking the min/max constraints specified by the user. We should at least document it somewhere if we make the change (maybe we should add a section about min/max to the FAQ, especially given that we already get a lot of questions about the existing behaviour).

@tcolgate
Author

FWIW, I'm not sure this breaches any promises about not going below the minimums, at least in the unregistered nodes case. The nodes being removed are not active registered kube nodes, so you aren't removing anything of any use.
I don't think setting the ASG min to 0 is a good approach: having a minimum on the ASG means that the min/max sizes are maintained even if the controller goes down. If you have a min of 0, a group could lose all its nodes and the group would not recover.
It's perfectly permissible to terminate an EC2 instance in an ASG even if that takes the ASG count below its minimum. The ASG simply scales things back up to meet the minimum. This is no different to manually terminating a node that may be sitting on a problematic VM instance with disk/network issues.
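
To make the AWS behaviour concrete, here is a minimal sketch using the AWS SDK for Go v1: TerminateInstanceInAutoScalingGroup with ShouldDecrementDesiredCapacity set to false removes the instance while leaving the desired capacity alone, so the ASG launches a replacement. The region and instance ID are placeholders.

```go
package main

import (
	"fmt"
	"log"

	"github.com/aws/aws-sdk-go/aws"
	"github.com/aws/aws-sdk-go/aws/session"
	"github.com/aws/aws-sdk-go/service/autoscaling"
)

func main() {
	sess := session.Must(session.NewSession(&aws.Config{Region: aws.String("eu-west-1")}))
	svc := autoscaling.New(sess)

	// Terminate one instance without decrementing desired capacity: the ASG
	// notices it is below its desired size and launches a replacement.
	out, err := svc.TerminateInstanceInAutoScalingGroup(&autoscaling.TerminateInstanceInAutoScalingGroupInput{
		InstanceId:                     aws.String("i-0123456789abcdef0"), // placeholder instance ID
		ShouldDecrementDesiredCapacity: aws.Bool(false),
	})
	if err != nil {
		log.Fatalf("terminate failed: %v", err)
	}
	fmt.Println(out.Activity)
}
```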

@aleksandra-malinowska
Contributor

aleksandra-malinowska commented Apr 18, 2018

FWIW, I'm not sure this breaches any promises about not going below the minimums, at least in the unregistered nodes case. The nodes being removed are not active registered kube nodes, so you aren't removing anything of any use.

From the user's point of view, they will notice a node group's count is below minimum. How it got there, and whether it was justified, will require investigating (checking events, logs etc.) I would prefer not to have to explain what exactly happened in such cases :)

It's perfectly permissible to terminate an EC2 instance in an ASG even if that takes the ASG count below its minimum. The ASG simply scales things back up to meet the minimum. This is no different to manually terminating a node that may be sitting on a problematic VM instance with disk/network issues.

Makes sense. This is not the case for all cloud providers, though. We can't rely on this behavior to bring the node back up outside AWS.

Thinking about it, what we would like to do in such a case isn't really removing the node, but replacing it with a healthy one. Some ways to achieve this:

  1. Force delete followed by create. Risks leaving the group below min size. Requires implementing some back-off so we don't end up crashlooping if we can't delete such a node. (On the other hand, we should probably have some back-off from deleting nodes when the group is above minimum, too.)
  2. Just create a new node in that node group. Then the node group's size will be above minimum and we'll be able to remove the old node in the next iteration. Risks "flapping" behavior (if new nodes in such a group don't register for some reason, we'll keep creating and deleting a node every ~15 minutes).
  3. Avoid putting this in Cluster Autoscaler logic. Instead, extend the cloud provider interface with a Restart() (or similar) method. Make sure we don't return with an error from the main function if this method fails. The exact implementation depends on the cloud provider (so it can take advantage of their specific features, like the one described above). Risks inconsistent behavior across cloud providers and conflicting with other node repair mechanisms they may have. (A rough sketch of such an interface extension follows this list.)
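
As a purely illustrative sketch of option 3 (none of these names exist in the real cloudprovider package), the extension could be an optional interface discovered via a type assertion, so providers that can't repair a node don't have to change at all:

```go
package repair

import (
	apiv1 "k8s.io/api/core/v1"
	"k8s.io/autoscaler/cluster-autoscaler/cloudprovider"
)

// RestartableNodeGroup is a hypothetical optional extension of NodeGroup:
// providers that can recreate/restart a broken VM implement it, others don't.
type RestartableNodeGroup interface {
	cloudprovider.NodeGroup

	// Restart asks the cloud provider to recreate or restart the given
	// unregistered node without changing the group's target size.
	Restart(node *apiv1.Node) error
}

// tryRestart shows how core code could use the extension without requiring it:
// the type assertion simply fails for providers that don't implement it.
func tryRestart(ng cloudprovider.NodeGroup, node *apiv1.Node) error {
	if r, ok := ng.(RestartableNodeGroup); ok {
		return r.Restart(node)
	}
	return nil // no repair support; the caller logs and moves on
}
```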

@tcolgate
Author

Points 1 and 2 make it sound like you really don't want to use the native cloud provider's ASGs at all. If you manually remove and recreate nodes you may end up racing with the cloud provider's actual ASG logic (what happens if it creates a new node while you are recreating one?). If you adjust the ASG size before you do it, you might as well do away with the provider-side ASG completely. On balance, though, that does not seem sensible.

On point 2, what to do if creating a new node takes you over the max? It doesn't seem completely crazy that a user might create a node group with min == max (e.g. creating a single spot instance node in a specific AZ). As things stand, if min == max, then you won't remove unregistered nodes, and wouldn't be able to add one. The user has no actual registered kube nodes, but CA isn't going to take any action to rectify the situation.

@aleksandra-malinowska
Contributor

On point 2, what to do if creating a new node takes you over the max? It doesn't seem completely crazy that a user might create a node group with min == max (e.g. creating a single spot instance node in a specific AZ). As things stand, if min == max, then you won't remove unregistered nodes, and wouldn't be able to add one. The user has no actual registered kube nodes, but CA isn't going to take any action to rectify the situation.

Fair enough, we probably can't assume min != max in an AWS setup with auto-discovery (which is what I assume you're referring to). In GKE, validation enforces max > min when enabling autoscaling for a node pool. Arguably, setting max == min means we shouldn't autoscale that group (or is there another way to express that?), so ignoring it sounds reasonable.

As for the case with no registered nodes, Cluster Autoscaler doesn't attempt to fix a cluster that it considers too unhealthy anyway.

Points 1 and 2 make it sound like you really don't want to use the native cloud provider's ASGs at all.

Not at all, that's what point 3 is about. It's hard to abstract over what different cloud providers offer, and we can't rely on their specific features in shared code.

Does it mean you'd rather go for 3, or are you proposing another solution?

@tcolgate
Author

tcolgate commented Apr 18, 2018 via email

@MaciekPytel
Contributor

I don't like the idea of extending the NodeGroup interface unless absolutely necessary. It's already very hard to implement a CA cloudprovider, especially for any sort of on-prem environment. Any extra functionality we require will only make that problem worse.

Points 1 and 2 make it sound like you really don't want to use the native cloud provider's ASGs at all.

I would say that's actually true to a large extent. The concept of a NodeGroup is a resizable set of identical nodes; an ASG just happens to be an implementation of that concept. We generally don't want logic specific to a given cloud provider in core (at least not as a long-term solution; we have been more liberal for short-term hacks). If anything I'd like to simplify the expectations we put on cloud providers as much as possible to enable on-prem / private cloud use cases (we even had requests to make it possible to implement a cloudprovider as a set of a few shell scripts that would provision a VM: #283 (comment)).

@tcolgate
Author

tcolgate commented May 8, 2018

I think it's worth considering what the behaviour is when the CA isn't running. If you set the min/max in the cloud provider's autoscaling group (at least on Amazon), the cloud provider will make sure the group size stays within the bounds. If you rely on CA to actually manage the group size (rather than just the bounds), then you may possibly end up with bootstrapping issues (e.g. multizone kops creates one ASG per master in each zone with min == max == 1; admittedly those aren't managed by CA).

@aleksandra-malinowska
Contributor

I don't like the idea of extending the NodeGroup interface unless absolutely necessary. It's already very hard to implement a CA cloudprovider, especially for any sort of on-prem environment. Any extra functionality we require will only make that problem worse.

This method would be optional, i.e., a stub implementation returning a "not implemented" error would result in the same behavior as we have now (log an error and go on).
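
A minimal sketch of that stub-and-skip pattern (the sentinel error, types, and names here are illustrative, not the actual cloudprovider API):

```go
package repair

import (
	"errors"

	apiv1 "k8s.io/api/core/v1"
	"k8s.io/klog/v2"
)

// errNotImplemented is an illustrative sentinel a provider's stub would return.
var errNotImplemented = errors.New("not implemented")

// noRepairNodeGroup is the stub case: a provider with no repair support.
type noRepairNodeGroup struct{}

func (noRepairNodeGroup) Restart(_ *apiv1.Node) error { return errNotImplemented }

// restartOrSkip shows the intended caller behaviour: log and carry on rather
// than failing the whole autoscaler iteration.
func restartOrSkip(ng interface{ Restart(*apiv1.Node) error }, node *apiv1.Node) {
	if err := ng.Restart(node); err != nil {
		klog.Warningf("restart of unregistered node %s skipped: %v", node.Name, err)
	}
}
```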

I think it's worth considering what the behaviour is when the CA isn't running. If you set the min/max in the cloud provider's autoscaling group (at least on Amazon), the cloud provider will make sure the group size stays within the bounds. If you rely on CA to actually manage the group size (rather than just the bounds), then you may possibly end up with bootstrapping issues (e.g. multizone kops creates one ASG per master in each zone with min == max == 1; admittedly those aren't managed by CA).

On GCP, there are no min/max limits in MIGs. Those can only be set for autoscaling (either in GCE/MIG autoscaler, or Cluster Autoscaler.) A MIG stays at a target size (restarting instances that fail health check etc.) unless a user (or an autoscaler) resizes it. Explicitly requesting instance deletion brings the target size down.

IIUC, on AWS the limits work more like a self-imposed quota: they can be set even if autoscaling is disabled, and a user's request to resize the group outside those limits will fail?

@tcolgate
Author

re:
From the user's point of view, they will notice a node group's count is below minimum. How it got there, and whether it was justified, will require investigating (checking events, logs etc.) I would prefer not to have to explain what exactly happened in such cases :)

In practical terms this is quite a non-trivial thing to notice. In practice, to spot and fix this I end up finding the unregistered nodes listed in the CA logs and restarting them manually. It also means I am perpetually nervous about the number of nodes I'm actually running vs. the number usefully contributing to the cluster (wasted money, which keeps me up at night). Some additional metrics might help, but ultimately it still seems wrong that, as things stand, we could have a node group with a min size of 20 which actually has 0 registered/useful running nodes, and CA can't fix it. In that situation I don't think promises around not going below minSize are actually useful promises.

@fejta-bot

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Sep 11, 2018
@aleksandra-malinowska
Contributor

/remove-lifecycle stale

@k8s-ci-robot k8s-ci-robot removed the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Sep 11, 2018
@fejta-bot

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Dec 10, 2018
@fejta-bot

Stale issues rot after 30d of inactivity.
Mark the issue as fresh with /remove-lifecycle rotten.
Rotten issues close after an additional 30d of inactivity.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle rotten

@k8s-ci-robot k8s-ci-robot added lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. and removed lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. labels Jan 9, 2019
@aleksandra-malinowska
Contributor

/remove-lifecycle rotten

@k8s-ci-robot k8s-ci-robot removed the lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. label Jan 9, 2019
@fejta-bot

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Apr 9, 2019
@fejta-bot

Stale issues rot after 30d of inactivity.
Mark the issue as fresh with /remove-lifecycle rotten.
Rotten issues close after an additional 30d of inactivity.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle rotten

@k8s-ci-robot k8s-ci-robot added lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. and removed lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. labels May 9, 2019
@fejta-bot

Rotten issues close after 30d of inactivity.
Reopen the issue with /reopen.
Mark the issue as fresh with /remove-lifecycle rotten.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/close

@k8s-ci-robot
Contributor

@fejta-bot: Closing this issue.

In response to this:

Rotten issues close after 30d of inactivity.
Reopen the issue with /reopen.
Mark the issue as fresh with /remove-lifecycle rotten.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/close

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

yaroslava-serdiuk pushed a commit to yaroslava-serdiuk/autoscaler that referenced this issue Feb 22, 2024
[makefile] Use the cached MPI module to get its CRDs for integration tests