
Unregistered nodes are not removed. #803

Closed
tcolgate opened this issue Apr 17, 2018 · 22 comments
Labels: area/cluster-autoscaler, lifecycle/rotten

Comments

@tcolgate

The logic for handling unregistered nodes seems off. If we have an unregistered node, we clearly aren't going to be able to schedule workloads on it. If the cluster size is already at minimum capacity, then removing an unregistered node fails:

I0417 10:50:13.797723       1 static_autoscaler.go:162] 1 unregistered nodes present
I0417 10:50:13.797744       1 utils.go:324] Removing unregistered node aws:https:///eu-west-1c/i-XXXXXXX
W0417 10:50:13.835734       1 utils.go:340] Failed to remove node aws:https:///eu-west-1c/i-XXXXXX: node group min size reached, skipping unregistered node removal

Under these conditions we can't remove nodes that have failed to boot. Surely it would make more sense to forcibly terminate the node?

On a related note, a metric for unregistered nodes would be useful, since the CA is the only thing that has a view of the cloud provider's list of nodes, and noticing this is quite hard without it.
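
For illustration, a minimal sketch of what such a metric could look like using the Prometheus client library the CA already exposes metrics through; the metric name and helper function are hypothetical, not an existing part of the codebase:

```go
package metrics

import "github.com/prometheus/client_golang/prometheus"

// unregisteredNodesCount is a hypothetical gauge: how many nodes the cloud
// provider reports that have not registered with the Kubernetes API server.
var unregisteredNodesCount = prometheus.NewGauge(prometheus.GaugeOpts{
	Namespace: "cluster_autoscaler",
	Name:      "unregistered_nodes_count",
	Help:      "Number of nodes known to the cloud provider but not registered in Kubernetes.",
})

func init() {
	prometheus.MustRegister(unregisteredNodesCount)
}

// UpdateUnregisteredNodesCount would be called once per autoscaler loop with
// the current number of unregistered nodes.
func UpdateUnregisteredNodesCount(count int) {
	unregisteredNodesCount.Set(float64(count))
}
```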

Happy to contribute if you agree. Was considering adding a force option to the provider node deletions.

@aleksandra-malinowska
Contributor

This logic was introduced to mitigate #369. Allowing force deletion of nodes was one of the options considered. It would require extending the cloud provider interface and implementing some kind of back-off (to avoid repeated errors on deletion attempts). It would also mean breaking the guarantees about respecting minimum and maximum node group limits. @mwielgus @MaciekPytel should we reopen this discussion?
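
As an aside, the back-off mentioned here could be as simple as the per-node sketch below (purely illustrative; the names and constants are not from the codebase): after each failed deletion attempt, the wait before the next attempt doubles, up to a cap.

```go
package backoff

import "time"

const (
	initialDelay = 1 * time.Minute
	maxDelay     = 30 * time.Minute
)

// deletionBackoff tracks, per node ID, when the next deletion attempt may run.
type deletionBackoff struct {
	nextAttempt map[string]time.Time
	delay       map[string]time.Duration
}

func newDeletionBackoff() *deletionBackoff {
	return &deletionBackoff{nextAttempt: map[string]time.Time{}, delay: map[string]time.Duration{}}
}

// CanAttempt reports whether a deletion of the given node may be tried now.
func (b *deletionBackoff) CanAttempt(id string, now time.Time) bool {
	return now.After(b.nextAttempt[id])
}

// RecordFailure doubles the wait after every failed attempt, capped at maxDelay.
func (b *deletionBackoff) RecordFailure(id string, now time.Time) {
	d := b.delay[id] * 2
	if d == 0 {
		d = initialDelay
	}
	if d > maxDelay {
		d = maxDelay
	}
	b.delay[id] = d
	b.nextAttempt[id] = now.Add(d)
}
```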

@MaciekPytel
Contributor

If I understand correctly the problem happens because the ASG won't allow CA to delete a node if it's at min size. However, CA manages min/max size constraints itself and it expects that nodes will be deleted if a call to the cloud provider is made. So from the perspective of core CA we consider any delete a force delete. I'd rather not differentiate between different kinds of deletion in core CA, as the details are very cloudprovider-specific.

It feels like the simplest solution would be to just leave managing of min/max boundaries to CA and always set ASG minimum size to 0.

@aleksandra-malinowska
Contributor

It feels like the simplest solution would be to just leave managing of min/max boundaries to CA and always set ASG minimum size to 0.

IIRC, we still wouldn't attempt to delete such node in this case: https://github.com/kubernetes/autoscaler/blob/master/cluster-autoscaler/core/utils.go#L346
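
For context, the check behind that link is roughly the following (a simplified paraphrase, not the actual source; the interface methods are those of cloudprovider.NodeGroup): a node that never registered is only deleted if its node group is currently above its configured minimum size.

```go
package core

import (
	apiv1 "k8s.io/api/core/v1"
	"k8s.io/autoscaler/cluster-autoscaler/cloudprovider"
	"k8s.io/klog/v2"
)

// removeUnregistered is a simplified paraphrase of the guard discussed above,
// not the real implementation.
func removeUnregistered(nodeGroup cloudprovider.NodeGroup, node *apiv1.Node) (bool, error) {
	size, err := nodeGroup.TargetSize()
	if err != nil {
		return false, err
	}
	if size <= nodeGroup.MinSize() {
		// This is the branch behind the "node group min size reached, skipping
		// unregistered node removal" warning quoted earlier in the thread.
		klog.Warningf("Failed to remove node %s: node group min size reached, skipping unregistered node removal", node.Name)
		return false, nil
	}
	return true, nodeGroup.DeleteNodes([]*apiv1.Node{node})
}
```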

@MaciekPytel
Contributor

You're right. For some reason (ENOCOFFEE) I thought we would delete the node.

I think it would make sense to remove the node, though it means we'd be explicitly breaking the min/max constraints specified by the user. We should at least document it somewhere if we make the change (maybe we should add a section about min/max to the FAQ, especially given that we already get a lot of questions about the existing behaviour).

@tcolgate
Author

FWIW, I'm not sure this breaches any promises about not going below the minimums, at least in the unregistered nodes case. The nodes being removed are not active registered kube nodes, so you aren't removing anything of any use.
I don't think setting the ASG min to 0 is a good approach: having a minimum on the ASG means that the min/max sizes are maintained even if the controller goes down. If you have a min of 0, a group could lose all its nodes and the group would not recover.
It's perfectly permissible to terminate an EC2 instance in an ASG even if that takes the ASG count below its minimum. The ASG simply scales things back up to meet the minimum. This is no different to manually terminating a node that may be sitting on a problematic VM instance with disk/network issues.
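
To make the AWS behaviour concrete, here is a minimal sketch using the AWS SDK for Go v1: TerminateInstanceInAutoScalingGroup with ShouldDecrementDesiredCapacity set to false removes the instance while leaving the desired capacity alone, so the ASG launches a replacement. The region and instance ID are placeholders.

```go
package main

import (
	"fmt"
	"log"

	"github.com/aws/aws-sdk-go/aws"
	"github.com/aws/aws-sdk-go/aws/session"
	"github.com/aws/aws-sdk-go/service/autoscaling"
)

func main() {
	sess := session.Must(session.NewSession(&aws.Config{Region: aws.String("eu-west-1")}))
	svc := autoscaling.New(sess)

	// Terminate one instance without decrementing desired capacity: the ASG
	// notices it is below its desired size and launches a replacement.
	out, err := svc.TerminateInstanceInAutoScalingGroup(&autoscaling.TerminateInstanceInAutoScalingGroupInput{
		InstanceId:                     aws.String("i-0123456789abcdef0"), // placeholder instance ID
		ShouldDecrementDesiredCapacity: aws.Bool(false),
	})
	if err != nil {
		log.Fatalf("terminate failed: %v", err)
	}
	fmt.Println(out.Activity)
}
```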

@aleksandra-malinowska
Contributor

aleksandra-malinowska commented Apr 18, 2018

FWIW, I'm not sure this breaches any promises about not going below the minimums, at least in the unregistered nodes case. The nodes being removed are not active registered kube nodes, so you aren't removing anything of any use.

From the user's point of view, they will notice a node group's count is below minimum. How it got there, and whether it was justified, will require investigating (checking events, logs etc.) I would prefer not to have to explain what exactly happened in such cases :)

It's perfectly permissible to terminate an EC2 instance in an ASG even if that takes the ASG count below its minimum. The ASG simply scales things back up to meet the minimum. This is no different to manually terminating a node that may be sitting on a problematic VM instance with disk/network issues.

Makes sense. This is not the case for all cloud providers, though. We can't rely on this behavior to bring the node back up outside AWS.

Thinking about it, what we would like to do in such a case isn't really removing the node, but replacing it with a healthy one. Some ways to achieve this:

  1. Force delete followed by create. Risks leaving the group below min size. Requires implementing some back-off so we don't end up crashlooping if we can't delete such a node. (On the other hand, we should probably have some back-off from deleting nodes when the group is above minimum, too.)
  2. Just create a new node in that node group. Then the node group's size will be above minimum and we'll be able to remove the old node in the next iteration. Risks "flapping" behavior (if new nodes in such a group don't register for some reason, we'll keep creating and deleting a node every ~15 minutes).
  3. Avoid putting this in Cluster Autoscaler logic. Instead, extend the cloud provider interface with a Restart() (or similar) method. Make sure we don't return with an error from the main function if this method fails. The exact implementation depends on the cloud provider (so it can take advantage of their specific features, like the one described above). Risks inconsistent behavior across cloud providers and conflicting with other node repair mechanisms they may have. (A rough sketch of such an interface extension follows this list.)
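
As a purely illustrative sketch of option 3 (none of these names exist in the real cloudprovider package), the extension could be an optional interface discovered via a type assertion, so providers that can't repair a node don't have to change at all:

```go
package repair

import (
	apiv1 "k8s.io/api/core/v1"
	"k8s.io/autoscaler/cluster-autoscaler/cloudprovider"
)

// RestartableNodeGroup is a hypothetical optional extension of NodeGroup:
// providers that can recreate/restart a broken VM implement it, others don't.
type RestartableNodeGroup interface {
	cloudprovider.NodeGroup

	// Restart asks the cloud provider to recreate or restart the given
	// unregistered node without changing the group's target size.
	Restart(node *apiv1.Node) error
}

// tryRestart shows how core code could use the extension without requiring it:
// the type assertion simply fails for providers that don't implement it.
func tryRestart(ng cloudprovider.NodeGroup, node *apiv1.Node) error {
	if r, ok := ng.(RestartableNodeGroup); ok {
		return r.Restart(node)
	}
	return nil // no repair support; the caller logs and moves on
}
```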

@tcolgate
Author

Points 1 and 2 make it sound like you really don't want to use the native cloud provider's ASGs at all. If you manually remove and recreate nodes you may end up racing with the cloud provider's actual ASG logic (what happens if it creates a new node while you are recreating one?). If you adjust the ASG size before you do it, you might as well do away with the provider-side ASG completely. On balance, though, that does not seem sensible.

On point 2, what to do if creating a new node takes you over the max? It doesn't seem completely crazy that a user might create a node group with min == max (e.g. creating a single spot instance node in a specific AZ). As things stand, if min == max, then you won't remove unregistered nodes, and wouldn't be able to add one. The user has no actual registered kube nodes, but CA isn't going to take any action to rectify the situation.

@aleksandra-malinowska
Contributor

On point 2, what to do if creating a new node takes you over the max? It doesn't seem completely crazy that a user might create a node group with min == max (e.g. creating a single spot instance node in a specific AZ). As things stand, if min == max, then you won't remove unregistered nodes, and wouldn't be able to add one. The user has no actual registered kube nodes, but CA isn't going to take any action to rectify the situation.

Fair enough, we probably can't assume min != max in an AWS setup with auto-discovery (which is what I assume you're referring to). In GKE, validation enforces max > min when enabling autoscaling for a node pool. Arguably, setting max == min means we shouldn't autoscale that group (or is there another way to express that?), so ignoring it sounds reasonable.

As for the case with no registered nodes, Cluster Autoscaler doesn't attempt to fix a cluster that it considers too unhealthy anyway.

Points 1 and 2 make it sound like you really don't want to use the native cloud provider's ASGs at all.

Not at all, that's what point 3 is about. It's hard to abstract over what different cloud providers offer, and we can't rely on their specific features in shared code.

Does it mean you'd rather go for 3, or are you proposing another solution?

@tcolgate
Author

tcolgate commented Apr 18, 2018 via email

@MaciekPytel
Contributor

I don't like the idea of extending the NodeGroup interface unless absolutely necessary. It's already very hard to implement a CA cloudprovider, especially for any sort of on-prem environment. Any extra functionality we require will only make that problem worse.

Points 1 and 2 make it sound like you really don't want to use the native cloud provider's ASGs at all.

I would say that's actually true to a large extent. The concept of a NodeGroup is a resizable set of identical nodes; an ASG just happens to be an implementation of that concept. We generally don't want logic specific to a given cloud provider in core (at least not as a long-term solution; we have been more liberal for short-term hacks). If anything I'd like to simplify the expectations we put on cloud providers as much as possible to enable on-prem / private cloud use cases (we even had requests to make it possible to implement a cloudprovider as a set of a few shell scripts that would provision a VM: #283 (comment)).

@tcolgate
Author

tcolgate commented May 8, 2018

I think it's worth considering what the behaviour is when the CA isn't running. If you set the min/max in the cloud provider's autoscaling group (at least on Amazon), the cloud provider will make sure the group size stays within the bounds. If you rely on CA to actually manage the group size (rather than just the bounds), then you may possibly end up with bootstrapping issues (e.g. multizone kops creates one ASG per master in each zone with min == max == 1; admittedly those aren't managed by CA).

@aleksandra-malinowska
Contributor

I don't like the idea of extending the NodeGroup interface unless absolutely necessary. It's already very hard to implement a CA cloudprovider, especially for any sort of on-prem environment. Any extra functionality we require will only make that problem worse.

This method would be optional, i.e., a stub implementation returning a "not implemented" error would result in the same behavior as we have now (log an error and go on).
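
A minimal sketch of that stub-and-skip pattern (the sentinel error, types, and names here are illustrative, not the actual cloudprovider API):

```go
package repair

import (
	"errors"

	apiv1 "k8s.io/api/core/v1"
	"k8s.io/klog/v2"
)

// errNotImplemented is an illustrative sentinel a provider's stub would return.
var errNotImplemented = errors.New("not implemented")

// noRepairNodeGroup is the stub case: a provider with no repair support.
type noRepairNodeGroup struct{}

func (noRepairNodeGroup) Restart(_ *apiv1.Node) error { return errNotImplemented }

// restartOrSkip shows the intended caller behaviour: log and carry on rather
// than failing the whole autoscaler iteration.
func restartOrSkip(ng interface{ Restart(*apiv1.Node) error }, node *apiv1.Node) {
	if err := ng.Restart(node); err != nil {
		klog.Warningf("restart of unregistered node %s skipped: %v", node.Name, err)
	}
}
```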

I think it's worth considering what the behaviour is when the CA isn't running. If you set the min/max in the cloud provider's autoscaling group (at least on Amazon), the cloud provider will make sure the group size stays within the bounds. If you rely on CA to actually manage the group size (rather than just the bounds), then you may possibly end up with bootstrapping issues (e.g. multizone kops creates one ASG per master in each zone with min == max == 1; admittedly those aren't managed by CA).

On GCP, there are no min/max limits in MIGs. Those can only be set for autoscaling (either in GCE/MIG autoscaler, or Cluster Autoscaler.) A MIG stays at a target size (restarting instances that fail health check etc.) unless a user (or an autoscaler) resizes it. Explicitly requesting instance deletion brings the target size down.

IIUC, on AWS the limits work more like a self-imposed quota: they can be set even if autoscaling is disabled, and a user's request to resize the group outside those limits will fail?

@tcolgate
Author

re:
From the user's point of view, they will notice a node group's count is below minimum. How it got there, and whether it was justified, will require investigating (checking events, logs etc.) I would prefer not to have to explain what exactly happened in such cases :)

In practical terms this is quite a non-trivial thing to notice. In practice, to spot and fix this I end up finding the unregistered nodes listed in the CA logs and restarting them manually. It also means I am perpetually nervous about the number of nodes I'm actually running vs. the number usefully contributing to the cluster (wasted money, which keeps me up at night). Some additional metrics might help, but ultimately it still seems wrong that, as things stand, we could have a node group with a min size of 20 which actually has 0 registered/useful running nodes, and CA can't fix it. In that situation I don't think promises around not going below minSize are actually useful promises.

@fejta-bot

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Sep 11, 2018
@aleksandra-malinowska
Contributor

/remove-lifecycle stale

@k8s-ci-robot k8s-ci-robot removed the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Sep 11, 2018
@fejta-bot

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Dec 10, 2018
@fejta-bot

Stale issues rot after 30d of inactivity.
Mark the issue as fresh with /remove-lifecycle rotten.
Rotten issues close after an additional 30d of inactivity.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle rotten

@k8s-ci-robot k8s-ci-robot added lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. and removed lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. labels Jan 9, 2019
@aleksandra-malinowska
Contributor

/remove-lifecycle rotten

@k8s-ci-robot k8s-ci-robot removed the lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. label Jan 9, 2019
@fejta-bot

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Apr 9, 2019
@fejta-bot

Stale issues rot after 30d of inactivity.
Mark the issue as fresh with /remove-lifecycle rotten.
Rotten issues close after an additional 30d of inactivity.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle rotten

@k8s-ci-robot k8s-ci-robot added lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. and removed lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. labels May 9, 2019
@fejta-bot

Rotten issues close after 30d of inactivity.
Reopen the issue with /reopen.
Mark the issue as fresh with /remove-lifecycle rotten.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/close

@k8s-ci-robot
Contributor

@fejta-bot: Closing this issue.

In response to this:

Rotten issues close after 30d of inactivity.
Reopen the issue with /reopen.
Mark the issue as fresh with /remove-lifecycle rotten.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/close

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

yaroslava-serdiuk pushed a commit to yaroslava-serdiuk/autoscaler that referenced this issue Feb 22, 2024
[makefile] Use the cached MPI module to get its CRDs for integration tests