Unregistered node can't be removed when min size of group is reached #369

Closed
aleksandra-malinowska opened this issue Sep 25, 2017 · 1 comment

aleksandra-malinowska commented Sep 25, 2017

In this failed test run, we encountered an edge case where a broken node couldn't be removed from the cluster because of the node group's minimum size restriction. This causes CA to fall into a loop of attempting to remove the node and failing. The impact of this bug is significantly reduced by the liveness probe (see below).

Scenario:

  1. a node group has size n and minSize <= n, and one of the nodes in this node group remains unregistered
  2. after 15 minutes of normal operation, CA attempts to remove the node
  3. if the node group is still at or below its minSize, CA repeats this attempt in every iteration (and keeps failing)
  4. within 10-20 minutes, the liveness probe fails due to the repeated failures and restarts CA
  5. go to 2.
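
The loop in steps 2-3 can be sketched roughly as follows. This is a simplified illustration, not the actual CA code: `nodeGroup`, `deleteNode`, `runIterations` and the error value are hypothetical names, and the only assumption is that deletion is refused while the group is at or below its minimum size.

```go
package main

import (
	"errors"
	"fmt"
)

// nodeGroup is a hypothetical, simplified stand-in for a cloud provider
// node group. It is not the real cloudprovider type.
type nodeGroup struct {
	minSize    int
	targetSize int
}

// errMinSizeReached is an illustrative error, not the real message.
var errMinSizeReached = errors.New("min size reached, node will not be deleted")

// deleteNode refuses to delete a node while the group is at or below minSize.
func (g *nodeGroup) deleteNode(name string) error {
	if g.targetSize <= g.minSize {
		return errMinSizeReached
	}
	g.targetSize--
	return nil
}

// runIterations simulates up to n main-loop iterations that each try to
// remove the same unregistered node. It returns the number of failed attempts.
func runIterations(g *nodeGroup, node string, n int) int {
	failures := 0
	for i := 0; i < n; i++ {
		if err := g.deleteNode(node); err != nil {
			failures++ // nothing changes between iterations, so the next one fails too
			continue
		}
		break // node removed, loop ends
	}
	return failures
}

func main() {
	atMin := &nodeGroup{minSize: 3, targetSize: 3}
	fmt.Println("failures at min size:", runIterations(atMin, "unregistered-node", 5))

	aboveMin := &nodeGroup{minSize: 3, targetSize: 4}
	fmt.Println("failures above min size:", runIterations(aboveMin, "unregistered-node", 5))
}
```

With the group at its minimum size every attempt fails, which is what eventually trips the liveness probe. Once the group is above its minimum, the first attempt succeeds.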

Impact:

  1. each restart by the liveness probe gives a 15-minute window in which the problem can be resolved if demand for a scale-up of the affected node group occurs (once the affected node group's size is increased, it should be possible to remove the unregistered node)
  2. however, if cluster activity only causes other node groups to be scaled up/down, the result is that CA works approximately 50% of the time (15 minutes on, ~15 minutes off)
  3. in e2e tests, it causes all the remaining scenarios to fail
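
As a rough check of the "approximately 50%" figure, the working fraction can be computed from the numbers above: 15 minutes of normal operation after each restart, followed by 10-20 minutes of failing iterations before the liveness probe restarts CA. `workingFraction` is just an illustrative helper, and it ignores the restart time itself.

```go
package main

import "fmt"

// workingFraction returns the fraction of time CA is functional, given the
// minutes of normal operation after a restart and the minutes spent in
// failing iterations before the liveness probe restarts it.
func workingFraction(okMinutes, failingMinutes float64) float64 {
	return okMinutes / (okMinutes + failingMinutes)
}

func main() {
	// 10 and 20 are the bounds from the scenario, 15 is the midpoint.
	for _, failing := range []float64{10, 15, 20} {
		fmt.Printf("failing for %.0f min: working %.0f%% of the time\n",
			failing, 100*workingFraction(15, failing))
	}
}
```

Under these assumptions the working fraction lands between roughly 43% and 60%, which is consistent with the estimate above.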

Proposed solutions:

  1. implement a backoff when attempting to remove an unregistered node
  2. extend the cloudprovider interface to allow overriding the min size restriction on node deletion

cc @MaciekPytel

@mwielgus

After discussion and a fix attempt, it was agreed that the exact conditions needed to trigger this are rare enough that it is not a release blocker. It will be fixed in a patch release.
