fix: change HPA minReplicas to 2 to resolve PDB issues #25873

agilgur5 · 2020-07-26T22:56:12Z

Please provide a description for what this PR is for.

Description

PDBs for Istio services have a minAvailable of 1, and HPAs had a
minReplicas of 1 (as well as default replicaCount of 1), meaning that
the defaults would cause failures for things like kops rolling-update
bump minReplicas to 2 so that at least 1 pod can go down by default
- and bump autoScaleMin to 2 as well, which seems to correspond to
  HPA minReplicas as well
- and bump default replicaCount to 2 where it was previously 1
~~also bump maxReplicas to 10 to account for the doubling of minReplicas~~
- ~~and autoScaleMax respectively~~ EDIT: removed according to review

Review Notes

Hi there, first-time contributor here 👋 I was affected by #24000 when performing rolling-updates and saw it was marked as "good first issue", so thought I would try my hand at fixing it. Please let me know if any changes are necessary and offer any guidance 🙏

Things I wasn't totally sure of:

Per the description, I added autoScaleMin as I saw it was used in several places and minReplicas was set to it, but let me know if that's wrong.
~~I doubled maxReplicas / autoScaleMax to correspond to the doubled minimum, similar to some sample code, but if this should be left unchanged, can undo it.~~ EDIT removed according to review
I think replicaCount needs to be set to 2 in more places and perhaps in some places. I only changed the ones that were previously set for safety, but can add more. May need guidance on that.
Not really sure how to test this. Ideally would compare PDB minAvailable to default replicaCount and minReplicas, but those can be in lots of different places and I make no claims I understand how the multitudes of configs here work 😅 If someone could tell me where to look / guide me in the right direction, that would be helpful!

googlebot · 2020-07-26T22:56:18Z

Thanks for your pull request. It looks like this may be your first contribution to a Google open source project (if not, look below for help). Before we can look at your pull request, you'll need to sign a Contributor License Agreement (CLA).

📝 Please visit https://cla.developers.google.com/ to sign.

Once you've signed (or fixed any issues), please reply here with @googlebot I signed it! and we'll verify it.

What to do if you already signed the CLA

Individual signers

It's possible we don't have your GitHub username or you're using a different email address on your commit. Check your existing CLA data and verify that your email is set on your git commits.

Corporate signers

Your company has a Point of Contact who decides which employees are authorized to participate. Ask your POC to be added to the group of authorized contributors. If you don't know who your Point of Contact is, direct the Google project maintainer to go/cla#troubleshoot (Public version).
The email used to register you as an authorized contributor must be the email used for the Git commit. Check your existing CLA data and verify that your email is set on your git commits.
The email used to register you as an authorized contributor must also be attached to your GitHub account.

ℹ️ Googlers: Go here for more info.

istio-testing · 2020-07-26T22:56:27Z

Hi @agilgur5. Thanks for your PR.

I'm waiting for a istio member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

agilgur5 · 2020-07-26T23:01:19Z

@googlebot I signed it

googlebot · 2020-07-26T23:01:24Z

CLAs look good, thanks!

ℹ️ Googlers: Go here for more info.

shamsher31 · 2020-07-27T05:14:24Z

/ok-to-test

morvencao

Need to add validation logic to resolve issue, not only change the default value for PDB and HPA.
End users may configure incompliant minReplica value for PDB with HPA which prevents pods from restarting, blocking rescheduling, upgrade...

sdake · 2020-07-27T11:25:29Z

Please read through #12602 - there is a pretty good analysis of the problem.

sdake

As an initial PR, I think it is fine not to validate the defaults. More important is to fix this problem and backport this for 1.7.

I don't understand why the hpa spec replicas are increased to 2. Seems like it would be sufficient to keep hpa spec at 1/5, and then increase the replicas to 2.

manifests/profiles/default.yaml

sdake · 2020-07-27T11:47:18Z

@agilgur5 nice first PR! Thank you for the contribution. We have struggled and struggled a project with what to do about HPAs. The default for Istio is to install HPAs, therefore, the defaults should account for HPAs being installed. After reading the latest K8s docs on ratios used to determine an HPA, I am not clear this PR is correct, but you are on the right track.

Cheers,
-steve

howardjohn

I don't feel comfortable making the min replicas==1 for the default profile, as this will force all users, no matter how small of a cluster, to have double the resources used which is often a waste. Its also harder to debug, etc 2 replicas. I would rather have:

no PDB and have a "HA" profile with PDB+min replicas=2+
Kubernetes change to allow scaling up to meet the constraints of a PDB

sdake · 2020-07-27T15:26:04Z

I don't feel comfortable making the min replicas==1 for the default profile, as this will force all users, no matter how small of a cluster, to have double the resources used which is often a waste. Its also harder to debug, etc 2 replicas. I would rather have:

no PDB and have a "HA" profile with PDB+min replicas=2+

Kubernetes change to allow scaling up to meet the constraints of a PDB

Agree. It is a shame min replicas can not be zero unless alpha gates are turned on in K8s 1.18. That would be an easier answer than adding a profile. We do need to take care not to introduce profile proliferation. 👍

But turning the suggestion around, wouldn't it be better for the default profile to include HA, and a different profile to disable PDBs? We could add a notice that the default profile includes Xyz features, including HA capability to the profile - to be displayed by istioctl.

@esnible thoughts?

Cheers,
-steve

agilgur5 · 2020-07-27T16:30:15Z

We have struggled and struggled a project with what to do about HPAs. The default for Istio is to install HPAs, therefore, the defaults should account for HPAs being installed.

But turning the suggestion around, wouldn't it be better for the default profile to include HA, and a different profile to disable PDBs? We could add a notice that the default profile includes Xyz features, including HA capability

Throwing in my two cents, my response to both of these would be that I generally think of security, availability matters as things you have turned on by default -- best practices for production use turned on by default. One could also describe that as "making the hard things easy", there's then little effort in achieving the best practice. Then if there is a need, have a very explicit opt-out capability for PoCs or smaller clusters etc. An explicit opt-out means one has to actively acknowledge the ramifications of that choice.
Making security or HA opt-in allows "misconfigurations" and shortcuts to proliferate in my experience. It requires effort (and knowledge of their existence too) to use best practices and therefore it can often be forgotten or ignored.

Perhaps worthwhile to note, the decision also impacts the way one thinks of things, either as maintainer or as user. In opt-out, HA is the default and "forefront" with non-HA the "exception", whereas opt-in, HA is the "exception" and non-HA is forefront.

We do need to take care not to introduce profile proliferation. 👍

I think this is relevant too, as the current defaults already include a PDB together with a misconfigured default HPA (that this PR attempts to fix). Also, there is the existing minimal profile that I would think could be used for a non-HA set-up as well.

Y'all obviously understand Istio significantly better than me, but thought I'd throw in my two cents too as a k8s admin and user (as well as a library author and contributor myself). I occasionally find myself searching for HA and secure configuration in various charts / components and finding them missing because the defaults lack them -- adding them is considerable work and even if they exist they may not be as well-tested as the defaults. But best practices shouldn't be after thoughts (not that the Istio team treats them as such, just saying generally), especially for components that have wide impact on the cluster.

agilgur5 · 2020-07-27T16:40:52Z

Need to add validation logic to resolve issue, not only change the default value for PDB and HPA.

@morvencao would be happy to try adding that (perhaps in a separate PR so as to not block this) if you could point me in the right direction of where that might go. Per my review notes, I don't really know where to look for something like that or tests.

I don't understand why the hpa spec replicas are increased to 2. Seems like it would be sufficient to keep hpa spec at 1/5, and then increase the replicas to 2.

@sdake The issue(s) pointed to changing this to 2, but I do think it is necessary myself as well. If only 1 replica is available (whether that's set by HPA or the default replicaCount isn't relevant as I understand), the PDB will result in an error response to an eviction request. The HPA only scales up with response to utilization, not a pending eviction. But correct me if I'm wrong, I'm certainly not an expert in this.

howardjohn · 2020-07-27T16:49:20Z

I opened kubernetes/kubernetes#93476 about this on k8s side. I don't think we should just wait on k8s, but I am pretty convinced this is poor (but expected) behavior on k8s side.

…

On Mon, Jul 27, 2020 at 9:41 AM Anton Gilgur ***@***.***> wrote: Need to add validation logic to resolve issue, not only change the default value for PDB and HPA. @morvencao <https://github.com/morvencao> would be happy to try adding that (perhaps in a separate PR so as to not block this) if you could point me in the right direction of where that might go. Per my review notes, I don't really know where to look for something like that or tests. I don't understand why the hpa spec replicas are increased to 2. Seems like it would be sufficient to keep hpa spec at 1/5, and then increase the replicas to 2. The issue(s) pointed to changing this to 2, but I do think it is necessary myself as well. If only 1 replica is available (whether that's set by HPA or the default replicaCount isn't relevant as I understand), the PDB will result in an error response to an eviction request. The HPA only scales up with response to utilization, not a pending eviction. But correct me if I'm wrong, I'm certainly not an expert in this. — You are receiving this because your review was requested. Reply to this email directly, view it on GitHub <#25873 (comment)>, or unsubscribe <https://github.com/notifications/unsubscribe-auth/AAEYGXNHTDPT2HVQJUDMS6DR5WU2NANCNFSM4PIGMEOQ> .

agilgur5 · 2020-07-27T17:30:12Z

On the k8s UX side, I actually thought a rolling update strategy might apply to the PDB case too (that similarly means a user is ok with some level of scale up), but that's only for the resource's rolling update itself, not if the underlying nodes go through a rolling update. All of these relate to each other, but are independent without knowledge of each other on the k8s side I guess.

istio-testing · 2020-08-16T20:07:14Z

@agilgur5: PR needs rebase.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

rcernich · 2020-08-27T19:00:15Z

I wonder if a better solution is to configure maxSurge on the deployments.

sdake · 2020-09-27T05:32:36Z

I wonder if a better solution is to configure maxSurge on the deployments.

@rcernich Reading https://kubernetes.io/docs/concepts/workloads/controllers/deployment/#max-surge - I don't immediately understand your proposal. The default (according to that page) is 25%. This means the max surge during a rolling update with our defaults of 5 max pods = 6 pods. This should enable 1 extra pod for a rolling update. Not sure why we don't see this behavior automatically?

I still believe we have two choices:

Add HA to the default profile

or -

Add an HA profile that is essentially the default profile (perhaps default-ha)

I think default-ha seems like a reasonable approach, although as a project we do not want a HA profile version for every profile in the system.

An entirely different option is to have a conditional set of HA flags such as

istioctl --set high_availability=true or the like.

@ostromart @costinm @esnible @howardjohn can you comment on the above. I think we have punted on this problem for about three years, and its time to take a hard look at solving it.

Cheers,
-steve

howardjohn · 2020-09-28T16:06:41Z

I wonder if a better solution is to configure maxSurge on the deployments.

I 100% agree but I don't think k8s does this: kubernetes/kubernetes#93476. If the issue is invalid and we can do it, please let us know, that would be perfect.

I don't see why we need a profile to configure a single flag. You can just set values.pilot.replicas=5 or whatever the option is?

- PDBs for Istio services have a minAvailable of 1, and HPAs had a minReplicas of 1 (as well as default replicaCount of 1), meaning that the defaults would cause failures for things like kops rolling-update - bump minReplicas to 2 so that at least 1 pod can go down by default - and bump autoScaleMin to 2 as well, which seems to correspond to HPA minReplicas as well - and bump default replicaCount to 2 where it was previously 1

agilgur5 · 2020-10-08T23:07:28Z

I wonder if a better solution is to configure maxSurge on the deployments.

I 100% agree but I don't think k8s does this: kubernetes/kubernetes#93476. If the issue is invalid and we can do it, please let us know, that would be perfect.

Yes, per my previous comment (#25873 (comment)) it doesn't currently do this, maxSurge is strictly a deploy/rollingUpdate configuration, it does not affect the rolling of actual k8s nodes. That is perhaps out-of-scope with what maxSurge is meant to handle (deploys/updates to a service).

There does not seem to be much bite on the k8s issue filed kubernetes/kubernetes#93476. To me, that too seems to perhaps be out-of-scope; HPA is then scaling not for load as it is purposed, but to meet PDB requirements, which also couples the two resources together.

I don't see why we need a profile to configure a single flag. You can just set values.pilot.replicas=5 or whatever the option is?

Per the diff, it's for several services -- personally, I experienced the PDB error with istiod, istio-ingressgateway, and istiocoredns (also applies to istio-egressgateway). And one needs to configure both the replicaCount and the HPA minReplicas.

I still believe we have two choices:

Add HA to the default profile

or -

Add an HA profile that is essentially the default profile (perhaps default-ha)

Per my previous comment (#25873 (comment)), if the goal is to cater to people "trying Istio out", the existing minimal or demo profiles I believe already suffice and were built for that purpose. I'd also argue that's not the primary use-case by definition and that someone trying it out may want to see HA out-of-the-box, or at least not experience bugs such as this.
For "default production usage", I think HA makes sense to have out-of-the-box, as it's production. If one makes the argument that HA requires too much resources, one could take a slippery slope and say the same for security -- but Istio does have a variety of security features out-of-the-box. A lack of either can be considered a misconfiguration and can be exploited.

In any case, the defaults are still currently incorrect and a confusing issue that lots of users run into (including those trying it out).

I've fixed the one review comment about maxReplicas which I had asked for guidance on, but have otherwise not received any guidance on this in months. As such this has been deadlocked while the bug continues to exist.

istio-policy-bot · 2020-11-13T05:04:27Z

🚧 This issue or pull request has been closed due to not having had activity from an Istio team member since 2020-09-28. If you feel this issue or pull request deserves attention, please reopen the issue. Please see this wiki page for more information. Thank you for your contributions.

Created by the issue and PR lifecycle manager.

agilgur5 · 2020-11-13T18:45:44Z

I do think this still needs attention, but I can't re-open this PR without making a new one and haven't gotten responses 😕

agilgur5 · 2021-08-17T17:07:01Z

hit this bug again over a year after my original PR, now with Istio 1.10 with a different cluster at a different job... so this is still a bug... 😕

marshallford · 2023-06-30T19:03:13Z

I still have pilot.autoscaleMin set to 2 in my istiod helm chart values for Istio 1.17. I feel your pain @agilgur5.

agilgur5 requested a review from a team as a code owner July 26, 2020 22:56

googlebot added the cla: no Set by the Google CLA bot to indicate the author of a PR has not signed the Google CLA. label Jul 26, 2020

istio-testing added size/L Denotes a PR that changes 100-499 lines, ignoring generated files. needs-ok-to-test labels Jul 26, 2020

googlebot added cla: yes Set by the Google CLA bot to indicate the author of a PR has signed the Google CLA. and removed cla: no Set by the Google CLA bot to indicate the author of a PR has not signed the Google CLA. labels Jul 26, 2020

morvencao reviewed Jul 27, 2020

View reviewed changes

sdake reviewed Jul 27, 2020

View reviewed changes

manifests/profiles/default.yaml Outdated Show resolved Hide resolved

manifests/profiles/default.yaml Show resolved Hide resolved

sdake added the cherrypick/release-1.7 label Jul 27, 2020

howardjohn requested changes Jul 27, 2020

View reviewed changes

istio-testing added the needs-rebase Indicates a PR needs to be rebased before being merged label Aug 16, 2020

istio-policy-bot added the lifecycle/stale Indicates a PR or issue hasn't been manipulated by an Istio team member for a while label Aug 27, 2020

istio-policy-bot removed the lifecycle/stale Indicates a PR or issue hasn't been manipulated by an Istio team member for a while label Aug 27, 2020

agilgur5 mentioned this pull request Sep 7, 2020

Unable to drain k8s node, error due to Pod has more than one PDB, which the eviction subresource does not support #23075

Closed

istio-policy-bot added the lifecycle/stale Indicates a PR or issue hasn't been manipulated by an Istio team member for a while label Sep 27, 2020

istio-policy-bot removed the lifecycle/stale Indicates a PR or issue hasn't been manipulated by an Istio team member for a while label Sep 27, 2020

howardjohn mentioned this pull request Oct 8, 2020

Fix PDB and HPA related issues #27209

Closed

agilgur5 force-pushed the fix-minReplicas-hpa branch from 9fe5b31 to b06be67 Compare October 8, 2020 22:34

istio-policy-bot added the lifecycle/stale Indicates a PR or issue hasn't been manipulated by an Istio team member for a while label Oct 29, 2020

istio-policy-bot closed this Nov 13, 2020

istio-policy-bot added the lifecycle/automatically-closed Indicates a PR or issue that has been closed automatically. label Nov 13, 2020

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

fix: change HPA minReplicas to 2 to resolve PDB issues #25873

fix: change HPA minReplicas to 2 to resolve PDB issues #25873

agilgur5 commented Jul 26, 2020 •

edited

Loading

googlebot commented Jul 26, 2020

istio-testing commented Jul 26, 2020

agilgur5 commented Jul 26, 2020

googlebot commented Jul 26, 2020

shamsher31 commented Jul 27, 2020

morvencao left a comment

sdake commented Jul 27, 2020

sdake left a comment

sdake commented Jul 27, 2020

howardjohn left a comment

sdake commented Jul 27, 2020

agilgur5 commented Jul 27, 2020 •

edited

Loading

agilgur5 commented Jul 27, 2020 •

edited

Loading

howardjohn commented Jul 27, 2020 via email

agilgur5 commented Jul 27, 2020

istio-testing commented Aug 16, 2020

rcernich commented Aug 27, 2020

sdake commented Sep 27, 2020

howardjohn commented Sep 28, 2020

agilgur5 commented Oct 8, 2020 •

edited

Loading

istio-policy-bot commented Nov 13, 2020

agilgur5 commented Nov 13, 2020

agilgur5 commented Aug 17, 2021 •

edited

Loading

marshallford commented Jun 30, 2023

fix: change HPA minReplicas to 2 to resolve PDB issues #25873

fix: change HPA minReplicas to 2 to resolve PDB issues #25873

Conversation

agilgur5 commented Jul 26, 2020 • edited Loading

Description

Tags

Review Notes

googlebot commented Jul 26, 2020

What to do if you already signed the CLA

Individual signers

Corporate signers

istio-testing commented Jul 26, 2020

agilgur5 commented Jul 26, 2020

googlebot commented Jul 26, 2020

shamsher31 commented Jul 27, 2020

morvencao left a comment

Choose a reason for hiding this comment

sdake commented Jul 27, 2020

sdake left a comment

Choose a reason for hiding this comment

sdake commented Jul 27, 2020

howardjohn left a comment

Choose a reason for hiding this comment

sdake commented Jul 27, 2020

agilgur5 commented Jul 27, 2020 • edited Loading

agilgur5 commented Jul 27, 2020 • edited Loading

howardjohn commented Jul 27, 2020 via email

agilgur5 commented Jul 27, 2020

istio-testing commented Aug 16, 2020

rcernich commented Aug 27, 2020

sdake commented Sep 27, 2020

howardjohn commented Sep 28, 2020

agilgur5 commented Oct 8, 2020 • edited Loading

istio-policy-bot commented Nov 13, 2020

agilgur5 commented Nov 13, 2020

agilgur5 commented Aug 17, 2021 • edited Loading

marshallford commented Jun 30, 2023

agilgur5 commented Jul 26, 2020 •

edited

Loading

agilgur5 commented Jul 27, 2020 •

edited

Loading

agilgur5 commented Jul 27, 2020 •

edited

Loading

agilgur5 commented Oct 8, 2020 •

edited

Loading

agilgur5 commented Aug 17, 2021 •

edited

Loading