This repository has been archived by the owner on Oct 7, 2020. It is now read-only.

Proposal to increase default replica counts when default PDB is in use #212

Closed

Conversation

@Lysholm (Contributor) commented May 28, 2019

This is a proposal for "fixing": istio/istio#12602

I based the logic on what I found in gateways/istio-egress.

This is just a proposal; another option would be to set defaultPodDisruptionBudget.enabled = false in global.yaml, or to update the documentation to make clear that default installs must be scaled up before node maintenance is possible.
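For reference, a minimal sketch of that second option as a Helm values override; the key path mirrors the global.defaultPodDisruptionBudget.enabled flag discussed in this thread, while the file name and surrounding layout are only illustrative:

# values-no-pdb.yaml (illustrative override file, passed with -f or the equivalent --set)
global:
  defaultPodDisruptionBudget:
    enabled: false   # no PDBs are rendered, so node drains are never blocked by Istio singletons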

@howardjohn (Member) left a comment

I am not sure why the ones with HPA need changing. They can already scale up automatically with this?

I always thought the replicas: 1 was not a hard limit. For instance, if we changed the deployment, a new instance would be scaled up before the old one was brought down, giving it 2 replicas for a bit? Is this not the case? The PDB says MIN of 1, not max of 1? Am I missing something?

autoscaleEnabled: true
autoscaleMin: 1
Member

Doesn't seem like this needs to be commented out or changed to 2. This affects autoscaling, so it can be set to 1 and scale up to 2 fine.


I wouldn't expect the HPA and PDB logic to play too well together... The former scales on load, the latter prevents availability from dropping to zero. If there is no load, I'd expect the HPA to scale down to autoscaleMin; if there is only one replica, the PDB prevents node drain (to keep that one instance).
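To make that interaction concrete, here is an illustrative pair of objects (names and numbers are hypothetical, not taken from the chart): an HPA allowed to sit at one replica next to a PDB requiring one available pod. With a single replica, any voluntary eviction would violate the budget, so kubectl drain blocks waiting for the eviction to be allowed:

apiVersion: autoscaling/v1
kind: HorizontalPodAutoscaler
metadata:
  name: example-component          # hypothetical names throughout
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: example-component
  minReplicas: 1                   # HPA may idle at a single pod when there is no load
  maxReplicas: 5
  targetCPUUtilizationPercentage: 80
---
apiVersion: policy/v1beta1
kind: PodDisruptionBudget
metadata:
  name: example-component
spec:
  minAvailable: 1                  # with only one replica, no voluntary eviction is permitted
  selector:
    matchLabels:
      app: example-component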


Another question... Is the default supposed to be highly available? @costinm?

If so, we'd need an autoscaleMin of 3. Otherwise we would only account for planned downtime like node maintenance, not for a failure during planned downtime.

Member

Kubernetes won't scale up for the node drain? I am a bit suspicious of that, with a caveat that I have never tried it or seen docs on it, so it is mostly just a guess.

Setting our default replicas to 2 is not a trivial change for an issue that may or may not exist for other users. At the least, can you provide some references that this is the right fix for this issue?


I didn't try either. My expectation of Kubernetes is simply that HPA and PDB aren't coupled.

Contributor Author

I can confirm it hadn't drained the HPA-scaled pods after 2 hours on GKE 1.12.7 and GKE 1.13.6. I may have misconfigured something, but it was a pretty bog-standard clean cluster created for the test, ditto the Istio install.

I may have had a pod without HPA on the node as well, which may have prevented some scheduling logic; I will retest with only HPA.

Contributor Author

@howardjohn: After scanning through the PRs and issues related to introducing the PDB, I'm less sure what the correct solution is here. They were introduced to alleviate downtime during upgrades, and that's a great use case. But I think it was a mistake to enable them by default.

Disruption budgets are, to me, conceptually something you layer on top of HA when disruptions during planned maintenance windows also need to be mitigated. As it's set up today it might give users a false impression. I can't speak for anyone else, but I think it's common to test involuntary disruption by simulating it through voluntary disruptions.

E.g. in the original issue that led to the default disruption budgets, istio/istio#9951:
it seems this user had only one replica running, and it was critical for them not to interrupt Istio during maintenance. I'd wager it's more critical for them not to have it interrupted outside maintenance, but that possibility may have been obscured for them.

The other option, HA by default, is not something I'd argue for. In addition to the 3+ replicas @loewenstein mentioned, you'd also expect anti-affinity in there, and that's rather hard to configure well given all possible cluster topologies.

Member

Yes, the default PDB will prevent k8s node drain if the replica count of the selected pods is 1, so we need to be careful with the HPA and PDB combination. There was an issue for cert-manager: the default replica count is 1 while minAvailable of the PDB is also 1, which would prevent node upgrades/maintenance...

Maybe we need clear docs (at least inline comments) for end users about this.
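A rough sketch of what such an inline comment in values.yaml could look like (the wording is illustrative, not taken from the chart):

global:
  defaultPodDisruptionBudget:
    # When enabled, a PDB with minAvailable: 1 is created for each component.
    # If a component runs with a single replica, that PDB blocks voluntary
    # evictions (kubectl drain, cluster-autoscaler scale-down, node upgrades)
    # until the component is scaled to 2+ replicas or this flag is disabled.
    enabled: true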

@@ -1,6 +1,6 @@
galley:
image: galley
replicaCount: 1
# replicaCount: 2
Member

Can galley safely run with multiple replicas? @costinm

Contributor

Have you tested this?

Contributor Author

@ozevren: I'm running 3 replicas in production and nothing has critically failed, at least. Degradation or edge cases below my "detection threshold" might still be in there. Anything you want me to look for specifically?

Contributor

I did test galley with multiple replicas.

Contributor

But we need to confirm the stability cluster runs multiple replicas of everything - I tested manually long ago.

@linsun (Member) commented May 29, 2019

If our default profile is for production usage, it does make sense to have 2 replicas for each control plane component. Who would run with 1 replica of each Istio control plane component in prod? (I'm not sure if Citadel can handle 2 replicas in 1.2; it can't in 1.1.)

@loewenstein

@linsun as I pointed out earlier, for production use we probably want a minimum of 3 replicas and a PDB.

Otherwise the Istio control plane wouldn't be HA during node maintenance.

@linsun (Member) commented May 29, 2019

The Istio control plane isn't in the user's request flow (except Mixer policy), so it should be okay if it is down during maintenance; I think 1 replica is a reasonable starting point during upgrade.

@Lysholm (Contributor Author) commented May 29, 2019

I think this needs some more thought. These components shouldn't all be treated as equal, as @linsun said.

The best example is cert-manager: it takes 10-15 seconds to start up and 1-5 minutes to process cert requests with Let's Encrypt, and up to days with other issuers. Is the worst-case 15% extra latency for cert issuance something you'd want to guard against? Maybe, but we're talking about an edge case here. Probably not a candidate for a disruption budget or HA by default.

Mixer and anything connected to the Adapter API are on the other end of the spectrum. Would I accept logging dropping out during maintenance? Probably not.

Is it a goal that the default install should be ready for production out of the box? In that case there's more work to be done. There's nothing stopping k8s from scheduling all Mixer pods on the same node and in the same zone with the current config, so a single node failure can still take everything down, and with the popularity of spot/preemptible nodes that's actually not that unlikely.

@loewenstein

@linsun well, while Pilot is down any freshly scheduled pod would be blocked on Envoy waiting for config, wouldn't it?

Maybe that's an edge case, as multiple things have to go wrong to trigger failure, but node maintenance is exactly one of those scenarios where a couple of Pod reschedules are pretty likely...

@howardjohn (Member)

@linsun well, while Pilot is down any freshly scheduled pod would be blocked on Envoy waiting for config, wouldn't it?

Maybe that's an edge case, as multiple things have to go wrong to trigger failure, but node maintenance is exactly one of those scenarios where a couple of Pod reschedules are pretty likely...

You also need the sidecar injector (and maybe citadel?) for new pods, as an extra data point

@loewenstein

@howardjohn @linsun Pilot was an example. Sure we need injector and probably Gateways. While I do think it is unfortunate that Citadel is currently a singleton, it is probably the least problematic with regard to HA - not sure if SDS changes that...

@Lysholm (Contributor Author) commented May 29, 2019

What about adding new global options like this:

global.HA = {
  controlplane: {
    enabled: false,
    podAntiAffinity: {
      preferredTopologyKeys: [],
      requiredTopologyKeys: [],
    }
  },
  mesh: {
    enabled: true,
    podAntiAffinity: {
      preferredTopologyKeys: ["failure-domain.beta.kubernetes.io/zone"],
      requiredTopologyKeys: ["kubernetes.io/hostname"]
    }
  }
}

Enabling HA for mesh enables PDBs and replicas on components that affect traffic in the mesh, while controlplane enables them on components required to make changes to the mesh.
The podAntiAffinity could be naively configured to do the best it can without knowing the cluster topology:

podAntiAffinity:
  preferredDuringSchedulingIgnoredDuringExecution:
  - weight: 100
    podAffinityTerm:
      labelSelector:
        matchExpressions:
        - key: app
          operator: In
          values:
          - istio-ingress
      topologyKey: "failure-domain.beta.kubernetes.io/zone"
  requiredDuringSchedulingIgnoredDuringExecution:
  - labelSelector:
      matchExpressions:
      - key: app
        operator: In
        values:
        - istio-ingress
    topologyKey: "kubernetes.io/hostname"

It still doesn't give regional HA: if zone-1 goes down while maintenance in zone-2 has all of its nodes down, all istio-ingress pods will be rescheduled into zone-3. It's the best I can come up with without requiring user config. This would be a large change, since anti-affinity like this increases scheduling load.

@rvandegrift commented May 30, 2019

The underlying issue can prevent the cluster autoscaler from removing underutilized nodes. I'm running Istio 1.0 from the Helm chart. The singleton Istio control plane pods happen to be on underutilized nodes, but CA cannot evict them due to the PDB, so the node pool will never scale down.

This is an unfortunate default, and has been very difficult to track down. I'd be very happy to have the option to require at least 2 replicas when a PDB could prevent autoscaling. Alternatively, a flag to disable the PDBs could work too.

IMO, this is a separate issue from enabling solid HA deployments from the helm charts.

@Lysholm (Contributor Author) commented May 31, 2019

@rvandegrift: Such a flag exists already: global.defaultPodDisruptionBudget.enabled = false. The only other change that could be done for 1.1 or 1.0 is to flag this as a known issue in the documentation. Any change of default behavior here would be a breaking change.

@rvandegrift

@rvandegrift: Such a flag exists already: global.defaultPodDisruptionBudget.enabled = false.

Oh thanks - this does exist in 1.1, but not 1.0 so I hadn't noticed. That's a pretty good reason for us to upgrade.

The only other change that could be done for 1.1 or 1.0 is to flag this as a known issue in the documentation. Any change of default behavior here would be a breaking change.

Fair enough, though I'd think there's still value in fixing a future release. I haven't gone through the history to understand the current defaults, but I wonder how many folks would choose the existing situation over defaults with a working cluster autoscaler.

@@ -10,7 +10,15 @@ metadata:
release: {{ .Release.Name }}
spec:
maxReplicas: {{ $gateway.autoscaleMax }}
{{- if $gateway.autoscaleMin }}
@morvencao (Member) commented Jun 1, 2019

Still an issue here: if autoscaleMin is 1 and the actual replica count is 1, the default PDB (with minAvailable=1) still prevents node drain.

Contributor

Can we just document that if the pod disruption budget is set, min must be set to at least 2?
It seems like a lot of complexity - normal k8s configs just work.

Keep in mind we want to start using kustomize as well - where we don't have all the template magic.
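A short sketch of the combination that such documentation could recommend, reusing the autoscaleEnabled/autoscaleMin value names seen in this PR and the global PDB flag; the component block shown is just an example:

gateways:
  istio-ingressgateway:      # example component; the same rule applies to the others
    autoscaleEnabled: true
    autoscaleMin: 2          # keep >= 2 so a drain can evict one pod without violating minAvailable: 1
    autoscaleMax: 5

global:
  defaultPodDisruptionBudget:
    enabled: true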

@morvencao (Member)

@rvandegrift PDBs were introduced in Istio v1.1; there is a flag to disable them, but the spec (minAvailable: 1) can't be customized: https://github.com/istio/istio/blob/release-1.1/install/kubernetes/helm/istio/values.yaml#L370-L375

@costinm (Contributor) commented Jun 27, 2019

Sorry, post-vacation and a long thread: what is the plan for this PR?

I believe we MUST have both a 'demo' profile with low resources and 2-3 easy 'prod' installs with reasonable defaults. And at least one of the profiles needs to have a proper 2-3 replicas for each component, and be tested in the stability cluster.

@costinm (Contributor) left a comment

Let's sync up at the next env meeting - either close or merge.

I'm trying to avoid PRs stuck for >2 weeks. If we can't find agreement or there are doubts, we can close and reopen when we have agreement.

@Lysholm (Contributor Author) commented Jul 9, 2019

@costinm: That is fair. Closing this; this is not how I would do it or want Istio set up. My current Istio setup scripts set global.defaultPodDisruptionBudget.enabled = false and apply the following PDBs afterwards:

apiVersion: policy/v1beta1
kind: PodDisruptionBudget
metadata:
  labels:
    app: istio-egressgateway
    istio: egressgateway
    release: istio
  name: istio-egressgateway
  namespace: istio-system
spec:
  maxUnavailable: 35%
  selector:
    matchLabels:
      app: istio-egressgateway
      istio: egressgateway
      release: istio
---
apiVersion: policy/v1beta1
kind: PodDisruptionBudget
metadata:
  labels:
    app: istio-ingressgateway
    istio: ingressgateway
    release: istio
  name: istio-ingressgateway
  namespace: istio-system
spec:
  maxUnavailable: 35%
  selector:
    matchLabels:
      app: istio-ingressgateway
      istio: ingressgateway
      release: istio
---
apiVersion: policy/v1beta1
kind: PodDisruptionBudget
metadata:
  labels:
    app: policy
    istio: mixer
    istio-mixer-type: policy
    release: istio
  name: istio-policy
  namespace: istio-system
spec:
  maxUnavailable: 35%
  selector:
    matchLabels:
      app: policy
      istio: mixer
      istio-mixer-type: policy
      release: istio
---
apiVersion: policy/v1beta1
kind: PodDisruptionBudget
metadata:
  labels:
    app: telemetry
    istio: mixer
    istio-mixer-type: telemetry
    release: istio
  name: istio-telemetry
  namespace: istio-system
spec:
  maxUnavailable: 35%
  selector:
    matchLabels:
      app: telemetry
      istio: mixer
      istio-mixer-type: telemetry
      release: istio

Those four components are installed with the following helm values:

    autoscaleMin: 3
    podAntiAffinityLabelSelector:
    - key: app
      operator: In
      values: $ISTIO-COMPONENT
      topologyKey: "kubernetes.io/hostname"
    podAntiAffinityTermLabelSelector:
    - key: app
      operator: In
      values: $ISTIO-COMPONENT
      topologyKey: "failure-domain.beta.kubernetes.io/zone"

That gets me HA on all components in the request path while conserving resources on the control plane components and, most importantly, no interference with node maintenance.
