This repository has been archived by the owner on Oct 7, 2020. It is now read-only.

Proposal to increase default replica counts when default PDB is in use #212

Closed

Conversation

@Lysholm (Contributor) commented May 28, 2019

This is a proposal for "fixing": istio/istio#12602

I based the logic on what I found in gateways/istio-egress.

This is just a proposal; another option would be to set defaultPodDisruptionBudget.enabled = false in global.yaml, or to update the documentation to make clear that default installs must be scaled up before node maintenance is possible.
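For reference, a minimal sketch of that second option as a Helm values override; the key path mirrors the global.defaultPodDisruptionBudget.enabled flag discussed in this thread, while the file name and surrounding layout are only illustrative:

# values-no-pdb.yaml (illustrative override file, passed with -f or the equivalent --set)
global:
  defaultPodDisruptionBudget:
    enabled: false   # no PDBs are rendered, so node drains are never blocked by Istio singletons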

@howardjohn (Member) left a comment

I am not sure why the ones with HPA need changing. They can already scale up automatically with this?

I always thought the replicas: 1 was not a hard limit. For instance, if we changed the deployment, a new instance would be scaled up before the old one was brought down, giving it 2 replicas for a bit? Is this not the case? The PDB says MIN of 1, not max of 1? Am I missing something?

autoscaleEnabled: true
autoscaleMin: 1
Member

Doesn't seem like this needs to be commented out or changed to 2. This affects autoscaling, so it can be set to 1 and scale up to 2 fine.


I wouldn't expect the HPA and PDB logic to play too well together... The former scales on load, the latter prevents availability from dropping to zero. If there is no load, I'd expect the HPA to scale down to autoscaleMin; if there is only one replica, the PDB prevents node drain (to keep that one instance).
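To make that interaction concrete, here is an illustrative pair of objects (names and numbers are hypothetical, not taken from the chart): an HPA allowed to sit at one replica next to a PDB requiring one available pod. With a single replica, any voluntary eviction would violate the budget, so kubectl drain blocks waiting for the eviction to be allowed:

apiVersion: autoscaling/v1
kind: HorizontalPodAutoscaler
metadata:
  name: example-component          # hypothetical names throughout
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: example-component
  minReplicas: 1                   # HPA may idle at a single pod when there is no load
  maxReplicas: 5
  targetCPUUtilizationPercentage: 80
---
apiVersion: policy/v1beta1
kind: PodDisruptionBudget
metadata:
  name: example-component
spec:
  minAvailable: 1                  # with only one replica, no voluntary eviction is permitted
  selector:
    matchLabels:
      app: example-component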


Another question... Is the default supposed to be highly available? @costinm?

If so, we'd need an autoscaleMin of 3. Otherwise we would only account for planned downtime like node maintenance, not for a failure during planned downtime.

Member

Kubernetes won't scale up for the node drain? I am a bit suspicious of that, with a caveat that I have never tried it or seen docs on it, so it is mostly just a guess.

Setting our default replicas to 2 is not a trivial change for an issue that may or may not exist for other users. At the least, can you provide some references that this is the right fix for this issue?


I didn't try either. My expectation of Kubernetes is simply that HPA and PDB aren't coupled.

Contributor Author

I can confirm it hadn't drained the HPA-scaled pods after 2 hours on GKE 1.12.7 and GKE 1.13.6. I may have misconfigured something, but it was a pretty bog-standard clean cluster created for the test, ditto the Istio install.

I may have had a pod without HPA on the node as well, which may have prevented some scheduling logic; I will retest with only HPA.

Contributor Author

@howardjohn: After scanning through the PRs and issues related to introducing the PDB, I'm less sure what the correct solution is here. They were introduced to alleviate downtime during upgrades, and that's a great use case. But I think it was a mistake to enable them by default.

Disruption budgets are, to me, conceptually something you layer on top of HA when disruptions during planned maintenance windows also need to be mitigated. As it's set up today it might give users a false impression. I can't speak for anyone else, but I think it's common to test involuntary disruption by simulating it through voluntary disruptions.

E.g. in the original issue that led to the default disruption budgets, istio/istio#9951:
it seems this user had only one replica running, and it was critical for them not to interrupt Istio during maintenance. I'd wager it's more critical for them not to have it interrupted outside maintenance, but that possibility may have been obscured for them.

The other option, HA by default, is not something I'd argue for. In addition to the 3+ replicas @loewenstein mentioned, you'd also expect anti-affinity in there, and that's rather hard to configure well given all possible cluster topologies.

Member

Yes, the default PDB will prevent k8s node drain if the replica count of the selected pods is 1, so we need to be careful with the HPA and PDB combination. There was an issue for cert-manager: the default replica count is 1 while minAvailable of the PDB is also 1, which would prevent node upgrades/maintenance...

Maybe we need clear docs (at least inline comments) for end users about this.
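A rough sketch of what such an inline comment in values.yaml could look like (the wording is illustrative, not taken from the chart):

global:
  defaultPodDisruptionBudget:
    # When enabled, a PDB with minAvailable: 1 is created for each component.
    # If a component runs with a single replica, that PDB blocks voluntary
    # evictions (kubectl drain, cluster-autoscaler scale-down, node upgrades)
    # until the component is scaled to 2+ replicas or this flag is disabled.
    enabled: true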

@@ -1,6 +1,6 @@
galley:
image: galley
replicaCount: 1
# replicaCount: 2
Member

Can galley safely run with multiple replicas? @costinm

Contributor

Have you tested this?

Contributor Author

@ozevren: I'm running 3 replicas in production and nothing has critically failed, at least. Degradation or edge cases below my "detection threshold" might still be in there. Anything you want me to look for specifically?

Contributor

I did test galley with multiple replicas.

Contributor

But we need to confirm the stability cluster runs multiple replicas of everything - I tested manually long ago.

@linsun (Member) commented May 29, 2019

If our default profile is for production usage, it does make sense to have 2 replicas for each control plane component. Who would run with 1 replica of each Istio control plane component in prod? (I'm not sure if Citadel can handle 2 replicas in 1.2; it can't in 1.1.)

@loewenstein

@linsun as I pointed out earlier, for production use we probably want a minimum of 3 replicas and a PDB.

Otherwise the Istio control plane wouldn't be HA during node maintenance.

@linsun (Member) commented May 29, 2019

The Istio control plane isn't in the user's request flow (except Mixer policy), so it should be okay if it is down during maintenance; I think 1 replica is a reasonable starting point during upgrade.

@Lysholm (Contributor Author) commented May 29, 2019

I think this needs some more thought. These components shouldn't all be treated as equal, as @linsun said.

The best example is cert-manager: it takes 10-15 seconds to start up and 1-5 minutes to process cert requests with Let's Encrypt, and up to days with other issuers. Is the worst-case 15% extra latency for cert issuance something you'd want to guard against? Maybe, but we're talking about an edge case here. Probably not a candidate for a disruption budget or HA by default.

Mixer and anything connected to the Adapter API are on the other end of the spectrum. Would I accept logging dropping out during maintenance? Probably not.

Is it a goal that the default install should be ready for production out of the box? In that case there's more work to be done. There's nothing stopping k8s from scheduling all Mixer pods on the same node and in the same zone with the current config, so a single node failure can still take everything down, and with the popularity of spot/preemptible nodes that's actually not that unlikely.

@loewenstein

@linsun well, while Pilot is down any freshly scheduled pod would be blocked on Envoy waiting for config, wouldn't it?

Maybe that's an edge case, as multiple things have to go wrong to trigger failure, but node maintenance is exactly one of those scenarios where a couple of Pod reschedules are pretty likely...

@howardjohn (Member)

@linsun well, while Pilot is down any freshly scheduled pod would be blocked on Envoy waiting for config, wouldn't it?

Maybe that's an edge case, as multiple things have to go wrong to trigger failure, but node maintenance is exactly one of those scenarios where a couple of Pod reschedules are pretty likely...

You also need the sidecar injector (and maybe citadel?) for new pods, as an extra data point

@loewenstein

@howardjohn @linsun Pilot was an example. Sure we need injector and probably Gateways. While I do think it is unfortunate that Citadel is currently a singleton, it is probably the least problematic with regard to HA - not sure if SDS changes that...

@Lysholm (Contributor Author) commented May 29, 2019

What about adding new global options like this:

global.HA = {
  controlplane: {
    enabled: false,
    podAntiAffinity: {
      preferredTopologyKeys: [],
      requiredTopologyKeys: [],
    }
  },
  mesh: {
    enabled: true,
    podAntiAffinity: {
      preferredTopologyKeys: ["failure-domain.beta.kubernetes.io/zone"],
      requiredTopologyKeys: ["kubernetes.io/hostname"]
    }
  }
}

Enabling HA for mesh enables PDBs and replicas on components that affect traffic in the mesh, while controlplane enables them on components required to make changes to the mesh.
The podAntiAffinity could be naively configured to do the best it can without knowing the cluster topology:

podAntiAffinity:
  preferredDuringSchedulingIgnoredDuringExecution:
  - weight: 100
    podAffinityTerm:
      labelSelector:
        matchExpressions:
        - key: app
          operator: In
          values:
          - istio-ingress
      topologyKey: "failure-domain.beta.kubernetes.io/zone"
  requiredDuringSchedulingIgnoredDuringExecution:
  - labelSelector:
      matchExpressions:
      - key: app
        operator: In
        values:
        - istio-ingress
    topologyKey: "kubernetes.io/hostname"

It still doesn't give regional HA: if zone-1 goes down while maintenance in zone-2 has all of its nodes down, all istio-ingress pods will be rescheduled into zone-3. It's the best I can come up with without requiring user config. This would be a large change, since anti-affinity like this increases scheduling load.

@rvandegrift commented May 30, 2019

The underlying issue can prevent the cluster autoscaler from removing underutilized nodes. I'm running Istio 1.0 from the Helm chart. The singleton Istio control plane pods happen to be on underutilized nodes, but CA cannot evict them due to the PDB, so the node pool will never scale down.

This is an unfortunate default, and has been very difficult to track down. I'd be very happy to have the option to require at least 2 replicas when a PDB could prevent autoscaling. Alternatively, a flag to disable the PDBs could work too.

IMO, this is a separate issue from enabling solid HA deployments from the helm charts.

@Lysholm (Contributor Author) commented May 31, 2019

@rvandegrift: Such a flag exists already: global.defaultPodDisruptionBudget.enabled = false. The only other change that could be done for 1.1 or 1.0 is to flag this as a known issue in the documentation. Any change of default behavior here would be a breaking change.

@rvandegrift

@rvandegrift: Such a flag exists already: global.defaultPodDisruptionBudget.enabled = false.

Oh thanks - this does exist in 1.1, but not 1.0 so I hadn't noticed. That's a pretty good reason for us to upgrade.

The only other change that could be done for 1.1 or 1.0 is to flag this as a known issue in the documentation. Any change of default behavior here would be a breaking change.

Fair enough, though I'd think there's still value in fixing a future release. I haven't gone through the history to understand the current defaults, but I wonder how many folks would choose the existing situation over defaults with a working cluster autoscaler.

@@ -10,7 +10,15 @@ metadata:
release: {{ .Release.Name }}
spec:
maxReplicas: {{ $gateway.autoscaleMax }}
{{- if $gateway.autoscaleMin }}
@morvencao (Member) commented Jun 1, 2019

Still an issue here: if autoscaleMin is 1 and the actual replica count is 1, the default PDB (with minAvailable=1) still prevents node drain.

Contributor

Can we just document that if the pod disruption budget is set, min must be set to at least 2?
It seems like a lot of complexity - normal k8s configs just work.

Keep in mind we want to start using kustomize as well - where we don't have all the template magic.
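A short sketch of the combination that such documentation could recommend, reusing the autoscaleEnabled/autoscaleMin value names seen in this PR and the global PDB flag; the component block shown is just an example:

gateways:
  istio-ingressgateway:      # example component; the same rule applies to the others
    autoscaleEnabled: true
    autoscaleMin: 2          # keep >= 2 so a drain can evict one pod without violating minAvailable: 1
    autoscaleMax: 5

global:
  defaultPodDisruptionBudget:
    enabled: true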

@morvencao (Member)

@rvandegrift PDBs were introduced in Istio v1.1; there is a flag to disable them, but the spec (minAvailable: 1) can't be customized: https://github.com/istio/istio/blob/release-1.1/install/kubernetes/helm/istio/values.yaml#L370-L375

@costinm (Contributor) commented Jun 27, 2019

Sorry, post-vacation and a long thread: what is the plan for this PR?

I believe we MUST have both a 'demo' profile with low resources and 2-3 easy 'prod' installs with reasonable defaults. And at least one of the profiles needs to have a proper 2-3 replicas for each component, and be tested in the stability cluster.

@costinm (Contributor) left a comment

Let's sync up at the next env meeting - either close or merge.

I'm trying to avoid PRs stuck for >2 weeks. If we can't find agreement or there are doubts, we can close and reopen when we have agreement.

@Lysholm (Contributor Author) commented Jul 9, 2019

@costinm: That is fair. Closing this; this is not how I would do it or want Istio set up. My current Istio setup scripts set global.defaultPodDisruptionBudget.enabled = false and apply the following PDBs afterwards:

apiVersion: policy/v1beta1
kind: PodDisruptionBudget
metadata:
  labels:
    app: istio-egressgateway
    istio: egressgateway
    release: istio
  name: istio-egressgateway
  namespace: istio-system
spec:
  maxUnavailable: 35%
  selector:
    matchLabels:
      app: istio-egressgateway
      istio: egressgateway
      release: istio
---
apiVersion: policy/v1beta1
kind: PodDisruptionBudget
metadata:
  labels:
    app: istio-ingressgateway
    istio: ingressgateway
    release: istio
  name: istio-ingressgateway
  namespace: istio-system
spec:
  maxUnavailable: 35%
  selector:
    matchLabels:
      app: istio-ingressgateway
      istio: ingressgateway
      release: istio
---
apiVersion: policy/v1beta1
kind: PodDisruptionBudget
metadata:
  labels:
    app: policy
    istio: mixer
    istio-mixer-type: policy
    release: istio
  name: istio-policy
  namespace: istio-system
spec:
  maxUnavailable: 35%
  selector:
    matchLabels:
      app: policy
      istio: mixer
      istio-mixer-type: policy
      release: istio
---
apiVersion: policy/v1beta1
kind: PodDisruptionBudget
metadata:
  labels:
    app: telemetry
    istio: mixer
    istio-mixer-type: telemetry
    release: istio
  name: istio-telemetry
  namespace: istio-system
spec:
  maxUnavailable: 35%
  selector:
    matchLabels:
      app: telemetry
      istio: mixer
      istio-mixer-type: telemetry
      release: istio

Those four components are installed with the following helm values:

    autoscaleMin: 3
    podAntiAffinityLabelSelector:
    - key: app
      operator: In
      values: $ISTIO-COMPONENT
      topologyKey: "kubernetes.io/hostname"
    podAntiAffinityTermLabelSelector:
    - key: app
      operator: In
      values: $ISTIO-COMPONENT
      topologyKey: "failure-domain.beta.kubernetes.io/zone"

That gets me HA on all components in the request path while conserving resources on the control plane components and, most importantly, no interference with node maintenance.
