
Support zero downtime rollout restart of target deployments. #529

Open
dangarthwaite opened this issue Feb 14, 2023 · 1 comment
Labels
help wanted Extra attention is needed

Comments

@dangarthwaite

What happened?

Doing a rollout restart of the verify service results in a small window of downtime.

What did you expect to happen?

Rollout restarts of a targeted application should result in zero failed requests.

How'd it happen?

$ kubectl rollout restart -n ingress deploy/verify &&
  while sleep .25; do 
  curl -sv https://verify.example.com/healthcheck 2>&1 | grep -E '^< '; 
done
deployment.apps/verify restarted
< HTTP/2 302
< date: Mon, 13 Feb 2023 22:55:54 GMT
< content-type: text/html; charset=UTF-8
< content-length: 1411
< location: https://sso.example.com/.pomerium/sign_in?pomerium_expiry=1676329254&pomerium_idp_id=C3HESNcvnqS4eqcjUna9rVax8spGEWe2sC8F65GTt2ip&pomerium_issued=1676328954&pomerium_redirect_uri=https%3A%2F%2Fverify.example.com%2Fhealthcheck&pomerium_signature=3-t-KBAxkDVUDmG29m5yUifNAnatGs_qwvB5cmnl58w%3D
< x-pomerium-intercepted-response: true
< x-frame-options: SAMEORIGIN
< x-xss-protection: 1; mode=block
< server: envoy
< x-request-id: 6eda7b91-d88f-46e0-be2c-b265fcbebe88
<
< HTTP/2 404
< date: Mon, 13 Feb 2023 22:55:54 GMT
< content-type: text/html; charset=UTF-8
< content-length: 1414
< x-pomerium-intercepted-response: true
< x-frame-options: SAMEORIGIN
< x-xss-protection: 1; mode=block
< server: envoy
< x-request-id: b53bb012-9725-4174-b7a1-b566934fd3a4
<
< HTTP/2 302
< date: Mon, 13 Feb 2023 22:55:55 GMT
< content-type: text/html; charset=UTF-8
< content-length: 1411
< location: https://sso.example.com/.pomerium/sign_in?pomerium_expiry=1676329255&pomerium_idp_id=C3HESNcvnqS4eqcjUna9rVax8spGEWe2sC8F65GTt2ip&pomerium_issued=1676328955&pomerium_redirect_uri=https%3A%2F%2Fverify.example.com%2Fhealthcheck&pomerium_signature=NsZdyX7CDZJjKGpro7tebeQoGmrI2r53jZzSn_av2Dc%3D
< x-pomerium-intercepted-response: true
< x-frame-options: SAMEORIGIN
< x-xss-protection: 1; mode=block
< server: envoy
< x-request-id: 17ab805d-9fac-4eb5-8e3a-8608dc855d36

What's your environment like?

$ kubectl -n ingress get deploy/pomerium -o yaml | yq '.spec.template.spec.containers[0].image'
pomerium/ingress-controller:sha-cdc389c
$ kubectl get nodes -o yaml | yq '.items[-1] | .status.nodeInfo'
architecture: arm64
bootID: 247a7d1c-b579-4f89-b1b6-2b98883e4150
containerRuntimeVersion: docker://20.10.17
kernelVersion: 5.4.226-129.415.amzn2.aarch64
kubeProxyVersion: v1.21.14-eks-fb459a0
kubeletVersion: v1.21.14-eks-fb459a0
machineID: ec2d3bd9b9ea252972a242d7da68e233
operatingSystem: linux
osImage: Amazon Linux 2
systemUUID: ec2d3bd9-b9ea-2529-72a2-42d7da68e233

What's your config.yaml?

apiVersion: ingress.pomerium.io/v1
kind: Pomerium
metadata:
  name: global
spec:
  authenticate:
    url: https://sso.example.com
  certificates:
  - ingress/tls-wildcards
  identityProvider:
    provider: google
    secret: ingress/google-idp-creds
  secrets: ingress/bootstrap
status:
  ingress:
    ingress/verify:
      observedAt: "2023-02-13T22:55:54Z"
      observedGeneration: 2
      reconciled: true
    sandbox/example-ingress:
      observedAt: "2023-02-13T22:28:14Z"
      observedGeneration: 6
      reconciled: true
  settingsStatus:
    observedAt: "2023-02-10T15:33:57Z"
    observedGeneration: 5
    reconciled: true
    warnings:
    - 'storage: please specify a persistent storage backend, please see https://www.pomerium.com/docs/topics/data-storage#persistence'

What did you see in the logs?

{
  "level": "info",
  "service": "envoy",
  "upstream-cluster": "",
  "method": "GET",
  "authority": "verify.ops.bereal.me",
  "path": "/healthcheck",
  "user-agent": "curl/7.81.0",
  "referer": "",
  "forwarded-for": "71.254.0.45,10.123.60.7",
  "request-id": "fb2011be-f4d0-47bd-9094-8ad12f583009",
  "duration": 14.398539,
  "size": 1414,
  "response-code": 404,
  "response-code-details": "ext_authz_denied",
  "time": "2023-02-14T01:34:20Z",
  "message": "http-request"
}

@wasaga
Collaborator

wasaga commented Feb 14, 2023

Currently Pomerium uses the Service Endpoints object, which is only updated once Pods are terminated or new ones become Ready. That update takes a bit of time, which is the root cause of the downtime.
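
For reference, here is an illustrative core/v1 Endpoints object of the shape described above (the names, IPs, and port are hypothetical, not taken from this cluster). Pomerium routes to the IPs listed under addresses; a Pod that is terminating or not yet Ready only appears under notReadyAddresses, and that bookkeeping lags the actual Pod state slightly, which is the window where requests fail:

# Illustrative Endpoints object for the verify Service (hypothetical Pods/IPs)
apiVersion: v1
kind: Endpoints
metadata:
  name: verify
  namespace: ingress
subsets:
- addresses:
  - ip: 10.123.60.15          # hypothetical new Pod, already Ready
    targetRef:
      kind: Pod
      name: verify-7d9c6b9c8f-abcde
  notReadyAddresses:
  - ip: 10.123.60.7           # hypothetical old Pod, terminating
    targetRef:
      kind: Pod
      name: verify-5f6d8c7b4d-fghij
  ports:
  - name: http
    port: 8080
    protocol: TCP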

One current option to avoid the short downtime window is to use the Kubernetes service proxy instead; see https://www.pomerium.com/docs/deploying/k8s/ingress#service-proxy
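
A hedged sketch of that workaround, assuming the service_proxy_upstream annotation name from the linked docs (verify it against your controller version) and using hypothetical Service/Ingress names: when set, Pomerium forwards traffic to the Service's cluster IP via kube-proxy instead of the individual Pod endpoints, so kube-proxy handles endpoint churn during the rollout:

apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: verify
  namespace: ingress
  annotations:
    # assumption: annotation name as documented in the service-proxy docs linked above
    ingress.pomerium.io/service_proxy_upstream: "true"
spec:
  ingressClassName: pomerium
  rules:
  - host: verify.example.com
    http:
      paths:
      - path: /
        pathType: Prefix
        backend:
          service:
            name: verify
            port:
              name: http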

In the long term, we should probably start using the newer EndpointSlice object, which takes Pod conditions into consideration: https://kubernetes.io/docs/concepts/services-networking/endpoint-slices/#conditions
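
For comparison, an illustrative discovery.k8s.io/v1 EndpointSlice (hypothetical names and IPs): unlike Endpoints, each endpoint carries ready/serving/terminating conditions, so a consumer can keep sending traffic to a terminating-but-still-serving Pod until its replacement becomes Ready, closing the gap seen above:

apiVersion: discovery.k8s.io/v1
kind: EndpointSlice
metadata:
  name: verify-abc12
  namespace: ingress
  labels:
    kubernetes.io/service-name: verify
addressType: IPv4
ports:
- name: http
  port: 8080
  protocol: TCP
endpoints:
- addresses:
  - 10.123.60.15
  conditions:
    ready: true          # receiving normal traffic
    serving: true
    terminating: false
- addresses:
  - 10.123.60.7
  conditions:
    ready: false         # excluded from normal endpoints...
    serving: true        # ...but still able to serve while it drains
    terminating: true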

wasaga added the help wanted (Extra attention is needed) label on Apr 24, 2023