Move liveness checks to user container port rather than QP port #12479
Conversation
Otherwise these get intercepted by the drainer logic in the queue proxy, which is not what we want.

This caused the issue in #12462: we ended up with a liveness probe hitting the QP health-check logic, which worked in the past but now fails, because QP's health checks no longer delegate to the user container when the probe header isn't passed (which is how this worked before).

We could fix #12462 by making QP delegate liveness checks as it did in the past, but I think routing them via QP was always a mistake: QP shouldn't be involved in liveness checks in the first place (note this is already the case for TCP liveness probes, and for the default case where there is no liveness probe).

Initially, liveness and readiness were both routed via the queue proxy so that a single check could cover both the user container and the queue proxy. While this makes some sense for readiness (the activator's optimisations need to check readiness of the whole pod at once), for liveness it is both unnecessary and actively harmful: a failing liveness probe causes Kubernetes to restart the container the probe is defined on (in this case, the user container), but if the failure comes from QP, restarting the user container will not help, and the wrong container gets restarted.

(If we want a liveness probe for the queue proxy, we should add one directly to the QP container rather than routing the user's probe through the QP port; that way the right container is restarted if there's a deadlock in QP. But we can do that separately.)

Fixes #12462.
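To make the intended wiring concrete, here is a minimal Go sketch of the probe layout using the upstream k8s.io/api types. This is illustrative only: it assumes a recent k8s.io/api (where Probe embeds ProbeHandler), and the port numbers, paths, and container name are placeholder assumptions, not Knative's actual defaults.

```go
package main

import (
	"fmt"

	corev1 "k8s.io/api/core/v1"
	"k8s.io/apimachinery/pkg/util/intstr"
)

// Illustrative port numbers (assumptions, not Knative's real constants):
// userPort is the user container's serving port; queueProxyPort is the
// port the queue proxy listens on.
const (
	userPort       = 8080
	queueProxyPort = 8012
)

func main() {
	userContainer := corev1.Container{
		Name: "user-container",
		// After this change, the liveness probe targets the user
		// container's own port. A failure therefore restarts the
		// container the probe is defined on, and the request can no
		// longer be intercepted by the queue proxy's drainer logic.
		LivenessProbe: &corev1.Probe{
			ProbeHandler: corev1.ProbeHandler{
				HTTPGet: &corev1.HTTPGetAction{
					Path: "/healthz",
					Port: intstr.FromInt(userPort),
				},
			},
		},
		// Readiness still goes via the queue proxy port, so a single
		// check covers both the user container and the queue proxy
		// (which the activator relies on for whole-pod readiness).
		ReadinessProbe: &corev1.Probe{
			ProbeHandler: corev1.ProbeHandler{
				HTTPGet: &corev1.HTTPGetAction{
					Path: "/",
					Port: intstr.FromInt(queueProxyPort),
				},
			},
		},
		// If the queue proxy itself ever needs a liveness probe (the
		// separate follow-up suggested above), it would be declared
		// directly on the queue-proxy container, so Kubernetes restarts
		// QP rather than the user container when QP deadlocks.
	}
	fmt.Printf("user container probes: %+v\n", userContainer)
}
```

The key point is the asymmetry: readiness wants the whole-pod view, while liveness needs per-container attribution so that the failing container is the one restarted.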
[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: julz

The full list of commands accepted by this bot can be found here. The pull request process is described here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing `/approve` in a comment.
Codecov Report
@@            Coverage Diff             @@
##             main   #12479      +/-   ##
==========================================
+ Coverage   87.43%   87.44%   +0.01%
==========================================
  Files         195      195
  Lines        9655     9671      +16
==========================================
+ Hits         8442     8457      +15
  Misses        930      930
- Partials      283      284       +1
Continue to review full report at Codecov.
Yeah, we could certainly use some more coverage for liveness probes (although after this change, happily, Knative should be doing much less that could break them)! Let me create an issue.

Edit: => #12480
/lgtm
Can we cherry-pick this change to older branches? I haven't done the digging to see where the regression was introduced.
The issue mentioned v1.1, so let's do that at a minimum.
@dprotaso: cannot checkout

In response to this:

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
whoops |
@dprotaso: cannot checkout

In response to this:

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
sigh - I obviously don't know our branch names lol

/cherry-pick release-1.1
@dprotaso: new pull request created: #12498

In response to this:

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
Looks like the change was introduced in #12318, so v1.1 should be the only cherry-pick needed.