Move liveness checks to user container port rather than QP port #12479

julz · 2022-01-06T13:26:38Z

.. Otherwise these get intercepted by the drainer logic in queue proxy, which is not what we want.

This caused the issue in #12462 since we ended up with a liveness probe hitting the QP health check logic, which had worked in the past, but now fails since QP's health checks no longer delegate to the user container if the header isn't passed (which is how this worked before).

We could fix #12462 by making QP delegate liveness checks as it did in the past, but actually I think this was always a mistake to do: QP shouldn't be involved with liveness checks in the first place (note this is already the case for TCP liveness probes and the default case where there is no liveness probe).

Initially liveness and readiness were routed via the queue proxy so that a single check could check both user container and queue proxy; while this makes some sense for Readiness (because we need this for optimisations in activator that need to check readiness of the whole pod at once), for liveness this is both not needed and actively harmful. A failing liveness probe results in Kubernetes restarting the container the probe is defined on (in this case, the user container), but if the failure comes from QP this will not help (and is actively unhelpful since the wrong container will be restarted).

(If we want to add a liveness probe to Queue Proxy, we should add that directly rather than modifying the user's probe to go via the Queue Proxy port, that way the right thing will be restarted if there's a deadlock in QP - but we can do that separately).

Fixes #12462.

Release Note

Changes liveness probes to directly probe the user container rather than queue proxy.

Otherwise these get intercepted by the drainer logic in queue proxy, which is not what we want. Initially liveness and readiness were routed via the queue proxy so that a single check could check both user container and queue proxy; while this makes some sense for Readiness (because we need this for optimisations in activator that need to check readiness of the whole pod at once), for liveness this is both not needed and actively harmful. A failing liveness probe results in Kubernetes restarting the failing container (in this case, the user container), but if the failure comes from QP this will not help (and is actively unhelpful). TL;dr: If we want to add a liveness probe to Queue Proxy, we should add that directly rather than modifying the user's probe to go via the Queue Proxy port.

knative-prow-robot · 2022-01-06T13:26:47Z

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: julz

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

~~pkg/reconciler/OWNERS~~ [julz]

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

codecov · 2022-01-06T13:32:11Z

Codecov Report

Merging #12479 (847d724) into main (d436895) will increase coverage by 0.01%.
The diff coverage is 100.00%.

@@            Coverage Diff             @@
##             main   #12479      +/-   ##
==========================================
+ Coverage   87.43%   87.44%   +0.01%     
==========================================
  Files         195      195              
  Lines        9655     9671      +16     
==========================================
+ Hits         8442     8457      +15     
  Misses        930      930              
- Partials      283      284       +1

Impacted Files	Coverage Δ
pkg/reconciler/revision/resources/deploy.go	`96.87% <100.00%> (ø)`
pkg/reconciler/configuration/configuration.go	`84.61% <0.00%> (-1.54%)`	⬇️
pkg/apis/config/defaults.go	`87.01% <0.00%> (-0.83%)`	⬇️
pkg/apis/serving/metadata_validation.go	`95.91% <0.00%> (-0.24%)`	⬇️
pkg/apis/serving/v1/route_validation.go	`97.72% <0.00%> (-0.13%)`	⬇️
cmd/queue/main.go	`0.53% <0.00%> (-0.01%)`	⬇️
cmd/activator/main.go	`0.00% <0.00%> (ø)`
pkg/apis/serving/v1/revision_types.go	`100.00% <0.00%> (ø)`
pkg/apis/serving/v1/service_validation.go	`100.00% <0.00%> (ø)`
pkg/reconciler/revision/resources/queue.go	`98.24% <0.00%> (+0.04%)`	⬆️
... and 1 more

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update a794e1f...847d724. Read the comment docs.

nader-ziada · 2022-01-06T13:53:50Z

thanks for fixing this @julz, lgtm, but do you think there is a test that can catch this better as mentioned in the issue #12462? we can also add it separately

julz · 2022-01-06T14:11:08Z

thanks for fixing this @julz, lgtm, but do you think there is a test that can catch this better as mentioned in the issue #12462? we can also add it separately

Yeah we could certainly use some more coverage for liveness probes (although after this change, happily, knative should be doing much much less that could break them)! Let me create an issue

Edit: => #12480

nader-ziada · 2022-01-06T14:30:54Z

/lgtm

dprotaso · 2022-01-12T17:34:15Z

@julz @nader-ziada @psschwei

Can we cherry pick this change to older branches? I haven't done the digging to see where the regression was introduced.

dprotaso · 2022-01-12T17:37:46Z

Issue mentioned v1.1 so let's do that at the minimum
/cherry-pick release-v1.1.0

knative-prow-robot · 2022-01-12T17:38:36Z

@dprotaso: cannot checkout release-v1.1.0: error checking out release-v1.1.0: exit status 1. output: error: pathspec 'release-v1.1.0' did not match any file(s) known to git

In response to this:

Issue mentioned v1.1 so let's do that at the minimum
/cherry-pick release-v1.1.0

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

dprotaso · 2022-01-12T17:45:25Z

whoops
/cherry-pick release-v1.1

knative-prow-robot · 2022-01-12T17:45:59Z

@dprotaso: cannot checkout release-v1.1: error checking out release-v1.1: exit status 1. output: error: pathspec 'release-v1.1' did not match any file(s) known to git

In response to this:

whoops
/cherry-pick release-v1.1

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

dprotaso · 2022-01-12T17:47:10Z

sigh - I obviously don't know our branch names lol

/cherry-pick release-1.1

knative-prow-robot · 2022-01-12T17:47:47Z

@dprotaso: new pull request created: #12498

In response to this:

sigh - I obviously don't know our branch names lol

/cherry-pick release-1.1

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

psschwei · 2022-01-12T18:15:51Z

Looks like the change was introduced in #12318 , so v1.1 should be the only cherry-pick needed.

knative-prow-robot added size/XS Denotes a PR that changes 0-9 lines, ignoring generated files. area/API API objects and controllers labels Jan 6, 2022

knative-prow-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Jan 6, 2022

julz mentioned this pull request Jan 6, 2022

No test coverage for liveness probes #12480

Closed

knative-prow-robot assigned nader-ziada Jan 6, 2022

knative-prow-robot added the lgtm Indicates that a PR is ready to be merged. label Jan 6, 2022

dprotaso added this to the v1.2.0 milestone Jan 6, 2022

knative-prow-robot merged commit 26b9a40 into knative:main Jan 6, 2022

knative-prow-robot mentioned this pull request Jan 12, 2022

[release-1.1] Move liveness checks to user container port rather than QP port #12498

Merged

pradithya mentioned this pull request Nov 4, 2022

feat: Update knative-serving-core to 1.3.2 caraml-dev/helm-charts#68

Merged

2 tasks

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Move liveness checks to user container port rather than QP port #12479

Move liveness checks to user container port rather than QP port #12479

julz commented Jan 6, 2022

knative-prow-robot commented Jan 6, 2022

codecov bot commented Jan 6, 2022 •

edited

Loading

nader-ziada commented Jan 6, 2022

julz commented Jan 6, 2022 •

edited

Loading

nader-ziada commented Jan 6, 2022

dprotaso commented Jan 12, 2022

dprotaso commented Jan 12, 2022

knative-prow-robot commented Jan 12, 2022

dprotaso commented Jan 12, 2022

knative-prow-robot commented Jan 12, 2022

dprotaso commented Jan 12, 2022

knative-prow-robot commented Jan 12, 2022

psschwei commented Jan 12, 2022

Move liveness checks to user container port rather than QP port #12479

Move liveness checks to user container port rather than QP port #12479

Conversation

julz commented Jan 6, 2022

knative-prow-robot commented Jan 6, 2022

codecov bot commented Jan 6, 2022 • edited Loading

Codecov Report

nader-ziada commented Jan 6, 2022

julz commented Jan 6, 2022 • edited Loading

nader-ziada commented Jan 6, 2022

dprotaso commented Jan 12, 2022

dprotaso commented Jan 12, 2022

knative-prow-robot commented Jan 12, 2022

dprotaso commented Jan 12, 2022

knative-prow-robot commented Jan 12, 2022

dprotaso commented Jan 12, 2022

knative-prow-robot commented Jan 12, 2022

psschwei commented Jan 12, 2022

codecov bot commented Jan 6, 2022 •

edited

Loading

julz commented Jan 6, 2022 •

edited

Loading