Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Move liveness checks to user container port rather than QP port #12479

Merged
merged 1 commit into from
Jan 6, 2022

Conversation

julz
Copy link
Member

@julz julz commented Jan 6, 2022

.. Otherwise these get intercepted by the drainer logic in queue proxy, which is not what we want.

This caused the issue in #12462 since we ended up with a liveness probe hitting the QP health check logic, which had worked in the past, but now fails since QP's health checks no longer delegate to the user container if the header isn't passed (which is how this worked before).

We could fix #12462 by making QP delegate liveness checks as it did in the past, but actually I think this was always a mistake to do: QP shouldn't be involved with liveness checks in the first place (note this is already the case for TCP liveness probes and the default case where there is no liveness probe).

Initially liveness and readiness were routed via the queue proxy so that a single check could check both user container and queue proxy; while this makes some sense for Readiness (because we need this for optimisations in activator that need to check readiness of the whole pod at once), for liveness this is both not needed and actively harmful. A failing liveness probe results in Kubernetes restarting the container the probe is defined on (in this case, the user container), but if the failure comes from QP this will not help (and is actively unhelpful since the wrong container will be restarted).

(If we want to add a liveness probe to Queue Proxy, we should add that directly rather than modifying the user's probe to go via the Queue Proxy port, that way the right thing will be restarted if there's a deadlock in QP - but we can do that separately).

Fixes #12462.

Release Note

Changes liveness probes to directly probe the user container rather than queue proxy.

Otherwise these get intercepted by the drainer logic in queue proxy, which is
not what we want.

Initially liveness and readiness were routed via the queue proxy so that a
single check could check both user container and queue proxy; while this makes
some sense for Readiness (because we need this for optimisations in activator
that need to check readiness of the whole pod at once), for liveness this is
both not needed and actively harmful. A failing liveness probe results in
Kubernetes restarting the failing container (in this case, the user container),
but if the failure comes from QP this will not help (and is actively
unhelpful).  TL;dr: If we want to add a liveness probe to Queue Proxy, we should
add that directly rather than modifying the user's probe to go via the Queue
Proxy port.
@knative-prow-robot knative-prow-robot added size/XS Denotes a PR that changes 0-9 lines, ignoring generated files. area/API API objects and controllers labels Jan 6, 2022
@knative-prow-robot
Copy link
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: julz

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@knative-prow-robot knative-prow-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Jan 6, 2022
@codecov
Copy link

codecov bot commented Jan 6, 2022

Codecov Report

Merging #12479 (847d724) into main (d436895) will increase coverage by 0.01%.
The diff coverage is 100.00%.

Impacted file tree graph

@@            Coverage Diff             @@
##             main   #12479      +/-   ##
==========================================
+ Coverage   87.43%   87.44%   +0.01%     
==========================================
  Files         195      195              
  Lines        9655     9671      +16     
==========================================
+ Hits         8442     8457      +15     
  Misses        930      930              
- Partials      283      284       +1     
Impacted Files Coverage Δ
pkg/reconciler/revision/resources/deploy.go 96.87% <100.00%> (ø)
pkg/reconciler/configuration/configuration.go 84.61% <0.00%> (-1.54%) ⬇️
pkg/apis/config/defaults.go 87.01% <0.00%> (-0.83%) ⬇️
pkg/apis/serving/metadata_validation.go 95.91% <0.00%> (-0.24%) ⬇️
pkg/apis/serving/v1/route_validation.go 97.72% <0.00%> (-0.13%) ⬇️
cmd/queue/main.go 0.53% <0.00%> (-0.01%) ⬇️
cmd/activator/main.go 0.00% <0.00%> (ø)
pkg/apis/serving/v1/revision_types.go 100.00% <0.00%> (ø)
pkg/apis/serving/v1/service_validation.go 100.00% <0.00%> (ø)
pkg/reconciler/revision/resources/queue.go 98.24% <0.00%> (+0.04%) ⬆️
... and 1 more

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update a794e1f...847d724. Read the comment docs.

@nader-ziada
Copy link
Member

thanks for fixing this @julz, lgtm, but do you think there is a test that can catch this better as mentioned in the issue #12462? we can also add it separately

@julz
Copy link
Member Author

julz commented Jan 6, 2022

thanks for fixing this @julz, lgtm, but do you think there is a test that can catch this better as mentioned in the issue #12462? we can also add it separately

Yeah we could certainly use some more coverage for liveness probes (although after this change, happily, knative should be doing much much less that could break them)! Let me create an issue

Edit: => #12480

@nader-ziada
Copy link
Member

/lgtm

@knative-prow-robot knative-prow-robot added the lgtm Indicates that a PR is ready to be merged. label Jan 6, 2022
@dprotaso dprotaso added this to the v1.2.0 milestone Jan 6, 2022
@knative-prow-robot knative-prow-robot merged commit 26b9a40 into knative:main Jan 6, 2022
@dprotaso
Copy link
Member

@julz @nader-ziada @psschwei

Can we cherry pick this change to older branches? I haven't done the digging to see where the regression was introduced.

@dprotaso
Copy link
Member

Issue mentioned v1.1 so let's do that at the minimum
/cherry-pick release-v1.1.0

@knative-prow-robot
Copy link
Contributor

@dprotaso: cannot checkout release-v1.1.0: error checking out release-v1.1.0: exit status 1. output: error: pathspec 'release-v1.1.0' did not match any file(s) known to git

In response to this:

Issue mentioned v1.1 so let's do that at the minimum
/cherry-pick release-v1.1.0

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@dprotaso
Copy link
Member

whoops
/cherry-pick release-v1.1

@knative-prow-robot
Copy link
Contributor

@dprotaso: cannot checkout release-v1.1: error checking out release-v1.1: exit status 1. output: error: pathspec 'release-v1.1' did not match any file(s) known to git

In response to this:

whoops
/cherry-pick release-v1.1

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@dprotaso
Copy link
Member

sigh - I obviously don't know our branch names lol

/cherry-pick release-1.1

@knative-prow-robot
Copy link
Contributor

@dprotaso: new pull request created: #12498

In response to this:

sigh - I obviously don't know our branch names lol

/cherry-pick release-1.1

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@psschwei
Copy link
Contributor

Looks like the change was introduced in #12318 , so v1.1 should be the only cherry-pick needed.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
approved Indicates a PR has been approved by an approver from all required OWNERS files. area/API API objects and controllers lgtm Indicates that a PR is ready to be merged. size/XS Denotes a PR that changes 0-9 lines, ignoring generated files.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Liveness probe failing with 400
5 participants