Failing liveness probes can cause service degradation #14980
Comments
@ReToCode When K8s restarts the affected user container, QP's readinessProbe check should also fail, because it inherits the original readinessProbe configuration of the user container, so traffic should not go to this pod, right? So I was wondering why the situation you described happens. Please correct me if I'm wrong.
Yes, this is correct. When your readiness probe also fails, this is not an issue. It only happens when the liveness and readiness probes are two different probes: the liveness probe fails and K8s restarts the specific container (not the Pod), while the readiness probe may still report healthy or take longer to start failing. During that short period until the container is back up, requests can error out.
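The condition described above can be sketched as a small check over a container spec. This is only an illustration, assuming the spec is available as a plain dict with the usual `livenessProbe` and `readinessProbe` keys; the function name `liveness_only_risk` is hypothetical and not part of Knative or Kubernetes.

```python
from typing import Any, Dict


def liveness_only_risk(container: Dict[str, Any]) -> bool:
    """Hypothetical helper: True when a liveness failure may not be
    mirrored by the readiness probe of the same container."""
    liveness = container.get("livenessProbe")
    readiness = container.get("readinessProbe")
    if liveness is None:
        # No liveness probe defined, so it cannot trigger a container restart.
        return False
    # A missing or different readiness probe means the container can be
    # restarted while readiness has not (yet) started failing.
    return readiness is None or readiness != liveness


if __name__ == "__main__":
    probe = {"httpGet": {"path": "/healthz", "port": 8080}}
    print(liveness_only_risk({"livenessProbe": probe}))
    print(liveness_only_risk({"livenessProbe": probe, "readinessProbe": probe}))
```

The comparison is deliberately strict: probes that run the same check with different timing settings are also flagged.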
@ReToCode OK, I got your point. So perhaps we need tighter control over the livenessProbe, similar to what we already do for the readinessProbe?
I suppose this problem can also occur when the livenessProbe has a different PeriodSeconds and/or FailureThreshold from the readinessProbe. In that case, the readiness probe might detect the failure at a different time and not remove the Endpoints from the Service quickly enough.
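A rough worked example of that timing difference, under a simplified model: a probe that runs every `periodSeconds` and needs `failureThreshold` consecutive failures reacts at most about `periodSeconds * failureThreshold` seconds after the failure begins. The model ignores `timeoutSeconds` and the offset between the two probe schedules, and the concrete values below are made up for illustration.

```python
def worst_case_detection_seconds(period_seconds: int = 10,
                                 failure_threshold: int = 3) -> int:
    """Approximate upper bound on the time a probe needs to report a failure.

    Defaults mirror the Kubernetes probe defaults (periodSeconds=10,
    failureThreshold=3). Simplification: timeoutSeconds is ignored.
    """
    return period_seconds * failure_threshold


# Illustrative values only: an aggressive liveness probe, a default readiness probe.
liveness = worst_case_detection_seconds(period_seconds=5, failure_threshold=1)
readiness = worst_case_detection_seconds(period_seconds=10, failure_threshold=3)

# Rough bound on how much later readiness may react than liveness.
gap = readiness - liveness

print(f"liveness reacts within ~{liveness}s, readiness within ~{readiness}s, gap ~{gap}s")
```

With these numbers the container may already be restarting while the readiness probe has not yet accumulated enough failures, which matches the window described in this thread.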
@ReToCode I'm seeing this same Could it be that the queue proxy is running its own liveness probe against the user container and that's causing the issue you're describing? The error happens immediately once the service pod becomes ready, and when it happens, the request doesn't go through and get processed. Really appreciate any pointers!
Description
Users are allowed to define a livenessProbe on their container (for now just on the main one, with #14853 also on all sidecars). If the user defines just a livenessProbe without also defining the same check as readinessProbe, we can have the following situation: