Sporadic 504 timeouts since 1.22 upgrade #51660
Comments
Just to update here, over 48hrs since we reverted and we've not had any reoccurrence of this issue.
Looking at the spans, it looks like 90s of ingress active, then some delay, then the rest of the apps (quick). There's a huge period of time where nothing is active. Is there some explanation for that behavior, or is that part of the issue?
Yes, that is the issue.
The best way I can describe it is that there's something blocking in the source system sidecar, preventing that request from getting sent (despite the span being recorded), or equally, the destination sidecar is not forwarding it to the local app in a timely manner.
I honestly don't know where to go from here, I kinda feel stuck. I've tried everything I could think of to deterministically reproduce it and haven't been able to do so. It's so infrequent and only seems to occur in higher load environments, and the pods are utterly random. Consequently I can't possibly run a wireshark or similar for long enough to catch it on all the possible pods that it could be. The only way I'd actually capture it would be pcaps on every node and turning on debug proxy logs everywhere, which, given the volume of the environments, isn't really feasible. Equally, I fully appreciate that there isn't remotely enough information here for Istio folks to be able to help me debug it either. So I'm kinda 🤷 It just felt right to share, as I do not get this on 1.21 sidecars - just to see if anyone else experiences it too.
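When I say turning on debug proxy logs, I mean something along the lines of the per-pod annotation sketched below. This is illustrative only (the pod name and image are placeholders, and I'm assuming the usual sidecar.istio.io/logLevel annotation); the point is that applying it to every workload at our volumes produces an unmanageable amount of log output.

```yaml
# Sketch only: raising the Envoy sidecar log level for a single pod via the
# standard Istio annotation. Pod name and image are placeholders.
apiVersion: v1
kind: Pod
metadata:
  name: example-app
  annotations:
    sidecar.istio.io/logLevel: "debug"   # debug-level proxy logging for this pod only
spec:
  containers:
    - name: app
      image: example-app:latest
```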
So a quick update here with the release of 1.22.2: so far I haven't seen a reoccurrence of this issue, which could mean one of a few things.
Spoke too soon, just had a batch on 1.22.2: single pod, single node, other things on that node not affected, therefore you can infer it's a pod issue. I do in parallel have an open ticket with GCP because we're seeing other strange things today that feel networky (I know...), but this one to me feels really isolated to the pod. It's so intermittent though; this workload has been on 1.22.2 since Friday.
Is this the right place to submit this?
Bug Description
Hello!
Back with another issue I'm struggling to diagnose since the 1.22 upgrade.
Really infrequently, we're seeing bursts of 504s. A single pod, for a period of time (a few seconds), will fail to talk to all its upstreams.
The traces are extra weird. What we always see is a span that times out with a 504 UT, where the timeout is the timeout configured on the destination service's VirtualService (as you'd expect). However, the request does eventually get handled by that destination workload (and the span for its handling of the request is rapid, as you'd expect). Here's an example:
Here's another example:
There are countless examples where, for that brief moment, a single source pod gets this behaviour with N destination workloads.
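For anyone reading along: the UT on the 504 is Envoy's upstream request timeout response flag, and the timeout it honours is the one configured on the route in the destination service's VirtualService, roughly this shape (illustrative only; hosts, names and the 10s value are placeholders, not our actual config):

```yaml
# Illustrative only: where the per-route timeout behind the 504 UT lives.
apiVersion: networking.istio.io/v1beta1
kind: VirtualService
metadata:
  name: example-destination
  namespace: example-ns
spec:
  hosts:
    - example-destination.example-ns.svc.cluster.local
  http:
    - route:
        - destination:
            host: example-destination.example-ns.svc.cluster.local
      # If the upstream doesn't respond within this window, the calling sidecar
      # gives up and returns a 504 with the UT response flag.
      timeout: 10s
```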
The problem is that it's so sporadic, like a couple of times per day, for a handful of requests out of many millions, on completely non-deterministic workloads. As a result I have no chance at all of capturing it with a packet trace.
I appreciate there's just not enough information here to actually debug. I've just spent 3 days looking into this and felt I should share it in case others had observed the same behaviour too.
I have ruled out everything possible that isn't Istio. Today I've completely rolled back from 1.22 to 1.21 everywhere, and the issues appear to have stopped. However, I won't know conclusively for another few days.
Version
Additional Information
Obviously this is an ongoing investigation for me, so if I find anything else that helps, I'll add to this thread.