
Sporadic 504 timeouts since 1.22 upgrade #51660

Open
2 tasks done
Stono opened this issue Jun 21, 2024 · 7 comments

@Stono
Contributor

Stono commented Jun 21, 2024

Is this the right place to submit this?

  • This is not a security vulnerability or a crashing bug
  • This is not a question about how to use Istio

Bug Description

Hello!
Back with another issue I'm struggling to diagnose since the 1.22 upgrade.

Really infrequently, we're seeing bursts of 504s. A single pod, for a period of time (a few seconds), will fail to talk to all of its upstreams.

The traces are extra weird. What we always see is a span that times out with a 504 UT; the timeout matches the timeout configured on the destination service's VirtualService (as you'd expect). However, the request does eventually get handled by that destination workload (and the span for its handling of the request is rapid, as you'd expect).
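For context, the timeout in question is the per-route HTTP timeout on the destination service's VirtualService, i.e. roughly this shape of config (a hypothetical sketch with placeholder names and the 90s value from the examples below, not our actual config):

```yaml
# Hypothetical sketch only; names are placeholders, not our real config.
apiVersion: networking.istio.io/v1beta1
kind: VirtualService
metadata:
  name: destination-service
spec:
  hosts:
    - destination-service
  http:
    - timeout: 90s             # the per-route timeout that surfaces as the 504 UT at the source sidecar
      route:
        - destination:
            host: destination-service
```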

Here's an example:
Screenshot 2024-06-21 at 15 17 47

Here's another example:
Screenshot 2024-06-21 at 15 19 05

There are countless examples where, for a brief moment, a single source pod gets this behaviour with N destination workloads.

The problem is that it's so sporadic, like a couple of times per day, for a handful of requests out of many millions, on completely non-deterministic workloads. As a result I have no chance at all of capturing it with a packet trace.

I appreciate there's just not enough information here to actually debug this. I've just spent 3 days looking into it and felt I should share it in case others have observed the same behaviour too.

I have ruled out everything possible that isn't Istio. Today I've completely rolled back 1.22 to 1.21 everywhere, and the issues appear to have stopped. However, I won't know conclusively for another few days.

Version

1.22.1

Additional Information

Obviously this is an ongoing investigation for me, so if I find anything else to help, I'll add to this thread.

@Stono
Contributor Author

Stono commented Jun 24, 2024

Just to update here: it's been over 48hrs since we reverted and we've not had any recurrence of this issue.

@howardjohn
Member

Looking at the spans, it looks like 90s of the ingress span being active, then some delay, then the rest of the apps (quick). There's a huge period of time where nothing is active.

Is there some explanation for that behavior, or is that part of the issue?

@Stono
Contributor Author

Stono commented Jun 24, 2024

Yes that is the issue.
What we see, in the example above, would be:

  • nginx -> coordinated-web-service 👍
  • coordinated-web-service logs inbound request 👍
  • coordinated-web-service logs outbound request 👍
  • the request hits the destination service's timeout (in this example, 90s); an error is returned.
  • some time later, the request is received and processed by the downstream system.

The best way I can describe it is that there's something blocking in the source system's sidecar, preventing that request from being sent (despite the span being recorded), or equally, the destination sidecar is not forwarding it to the local app in a timely manner.

@Stono
Contributor Author

Stono commented Jun 24, 2024

And this will happen for all requests to one or more destination hosts from a single pod for say 5-10s, then it'll all recover.

And to be clear, it'll just be one pod. So in the above example, let's say there are 3 source pods and 3 destination pods; just one of the source pods will stop being able to communicate outbound to N of its downstream systems.

Screenshot 2024-06-24 at 21 20 43

The fact that it happens on one source pod indicates to me that the problem is at the source, rather than the destination.

@Stono
Contributor Author

Stono commented Jun 24, 2024

I honestly don't know where to go from here; I kinda feel stuck. I've tried everything I could think of to deterministically reproduce it and haven't been able to do so. It's so infrequent and only seems to occur in higher-load environments, and the pods are utterly random. Consequently, I can't possibly run Wireshark or similar for long enough to catch it on all the possible pods it could be. The only way I'd actually capture it would be pcaps on every node and turning on debug proxy logs everywhere, which, given the volume of the environments, isn't really feasible.
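If I could predict which pod it was going to be, targeting just that workload with verbose sidecar logging would look roughly like this (a sketch, assuming the standard sidecar.istio.io/logLevel annotation; the workload and image names are placeholders):

```yaml
# Sketch only: placeholder names, assumes the sidecar.istio.io/logLevel annotation is honoured by the injector.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: suspect-workload
spec:
  selector:
    matchLabels:
      app: suspect-workload
  template:
    metadata:
      labels:
        app: suspect-workload
      annotations:
        sidecar.istio.io/logLevel: debug   # full Envoy debug logging for this pod's sidecar only
    spec:
      containers:
        - name: app
          image: example/app:latest        # placeholder application image
```

But with the pods being utterly random, there's no single workload to target.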

Equally, I fully appreciate that there isn't remotely enough information here for Istio folks to be able to help me debug it either, so I'm kinda 🤷. It just felt right to share, as I do not get this on 1.21 sidecars, just to see if anyone else experiences it too.

@Stono
Contributor Author

Stono commented Jul 1, 2024

So a quick update here, with the release of 1.22.2, I rolled forward again - but this time I set ISTIO_DELTA_XDS: false due to #51612.
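For anyone wanting to try the same, one way to set that flag is via proxyMetadata (a sketch, assuming an IstioOperator-based install; the same key can also be set per-workload through the proxy.istio.io/config annotation):

```yaml
# Sketch: disables delta XDS mesh-wide by injecting ISTIO_DELTA_XDS into every proxy's metadata.
apiVersion: install.istio.io/v1alpha1
kind: IstioOperator
spec:
  meshConfig:
    defaultConfig:
      proxyMetadata:
        ISTIO_DELTA_XDS: "false"
```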

So far I haven't seen a recurrence of this issue, which means one of the following:

  • It's some interaction with Delta XDS
  • It was fixed in 1.22.2
  • It was nothing to do with Istio, and some other network-level problem was manifesting at exactly the same time and stopped when I rolled back (unlikely, but I can't prove it either way!)

@Stono
Contributor Author

Stono commented Jul 2, 2024

Spoke too soon; we just had a batch on 1.22.2: single pod, single node, with other things on that node not affected, so you can infer it's a pod issue.

Screenshot 2024-07-02 at 19 08 53

I do, in parallel, have an open ticket with GCP because we're seeing other strange things today that feel networky (I know...), but this one to me feels really isolated to the pod.

It's so intermittent though; this workload has been on 1.22.2 since Friday.
