
Sporadic 504 timeouts since 1.22 upgrade #51660

Open
2 tasks done
Stono opened this issue Jun 21, 2024 · 7 comments

@Stono
Contributor

Stono commented Jun 21, 2024

Is this the right place to submit this?

  • This is not a security vulnerability or a crashing bug
  • This is not a question about how to use Istio

Bug Description

Hello!
Back with another issue I'm struggling to diagnose since the 1.22 upgrade.

Really infrequently, we're seeing bursts of 504s. A single pod, for a period of time (a few seconds), will fail to talk to all of its upstreams.

The traces are extra weird. What we always see is a span that times out with a 504 UT; the timeout matches the timeout configured on the destination service's VirtualService (as you'd expect). However, the request does eventually get handled by that destination workload (and the span for its handling of the request is rapid, as you'd expect).
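For context, the timeout in question is the per-route HTTP timeout on the destination service's VirtualService, i.e. roughly this shape of config (a hypothetical sketch with placeholder names and the 90s value from the examples below, not our actual config):

```yaml
# Hypothetical sketch only; names are placeholders, not our real config.
apiVersion: networking.istio.io/v1beta1
kind: VirtualService
metadata:
  name: destination-service
spec:
  hosts:
    - destination-service
  http:
    - timeout: 90s             # the per-route timeout that surfaces as the 504 UT at the source sidecar
      route:
        - destination:
            host: destination-service
```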

Here's an example:
Screenshot 2024-06-21 at 15 17 47

Here's another example:
Screenshot 2024-06-21 at 15 19 05

There are countless examples where, for a brief moment, a single source pod gets this behaviour with N destination workloads.

The problem is that it's so sporadic, like a couple of times per day, for a handful of requests out of many millions, on completely non-deterministic workloads. As a result I have no chance at all of capturing it with a packet trace.

I appreciate there's just not enough information here to actually debug this. I've just spent 3 days looking into it and felt I should share it in case others have observed the same behaviour too.

I have ruled out everything possible that isn't Istio. Today I've completely rolled back 1.22 to 1.21 everywhere, and the issues appear to have stopped. However, I won't know conclusively for another few days.

Version

1.22.1

Additional Information

Obviously this is an ongoing investigation for me, so if I find anything else to help, I'll add to this thread.

@Stono
Contributor Author

Stono commented Jun 24, 2024

Just to update here: it's been over 48hrs since we reverted and we've not had any recurrence of this issue.

@howardjohn
Member

Looking at the spans, it looks like 90s of the ingress span being active, then some delay, then the rest of the apps (quick). There's a huge period of time where nothing is active.

Is there some explanation for that behavior, or is that part of the issue?

@Stono
Contributor Author

Stono commented Jun 24, 2024

Yes that is the issue.
What we see, in the example above, would be:

  • nginx -> coordinated-web-service 👍
  • coordinated-web-service logs inbound request 👍
  • coordinated-web-service logs outbound request 👍
  • the request hits the destination service's timeout (in this example, 90s); an error is returned.
  • some time later, the request is received and processed by the downstream system.

The best way I can describe it is that there's something blocking in the source system's sidecar, preventing that request from being sent (despite the span being recorded), or equally, the destination sidecar is not forwarding it to the local app in a timely manner.

@Stono
Contributor Author

Stono commented Jun 24, 2024

And this will happen for all requests to one or more destination hosts from a single pod for say 5-10s, then it'll all recover.

And to be clear, it'll just be one pod. So in the above example, let's say there are 3 source pods and 3 destination pods; just one of the source pods will stop being able to communicate outbound to N of its downstream systems.

Screenshot 2024-06-24 at 21 20 43

The fact that it happens on one source pod indicates to me that the problem is at the source, rather than the destination.

@Stono
Contributor Author

Stono commented Jun 24, 2024

I honestly don't know where to go from here; I kinda feel stuck. I've tried everything I could think of to deterministically reproduce it and haven't been able to do so. It's so infrequent and only seems to occur in higher-load environments, and the pods are utterly random. Consequently, I can't possibly run Wireshark or similar for long enough to catch it on all the possible pods it could be. The only way I'd actually capture it would be pcaps on every node and turning on debug proxy logs everywhere, which, given the volume of the environments, isn't really feasible.
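If I could predict which pod it was going to be, targeting just that workload with verbose sidecar logging would look roughly like this (a sketch, assuming the standard sidecar.istio.io/logLevel annotation; the workload and image names are placeholders):

```yaml
# Sketch only: placeholder names, assumes the sidecar.istio.io/logLevel annotation is honoured by the injector.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: suspect-workload
spec:
  selector:
    matchLabels:
      app: suspect-workload
  template:
    metadata:
      labels:
        app: suspect-workload
      annotations:
        sidecar.istio.io/logLevel: debug   # full Envoy debug logging for this pod's sidecar only
    spec:
      containers:
        - name: app
          image: example/app:latest        # placeholder application image
```

But with the pods being utterly random, there's no single workload to target.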

Equally, I fully appreciate that there isn't remotely enough information here for Istio folks to be able to help me debug it either, so I'm kinda 🤷. It just felt right to share, as I do not get this on 1.21 sidecars, just to see if anyone else experiences it too.

@Stono
Contributor Author

Stono commented Jul 1, 2024

So a quick update here, with the release of 1.22.2, I rolled forward again - but this time I set ISTIO_DELTA_XDS: false due to #51612.
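For anyone wanting to try the same, one way to set that flag is via proxyMetadata (a sketch, assuming an IstioOperator-based install; the same key can also be set per-workload through the proxy.istio.io/config annotation):

```yaml
# Sketch: disables delta XDS mesh-wide by injecting ISTIO_DELTA_XDS into every proxy's metadata.
apiVersion: install.istio.io/v1alpha1
kind: IstioOperator
spec:
  meshConfig:
    defaultConfig:
      proxyMetadata:
        ISTIO_DELTA_XDS: "false"
```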

So far I haven't seen a recurrence of this issue, which means one of the following:

  • It's some interaction with Delta XDS
  • It was fixed in 1.22.2
  • It was nothing to do with Istio, and some other network-level problem was manifesting at exactly the same time and stopped when I rolled back (unlikely, but I can't prove it either way!)

@Stono
Contributor Author

Stono commented Jul 2, 2024

Spoke too soon; we just had a batch on 1.22.2: single pod, single node, with other things on that node not affected, so you can infer it's a pod issue.

Screenshot 2024-07-02 at 19 08 53

I do, in parallel, have an open ticket with GCP because we're seeing other strange things today that feel networky (I know...), but this one to me feels really isolated to the pod.

It's so intermittent though; this workload has been on 1.22.2 since Friday.
