RDS Stale since 1.22.1 upgrade #51612

Stono · 2024-06-18T08:14:22Z

Is this the right place to submit this?

This is not a security vulnerability or a crashing bug
This is not a question about how to use Istio

Bug Description

Hi,
I upgraded some of our clusters from 1.21 to 1.22.1 today, and our alerts picked up RDS marked as stale:

❯ istioctl proxy-status | grep STALE
airflow-scheduler-6676c4688-kd4cc.data-platform-airflow                                                Kubernetes     SYNCED     SYNCED     SYNCED     STALE (Never Acknowledged)     NOT SENT     istiod-b59c9fff9-hxs5r     1.22.1
buildabot-686b968545-d7f7k.buildabot                                                                   Kubernetes     SYNCED     SYNCED     SYNCED     STALE (Never Acknowledged)     NOT SENT     istiod-b59c9fff9-xrqr6     1.22.1
buildabot-686b968545-qn4wc.buildabot                                                                   Kubernetes     SYNCED     SYNCED     SYNCED     STALE (Never Acknowledged)     NOT SENT     istiod-b59c9fff9-bldvv     1.22.1
rabbitmq-0.rabbitmq                                                                                    Kubernetes     SYNCED     SYNCED     SYNCED     STALE (Never Acknowledged)     NOT SENT     istiod-b59c9fff9-hxs5r     1.22.1
rabbitmq-1.rabbitmq                                                                                    Kubernetes     SYNCED     SYNCED     SYNCED     STALE (Never Acknowledged)     NOT SENT     istiod-b59c9fff9-xrqr6     1.22.1
rabbitmq-2.rabbitmq                                                                                    Kubernetes     SYNCED     SYNCED     SYNCED     STALE (Never Acknowledged)     NOT SENT     istiod-b59c9fff9-hxs5r     1.22.1

Istio docs say: "STALE means that Pilot has sent an update to Envoy but has not received an acknowledgement. This usually indicates a networking issue between Envoy and Pilot or a bug with Istio itself."
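To alert on this condition it can help to turn the `istioctl proxy-status` table into structured rows. The following is a minimal sketch, not an Istio tool: the column order in `COLUMNS` is an assumption (the header row is not shown in the output above), and `parse_proxy_status` and `stale_types` are hypothetical helper names. Fields are split on runs of two or more spaces because values such as `STALE (Never Acknowledged)` and `NOT SENT` contain single spaces.

```python
import re

# Assumed column order for `istioctl proxy-status`; the header is not shown
# in the report, so adjust this list if your istioctl prints something else.
COLUMNS = ["name", "cluster", "cds", "lds", "eds", "rds", "ecds", "istiod", "version"]
XDS_COLUMNS = ("cds", "lds", "eds", "rds", "ecds")


def parse_proxy_status(text):
    """Return one dict per data row; fields are separated by 2+ spaces."""
    rows = []
    for line in text.splitlines():
        fields = re.split(r"\s{2,}", line.strip())
        if len(fields) != len(COLUMNS) or fields[0] == "NAME":
            continue  # skip the header, blank lines, and unexpected rows
        rows.append(dict(zip(COLUMNS, fields)))
    return rows


def stale_types(row):
    """List the xDS types whose status starts with STALE for this proxy."""
    return [c.upper() for c in XDS_COLUMNS if row[c].startswith("STALE")]


# One row copied from the output above.
SAMPLE = (
    "rabbitmq-0.rabbitmq     Kubernetes     SYNCED     SYNCED     SYNCED     "
    "STALE (Never Acknowledged)     NOT SENT     istiod-b59c9fff9-hxs5r     1.22.1\n"
)

for row in parse_proxy_status(SAMPLE):
    print(row["name"], stale_types(row), row["istiod"])
```

Grouping the result by the `istiod` field would show whether the stale proxies cluster on one control-plane pod; in the output above they are spread across three.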

There's nothing telling in the istiod logs, or the proxies for these apps:

❯ kis logs -l istio=pilot --tail=10000000 | grep airflow-scheduler
{"level":"info","time":"2024-06-18T07:55:02.442241Z","scope":"delta","msg":"CDS: PUSH for node:airflow-scheduler-6676c4688-kd4cc.data-platform-airflow resources:3 removed:1 size:717B"}
{"level":"info","time":"2024-06-18T07:55:02.442278Z","scope":"delta","msg":"EDS: PUSH INC for node:airflow-scheduler-6676c4688-kd4cc.data-platform-airflow resources:0 removed:1 size:0B empty:0 cached:0/0"}
{"level":"info","time":"2024-06-18T07:55:02.442808Z","scope":"delta","msg":"LDS: PUSH for node:airflow-scheduler-6676c4688-kd4cc.data-platform-airflow resources:6 removed:1 size:55.8kB"}
{"level":"info","time":"2024-06-18T07:55:02.443387Z","scope":"delta","msg":"RDS: PUSH for node:airflow-scheduler-6676c4688-kd4cc.data-platform-airflow resources:4 removed:0 size:7.7kB cached:0/4"}
{"level":"info","time":"2024-06-18T07:55:06.791285Z","scope":"delta","msg":"CDS: PUSH for node:airflow-scheduler-6676c4688-kd4cc.data-platform-airflow resources:13 removed:0 size:10.9kB cached:10/10"}
{"level":"info","time":"2024-06-18T07:55:06.791790Z","scope":"delta","msg":"LDS: PUSH for node:airflow-scheduler-6676c4688-kd4cc.data-platform-airflow resources:6 removed:0 size:55.8kB"}
{"level":"info","time":"2024-06-18T07:55:06.797804Z","scope":"delta","msg":"RDS: PUSH for node:airflow-scheduler-6676c4688-kd4cc.data-platform-airflow resources:4 removed:0 size:7.7kB cached:0/4"}
{"level":"info","time":"2024-06-18T02:23:09.373215Z","scope":"delta","msg":"ADS: new delta connection for node:airflow-scheduler-6676c4688-djdqp.data-platform-airflow-896"}
{"level":"info","time":"2024-06-18T02:23:09.373489Z","scope":"delta","msg":"CDS: PUSH request for node:airflow-scheduler-6676c4688-djdqp.data-platform-airflow resources:13 removed:0 size:10.9kB cached:8/10"}
{"level":"info","time":"2024-06-18T02:23:09.393717Z","scope":"delta","msg":"EDS: PUSH request for node:airflow-scheduler-6676c4688-djdqp.data-platform-airflow resources:8 removed:0 size:1.8kB empty:0 cached:6/8 filtered:0"}
{"level":"info","time":"2024-06-18T02:23:09.451247Z","scope":"delta","msg":"LDS: PUSH request for node:airflow-scheduler-6676c4688-djdqp.data-platform-airflow resources:6 removed:0 size:55.8kB"}
{"level":"info","time":"2024-06-18T02:23:09.461756Z","scope":"delta","msg":"RDS: PUSH request for node:airflow-scheduler-6676c4688-djdqp.data-platform-airflow resources:3 removed:0 size:7.7kB cached:0/3 filtered:0"}
{"level":"info","time":"2024-06-18T02:23:13.541136Z","scope":"delta","msg":"ADS: \"10.198.14.251:51216\" airflow-scheduler-6676c4688-z2jq4.data-platform-airflow-437 terminated"}
{"level":"info","time":"2024-06-18T02:24:05.717016Z","scope":"delta","msg":"CDS: PUSH for node:airflow-scheduler-6676c4688-djdqp.data-platform-airflow resources:13 removed:0 size:10.9kB cached:8/10"}
{"level":"info","time":"2024-06-18T02:24:05.717131Z","scope":"delta","msg":"EDS: PUSH for node:airflow-scheduler-6676c4688-djdqp.data-platform-airflow resources:8 removed:0 size:2.1kB empty:0 cached:7/8"}
{"level":"info","time":"2024-06-18T02:24:05.717680Z","scope":"delta","msg":"LDS: PUSH for node:airflow-scheduler-6676c4688-djdqp.data-platform-airflow resources:6 removed:0 size:55.8kB"}
{"level":"info","time":"2024-06-18T02:24:05.718059Z","scope":"delta","msg":"RDS: PUSH for node:airflow-scheduler-6676c4688-djdqp.data-platform-airflow resources:3 removed:0 size:7.7kB cached:0/3"}
{"level":"info","time":"2024-06-18T04:23:20.415715Z","scope":"delta","msg":"ADS: \"10.198.0.170:56930\" airflow-scheduler-6676c4688-djdqp.data-platform-airflow-896 terminated"}
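Because the lines above come from several istiod pods, they are not in time order, which makes it hard to see the push sequence for a single proxy. A minimal sketch for grouping them, assuming the JSON log shape shown above (`time` and `msg` keys, fixed-width UTC timestamps); `pushes_by_node` is a hypothetical helper, and the regular expression only covers the `PUSH`, `PUSH INC`, and `PUSH request` message variants that appear in this excerpt.

```python
import json
import re
from collections import defaultdict

# Matches e.g. "RDS: PUSH for node:<id> resources:4 ..." plus the
# "PUSH INC" and "PUSH request" variants seen in the excerpt above.
PUSH_RE = re.compile(
    r"^(?P<type>[A-Z]+): PUSH(?: \w+)? for node:(?P<node>\S+) resources:(?P<n>\d+)"
)


def pushes_by_node(lines):
    """Return {node: [(time, xds_type, resource_count), ...]} sorted by time."""
    out = defaultdict(list)
    for line in lines:
        try:
            entry = json.loads(line)
        except json.JSONDecodeError:
            continue  # tolerate non-JSON lines mixed into the stream
        if not isinstance(entry, dict):
            continue
        m = PUSH_RE.match(entry.get("msg", ""))
        if m and "time" in entry:
            out[m["node"]].append((entry["time"], m["type"], int(m["n"])))
    for events in out.values():
        events.sort()  # same-format ISO-8601 UTC strings sort chronologically
    return dict(out)


# Two lines copied from the excerpt above, deliberately out of order.
SAMPLE = [
    '{"level":"info","time":"2024-06-18T07:55:06.797804Z","scope":"delta","msg":"RDS: PUSH for node:airflow-scheduler-6676c4688-kd4cc.data-platform-airflow resources:4 removed:0 size:7.7kB cached:0/4"}',
    '{"level":"info","time":"2024-06-18T07:55:02.442278Z","scope":"delta","msg":"EDS: PUSH INC for node:airflow-scheduler-6676c4688-kd4cc.data-platform-airflow resources:0 removed:1 size:0B empty:0 cached:0/0"}',
]

for node, events in pushes_by_node(SAMPLE).items():
    for when, xds_type, count in events:
        print(node, when, xds_type, count)
```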

Version

1.22.1
k8s 1.28

Additional Information

What I find interesting is that it seems to be the same subset (about 6) of applications on each cluster (700 apps on the clusters). There is nothing unique about them that I'm aware of compared to the other apps (they're all built from the same Helm chart, so they have loosely the same configuration).

These clusters are completely isolated/unique by the way, and the same app is deployed on them.

howardjohn · 2024-06-18T13:57:47Z

Most likely cause of this is moving to delta XDS. Not sure why at all, but that is the main change in the area. Will need to look into it some more.

If you can get --log_output_level=delta:debug logs it could help

keithmattix · 2024-06-18T14:30:50Z

This may be related to some other NACKing issues with delta I've been looking into. I need to follow up with the Envoy folks.

Stono · 2024-06-18T14:59:45Z

@howardjohn Will add it. Annoyingly, they've self-resolved now, so we'll need to wait for it to come back!

Stono · 2024-06-18T16:27:48Z

@howardjohn

❯ istioctl proxy-status | grep STALE
ingress-nginx-internal-controller-784ddf9b7b-657t4.ingress-nginx                          Kubernetes     SYNCED     SYNCED     SYNCED     STALE (Never Acknowledged)     NOT SENT     istiod-866f8dd479-2qfdq     1.22.1
ingress-nginx-internal-controller-784ddf9b7b-7zgcg.ingress-nginx                          Kubernetes     SYNCED     SYNCED     SYNCED     STALE (Never Acknowledged)     NOT SENT     istiod-866f8dd479-4xgq9     1.22.1
rabbitmq-0.rabbitmq                                                                       Kubernetes     SYNCED     SYNCED     SYNCED     STALE (Never Acknowledged)     NOT SENT     istiod-866f8dd479-4xgq9     1.22.1
rabbitmq-1.rabbitmq                                                                       Kubernetes     SYNCED     SYNCED     SYNCED     STALE (Never Acknowledged)     NOT SENT     istiod-866f8dd479-27k6w     1.22.1
rabbitmq-2.rabbitmq                                                                       Kubernetes     SYNCED     SYNCED     SYNCED     STALE (Never Acknowledged)     NOT SENT     istiod-866f8dd479-2qfdq     1.22.1

~/git/autotrader
❯ cat pilot.json|grep -i rabbitmq-0
{"level":"debug","time":"2024-06-18T16:21:19.925566Z","scope":"delta","msg":"Skipping push to rabbitmq-0.rabbitmq-17, no updates required"}
{"level":"debug","time":"2024-06-18T16:21:20.894944Z","scope":"delta","msg":"Skipping push to rabbitmq-0.rabbitmq-17, no updates required"}
{"level":"debug","time":"2024-06-18T16:21:30.568059Z","scope":"delta","msg":"Skipping push to rabbitmq-0.rabbitmq-17, no updates required"}
{"level":"debug","time":"2024-06-18T16:21:39.938825Z","scope":"delta","msg":"Skipping push to rabbitmq-0.rabbitmq-17, no updates required"}
{"level":"debug","time":"2024-06-18T16:21:40.902317Z","scope":"delta","msg":"Skipping push to rabbitmq-0.rabbitmq-17, no updates required"}
{"level":"debug","time":"2024-06-18T16:26:16.249571Z","scope":"delta","msg":"Skipping push to rabbitmq-0.rabbitmq-17, no updates required"}
{"level":"debug","time":"2024-06-18T16:26:20.757449Z","scope":"delta","msg":"Skipping push to rabbitmq-0.rabbitmq-17, no updates required"}
{"level":"debug","time":"2024-06-18T16:26:51.153676Z","scope":"delta","msg":"Skipping push to rabbitmq-0.rabbitmq-17, no updates required"}

Stono · 2024-06-18T16:28:54Z

It's always the same workloads, in each environment. There's got to be some sort of correlating factor, but we can't spot it.

Stono · 2024-06-18T16:48:05Z

I'll run with ISTIO_DELTA_XDS=false to rule delta in or out

Stono · 2024-06-18T16:48:45Z

Argh, scratch that: I have to do it on the proxies and not pilot?
Is there any way to disable delta at the pilot level?

howardjohn · 2024-06-18T17:15:35Z

No, it needs to be on the proxies. But you can configure that cluster wide in meshConfig as defaultConfig.proxyMetadata
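As a rough illustration of the mesh-wide setting described above, the sketch below builds the `meshConfig` fragment and prints it as JSON (which YAML parsers also accept). The helper name `delta_xds_overlay` is hypothetical, and where the fragment belongs (for example in an IstioOperator spec or in Helm values) depends on how Istio was installed, so treat the placement as an assumption. `ISTIO_DELTA_XDS` is the variable named earlier in this thread.

```python
import json


def delta_xds_overlay(enabled: bool) -> dict:
    """Build a meshConfig fragment that sets ISTIO_DELTA_XDS for every proxy."""
    return {
        "meshConfig": {
            "defaultConfig": {
                # proxyMetadata values become environment variables on the
                # proxy, so the boolean is rendered as the string "true"/"false".
                "proxyMetadata": {
                    "ISTIO_DELTA_XDS": "true" if enabled else "false"
                }
            }
        }
    }


print(json.dumps(delta_xds_overlay(False), indent=2))
```

Since the setting is read by the proxy rather than by pilot, workloads that are already running would presumably need a restart before it takes effect.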

Stono · 2024-06-18T17:30:26Z

Does meshConfig.proxyMetadata merge with any pod level proxyMetadata?

Feature request for flags like this: it'd be really great if they were control plane level. Historically we've always been able to configure push configuration there. If this were a production-impacting incident, it'd be much nicer for us to be able to opt out of the behaviour at the pilot level than to redeploy thousands of workloads. The fact that there's even an opt-out signifies that there was enough concern it might break something.

howardjohn · 2024-06-18T17:42:40Z

In this case it cannot really be controlled at the control plane level since the protocol is initiated by the proxy

Stono · 2024-06-18T18:49:26Z

Ah, OK.

Stono · 2024-06-20T06:35:13Z

If it helps, the one thing I've noticed is that it happens during the day, when there's more cluster churn (things being deployed). Overnight, it goes away. So it's certainly related to deployments (not materially changing the Istio spec, just new pods) causing config pushes.

howardjohn · 2024-07-24T14:27:43Z

For full clarity: in Slack discussions with @Stono, it came up that this was happening on 1.22 even without delta, but at a much lower frequency. The fix in #52278 impacts only delta, and the same bug cannot occur in SotW mode. Will follow up on that when more info is available.
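The fix referenced above is titled "pilot: fix treating spontaneous request as ACK". As background, here is a minimal sketch of how a delta xDS request can be classified from its nonce and error fields, following the xDS protocol description: a request that replies to a response carries that response's nonce, while a spontaneous request (for example a subscription change) carries none. This is not istiod's implementation; `DeltaRequest`, `classify`, and the `STALE_NONCE` label are hypothetical names for illustration.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class DeltaRequest:
    """Hypothetical stand-in for the delta request fields used below."""
    type_url: str
    response_nonce: str = ""
    error_detail: Optional[str] = None


def classify(req: DeltaRequest, last_sent_nonce: str) -> str:
    """Classify a request relative to the last response sent for its type."""
    if not req.response_nonce:
        # No nonce: the proxy is not replying to any response.
        return "SPONTANEOUS"
    if req.response_nonce != last_sent_nonce:
        # Reply to an older response; a newer one is still outstanding.
        return "STALE_NONCE"
    if req.error_detail is not None:
        return "NACK"
    return "ACK"


RDS = "type.googleapis.com/envoy.config.route.v3.RouteConfiguration"

print(classify(DeltaRequest(RDS), last_sent_nonce="n2"))
print(classify(DeltaRequest(RDS, response_nonce="n2"), last_sent_nonce="n2"))
print(classify(DeltaRequest(RDS, response_nonce="n2", error_detail="bad route"), last_sent_nonce="n2"))
```

Only a request in the last two categories says anything about whether the proxy accepted the most recent push, which is the state the STALE status in `istioctl proxy-status` is derived from.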

istio-policy-bot added the area/upgrade label Jun 18, 2024

This was referenced Jun 21, 2024

stale proxies when ISTIO_DELTA_XDS is true #51165

Closed

Ability to control envoy log level for a workload, at runtime #51703

Open

Stono mentioned this issue Jul 1, 2024

Sporadic 504 timeouts since 1.22 upgrade #51660

Open

2 tasks

howardjohn added this to the 1.23 milestone Jul 18, 2024

howardjohn self-assigned this Jul 23, 2024

howardjohn mentioned this issue Jul 23, 2024

pilot: fix treating spontaneous request as ACK #52278

Merged

istio-testing closed this as completed in #52278 Jul 24, 2024
