RDS Stale since 1.22.1 upgrade #51612

Closed
2 tasks done
Stono opened this issue Jun 18, 2024 · 13 comments · Fixed by #52278

@Stono
Contributor

Stono commented Jun 18, 2024

Is this the right place to submit this?

  • This is not a security vulnerability or a crashing bug
  • This is not a question about how to use Istio

Bug Description

Hi,
I've upgraded some of our clusters from 1.21 to 1.22.1 today, and our alerts picked up RDS marked as stale:

❯ istioctl proxy-status | grep STALE
airflow-scheduler-6676c4688-kd4cc.data-platform-airflow                                                Kubernetes     SYNCED     SYNCED     SYNCED     STALE (Never Acknowledged)     NOT SENT     istiod-b59c9fff9-hxs5r     1.22.1
buildabot-686b968545-d7f7k.buildabot                                                                   Kubernetes     SYNCED     SYNCED     SYNCED     STALE (Never Acknowledged)     NOT SENT     istiod-b59c9fff9-xrqr6     1.22.1
buildabot-686b968545-qn4wc.buildabot                                                                   Kubernetes     SYNCED     SYNCED     SYNCED     STALE (Never Acknowledged)     NOT SENT     istiod-b59c9fff9-bldvv     1.22.1
rabbitmq-0.rabbitmq                                                                                    Kubernetes     SYNCED     SYNCED     SYNCED     STALE (Never Acknowledged)     NOT SENT     istiod-b59c9fff9-hxs5r     1.22.1
rabbitmq-1.rabbitmq                                                                                    Kubernetes     SYNCED     SYNCED     SYNCED     STALE (Never Acknowledged)     NOT SENT     istiod-b59c9fff9-xrqr6     1.22.1
rabbitmq-2.rabbitmq                                                                                    Kubernetes     SYNCED     SYNCED     SYNCED     STALE (Never Acknowledged)     NOT SENT     istiod-b59c9fff9-hxs5r     1.22.1

Istio docs say: "STALE means that Pilot has sent an update to Envoy but has not received an acknowledgement. This usually indicates a networking issue between Envoy and Pilot or a bug with Istio itself."
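
To track which proxies and which xDS types are affected over time, the `proxy-status` output can be parsed rather than grepped. This is a minimal Python sketch, assuming the column order seen in the rows above (name, cluster, CDS, LDS, EDS, RDS, ECDS, istiod, version) and that columns are separated by two or more spaces. The function names are illustrative and not part of Istio.

```python
import re

# Assumed column order of `istioctl proxy-status` for this Istio version.
COLUMNS = ("name", "cluster", "cds", "lds", "eds", "rds", "ecds", "istiod", "version")


def parse_proxy_status(text):
    """Yield one dict per proxy row; the header and malformed rows are skipped."""
    for line in text.splitlines():
        # Columns are padded with runs of spaces, while values such as
        # "STALE (Never Acknowledged)" contain only single spaces.
        cells = re.split(r"\s{2,}", line.strip())
        if len(cells) != len(COLUMNS) or cells[0] == "NAME":
            continue
        yield dict(zip(COLUMNS, cells))


def stale_by_type(text):
    """Map each xDS type to the proxies whose status starts with STALE."""
    stale = {}
    for row in parse_proxy_status(text):
        for xds in ("cds", "lds", "eds", "rds", "ecds"):
            if row[xds].startswith("STALE"):
                stale.setdefault(xds, []).append(row["name"])
    return stale
```

Passing the captured output of `istioctl proxy-status` to `stale_by_type` gives a per-type list of proxy names, which makes it easier to compare the affected set between clusters or between runs.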

There's nothing telling in the istiod logs, or the proxies for these apps:

❯ kis logs -l istio=pilot --tail=10000000 | grep airflow-scheduler
{"level":"info","time":"2024-06-18T07:55:02.442241Z","scope":"delta","msg":"CDS: PUSH for node:airflow-scheduler-6676c4688-kd4cc.data-platform-airflow resources:3 removed:1 size:717B"}
{"level":"info","time":"2024-06-18T07:55:02.442278Z","scope":"delta","msg":"EDS: PUSH INC for node:airflow-scheduler-6676c4688-kd4cc.data-platform-airflow resources:0 removed:1 size:0B empty:0 cached:0/0"}
{"level":"info","time":"2024-06-18T07:55:02.442808Z","scope":"delta","msg":"LDS: PUSH for node:airflow-scheduler-6676c4688-kd4cc.data-platform-airflow resources:6 removed:1 size:55.8kB"}
{"level":"info","time":"2024-06-18T07:55:02.443387Z","scope":"delta","msg":"RDS: PUSH for node:airflow-scheduler-6676c4688-kd4cc.data-platform-airflow resources:4 removed:0 size:7.7kB cached:0/4"}
{"level":"info","time":"2024-06-18T07:55:06.791285Z","scope":"delta","msg":"CDS: PUSH for node:airflow-scheduler-6676c4688-kd4cc.data-platform-airflow resources:13 removed:0 size:10.9kB cached:10/10"}
{"level":"info","time":"2024-06-18T07:55:06.791790Z","scope":"delta","msg":"LDS: PUSH for node:airflow-scheduler-6676c4688-kd4cc.data-platform-airflow resources:6 removed:0 size:55.8kB"}
{"level":"info","time":"2024-06-18T07:55:06.797804Z","scope":"delta","msg":"RDS: PUSH for node:airflow-scheduler-6676c4688-kd4cc.data-platform-airflow resources:4 removed:0 size:7.7kB cached:0/4"}
{"level":"info","time":"2024-06-18T02:23:09.373215Z","scope":"delta","msg":"ADS: new delta connection for node:airflow-scheduler-6676c4688-djdqp.data-platform-airflow-896"}
{"level":"info","time":"2024-06-18T02:23:09.373489Z","scope":"delta","msg":"CDS: PUSH request for node:airflow-scheduler-6676c4688-djdqp.data-platform-airflow resources:13 removed:0 size:10.9kB cached:8/10"}
{"level":"info","time":"2024-06-18T02:23:09.393717Z","scope":"delta","msg":"EDS: PUSH request for node:airflow-scheduler-6676c4688-djdqp.data-platform-airflow resources:8 removed:0 size:1.8kB empty:0 cached:6/8 filtered:0"}
{"level":"info","time":"2024-06-18T02:23:09.451247Z","scope":"delta","msg":"LDS: PUSH request for node:airflow-scheduler-6676c4688-djdqp.data-platform-airflow resources:6 removed:0 size:55.8kB"}
{"level":"info","time":"2024-06-18T02:23:09.461756Z","scope":"delta","msg":"RDS: PUSH request for node:airflow-scheduler-6676c4688-djdqp.data-platform-airflow resources:3 removed:0 size:7.7kB cached:0/3 filtered:0"}
{"level":"info","time":"2024-06-18T02:23:13.541136Z","scope":"delta","msg":"ADS: \"10.198.14.251:51216\" airflow-scheduler-6676c4688-z2jq4.data-platform-airflow-437 terminated"}
{"level":"info","time":"2024-06-18T02:24:05.717016Z","scope":"delta","msg":"CDS: PUSH for node:airflow-scheduler-6676c4688-djdqp.data-platform-airflow resources:13 removed:0 size:10.9kB cached:8/10"}
{"level":"info","time":"2024-06-18T02:24:05.717131Z","scope":"delta","msg":"EDS: PUSH for node:airflow-scheduler-6676c4688-djdqp.data-platform-airflow resources:8 removed:0 size:2.1kB empty:0 cached:7/8"}
{"level":"info","time":"2024-06-18T02:24:05.717680Z","scope":"delta","msg":"LDS: PUSH for node:airflow-scheduler-6676c4688-djdqp.data-platform-airflow resources:6 removed:0 size:55.8kB"}
{"level":"info","time":"2024-06-18T02:24:05.718059Z","scope":"delta","msg":"RDS: PUSH for node:airflow-scheduler-6676c4688-djdqp.data-platform-airflow resources:3 removed:0 size:7.7kB cached:0/3"}
{"level":"info","time":"2024-06-18T04:23:20.415715Z","scope":"delta","msg":"ADS: \"10.198.0.170:56930\" airflow-scheduler-6676c4688-djdqp.data-platform-airflow-896 terminated"}
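
With this many log lines it can help to tally pushes per proxy and xDS type instead of reading them one by one. This is a rough Python sketch, assuming the JSON log format and the `PUSH`, `PUSH INC` and `PUSH request` message shapes shown above. `count_pushes` and the regular expression are illustrative only.

```python
import json
import re
from collections import Counter

# Matches messages such as "RDS: PUSH for node:<node> resources:4 ...".
PUSH = re.compile(r"^(?P<xds>[A-Z]+): PUSH(?: INC| request)? for node:(?P<node>\S+) ")


def count_pushes(lines):
    """Count push log entries per (node, xDS type) from istiod JSON log lines."""
    counts = Counter()
    for line in lines:
        try:
            entry = json.loads(line)
        except json.JSONDecodeError:
            continue  # ignore anything that is not a JSON log line
        if not isinstance(entry, dict):
            continue
        match = PUSH.match(str(entry.get("msg", "")))
        if match:
            counts[(match["node"], match["xds"])] += 1
    return counts
```

Run over the lines above, this only shows what istiod sent and how often. It says nothing about acknowledgements, which do not appear in these info-level logs.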

Version

1.21.1
k8s 1.28

Additional Information

What I find interesting is that it seems to be the same subset (around 6) of applications on each cluster (700 apps on the clusters). There is nothing unique that I'm aware of compared to the other apps (they're all built from the same Helm chart, so they have loosely the same configuration).

These clusters are completely isolated/unique by the way, and the same app is deployed on them.

@istio-policy-bot istio-policy-bot added the area/upgrade Issues related to upgrades label Jun 18, 2024
@howardjohn
Member

The most likely cause of this is the move to delta xDS. Not sure why yet, but that is the main change in this area. Will need to look into it some more.

If you can get --log_output_level=delta:debug logs, it could help.

@keithmattix
Contributor

This may be related to some other NACKing issues with delta I've been looking into. I need to follow up with the Envoy folks.

@Stono
Contributor Author

Stono commented Jun 18, 2024

@howardjohn will add it. Annoyingly, they've self-resolved now, so we'll need it to come back!

@Stono
Contributor Author

Stono commented Jun 18, 2024

@howardjohn

❯ istioctl proxy-status | grep STALE
ingress-nginx-internal-controller-784ddf9b7b-657t4.ingress-nginx                          Kubernetes     SYNCED     SYNCED     SYNCED     STALE (Never Acknowledged)     NOT SENT     istiod-866f8dd479-2qfdq     1.22.1
ingress-nginx-internal-controller-784ddf9b7b-7zgcg.ingress-nginx                          Kubernetes     SYNCED     SYNCED     SYNCED     STALE (Never Acknowledged)     NOT SENT     istiod-866f8dd479-4xgq9     1.22.1
rabbitmq-0.rabbitmq                                                                       Kubernetes     SYNCED     SYNCED     SYNCED     STALE (Never Acknowledged)     NOT SENT     istiod-866f8dd479-4xgq9     1.22.1
rabbitmq-1.rabbitmq                                                                       Kubernetes     SYNCED     SYNCED     SYNCED     STALE (Never Acknowledged)     NOT SENT     istiod-866f8dd479-27k6w     1.22.1
rabbitmq-2.rabbitmq                                                                       Kubernetes     SYNCED     SYNCED     SYNCED     STALE (Never Acknowledged)     NOT SENT     istiod-866f8dd479-2qfdq     1.22.1

~/git/autotrader
❯ cat pilot.json|grep -i rabbitmq-0
{"level":"debug","time":"2024-06-18T16:21:19.925566Z","scope":"delta","msg":"Skipping push to rabbitmq-0.rabbitmq-17, no updates required"}
{"level":"debug","time":"2024-06-18T16:21:20.894944Z","scope":"delta","msg":"Skipping push to rabbitmq-0.rabbitmq-17, no updates required"}
{"level":"debug","time":"2024-06-18T16:21:30.568059Z","scope":"delta","msg":"Skipping push to rabbitmq-0.rabbitmq-17, no updates required"}
{"level":"debug","time":"2024-06-18T16:21:39.938825Z","scope":"delta","msg":"Skipping push to rabbitmq-0.rabbitmq-17, no updates required"}
{"level":"debug","time":"2024-06-18T16:21:40.902317Z","scope":"delta","msg":"Skipping push to rabbitmq-0.rabbitmq-17, no updates required"}
{"level":"debug","time":"2024-06-18T16:26:16.249571Z","scope":"delta","msg":"Skipping push to rabbitmq-0.rabbitmq-17, no updates required"}
{"level":"debug","time":"2024-06-18T16:26:20.757449Z","scope":"delta","msg":"Skipping push to rabbitmq-0.rabbitmq-17, no updates required"}
{"level":"debug","time":"2024-06-18T16:26:51.153676Z","scope":"delta","msg":"Skipping push to rabbitmq-0.rabbitmq-17, no updates required"}

@Stono
Contributor Author

Stono commented Jun 18, 2024

Screenshot 2024-06-18 at 17 28 04

It's always the same workloads, in each environment. There's got to be some sort of correlating factor, but we can't spot it.

@Stono
Contributor Author

Stono commented Jun 18, 2024

I'll run with ISTIO_DELTA_XDS=false to rule delta in or out

@Stono
Contributor Author

Stono commented Jun 18, 2024

Argh, scratch that. I have to do it on the proxies and not pilot?
Is there any way to disable delta at the pilot level?

@howardjohn
Member

No, it needs to be on the proxies. But you can configure that cluster-wide in meshConfig as defaultConfig.proxyMetadata.
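
As an illustration of that suggestion, a mesh-wide setting might look like the fragment below. It is a sketch that assumes the mesh is installed through an `IstioOperator` resource; other install methods set the same `meshConfig` keys in their own way. The variable name and value are the ones mentioned earlier in this thread.

```yaml
# Hypothetical IstioOperator overlay; adapt to however the mesh is installed.
apiVersion: install.istio.io/v1alpha1
kind: IstioOperator
spec:
  meshConfig:
    defaultConfig:
      proxyMetadata:
        # Passed to each sidecar as an environment variable at injection time,
        # so existing pods only pick it up once they are recreated.
        ISTIO_DELTA_XDS: "false"
```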

@Stono
Contributor Author

Stono commented Jun 18, 2024

Does meshConfig.proxyMetadata merge with any pod-level proxyMetadata?

Feature request for flags like this: it'd be really great if they were control-plane level. Historically we've always been able to configure push configuration there. If this were a production-impacting incident, it'd be much nicer for us to be able to opt out of the behaviour at the pilot level than to redeploy 1000s of workloads. The fact that there's even an opt-out signifies that there was enough concern that it might break something.

@howardjohn
Member

In this case it cannot really be controlled at the control plane level, since the protocol is initiated by the proxy.

@Stono
Contributor Author

Stono commented Jun 18, 2024

Ah, OK.

@Stono
Contributor Author

Stono commented Jun 20, 2024

If it helps, the one thing I've noticed is that it happens during the day, when there's more cluster churn (things being deployed). Overnight, it goes away. So it's certainly related to deployments (not materially changing the Istio spec, just new pods) causing config pushes.

@howardjohn
Member

For full clarity: in Slack discussions with @Stono, it came up that this was happening on 1.22 even without delta, but at a much lower frequency. The fix in #52278 impacts only delta, and the same bug cannot occur in SotW mode. Will follow up on that when more info is available.
