argo fails to list workflows when there is a big workflow in the list (more than ~30,000 steps/nodes) #6206
Comments
The Argo API Server is configured with 104857600 (100 MB) as the max message size. We could parameterize it as an environment variable in argo-server. Would you be interested in contributing this enhancement?
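For illustration, a minimal sketch of that kind of parameterization, assuming an environment variable is read at startup; the variable name GRPC_MESSAGE_SIZE and the helper below are assumptions, not the actual argo-server code:

// Sketch only: read the max gRPC message size from an environment variable,
// falling back to the current hard-coded 100 MB default when it is unset or
// not a positive integer. The variable name here is an assumption.
package main

import (
	"fmt"
	"os"
	"strconv"
)

const defaultMaxGRPCMessageSize = 100 * 1024 * 1024 // 104857600 bytes

func maxGRPCMessageSize() int {
	if v, ok := os.LookupEnv("GRPC_MESSAGE_SIZE"); ok {
		if n, err := strconv.Atoi(v); err == nil && n > 0 {
			return n
		}
	}
	return defaultMaxGRPCMessageSize
}

func main() {
	fmt.Println("max gRPC message size:", maxGRPCMessageSize())
}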
We could also consider stripping managedFields from the API as per #5778.
Oh - I think I made a performance improvement to the UI for something similar: #6004, which was ported to v3.1.
@dpeer6 can you please test in v3.1?
Hi guys, thanks for your time. I have a few questions:
Let me know how your testing goes!
Hi, our testing with 3.1.1 showed an improvement in the UI: it did not fail, crash, or time out when we ran 4-5 workflows at the same time with 20k nodes/steps/pods each. The CLI, however, still returned the same error when running "argo list", and at some point later I also got the same reply when using the CLI ("argo list"). I managed to add support in argoserver.go for setting MaxGRPCMessageSize via an environment variable on the argo-server deployment: if it exists, use it, otherwise fall back to the default of 100 MB. The test above ran with no issues once this functionality was added, with both the UI and the CLI. If you think this is the right direction, I can open a PR or share the diff, as long as our use case can be supported. Many thanks!
"argo list" is likely hydrating workflows (i.e. going to the database to get the node status) when it does not need to. Like the UI, it should pass the "fields" parameter to prevent this. This should be a one-line change here: argo-workflows/cmd/argo/commands/list.go Line 112 in 4321259
Should be like this:
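A rough sketch of the idea, assuming the generated list request accepts a fields selector the way the UI's call does; the selector value and the Fields member named in the comments below are illustrative, not copied from the repository:

// Sketch only: restrict the server response to the columns "argo list"
// actually prints, so the node status of huge workflows is never marshalled.
package main

import "fmt"

// Illustrative field selector, similar in spirit to the one the UI sends.
const listFields = "metadata,items.metadata,items.spec,items.status.phase," +
	"items.status.startedAt,items.status.finishedAt,items.status.progress"

func main() {
	// In cmd/argo/commands/list.go the list request would carry this value,
	// e.g. roughly: WorkflowListRequest{..., Fields: listFields}.
	fmt.Println(listFields)
}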
It'd be great if you want to submit a PR, and I expect we can then backport it to v3.1.2.
Cool, then!
Hi, |
I updated the PR using the init func. |
Summary
When there is a big workflow, in any state, "argo list" fails with a message like:
rpc error: code = ResourceExhausted desc = grpc: trying to send message larger than max (238029029 vs. 104857600)
We are running many workflows, some of them big, with many steps/pods running, but when there is even one with more than 25k items in the progress status, argo fails to list the current workflows in the namespace... sometimes the UI also fails to list them.
I can run the following command on other workflows, but not on the big one:
"argo get/delete/terminate {workflow}"
It does not happen when I use kubectl to get the current workflows, for example:
"kubectl get workflow -n argo"
I can also get the workflow progress status, running:
"kubectl get workflow {workflow} -n argo -o json | jq .status.progress"
General info:
We use a podGC policy, so completed pods are deleted.
We archive completed workflows and offload them immediately to a PostgreSQL DB.
We delete workflows every few hours so the workflow list is kept short.
We use S3 to save Argo workflow logs and artifacts.
What happened/what you expected to happen?
I would expect the argo CLI not to fail to list Argo workflows.
I would expect the argo CLI not to fail to get a big Argo workflow, supporting paging or some other method.
We are very happy with Argo and would like to understand whether the message size limit can be adjusted somehow to support huge workflows.
Diagnostics
👀 Yes! We need all of your diagnostics, please make sure you add it all, otherwise we'll go around in circles asking you for it:
What Kubernetes provider are you using?
EKS 1.18
What version of Argo Workflows are you running?
3.0.7
What executor are you running? Docker/K8SAPI/Kubelet/PNS/Emissary
Docker
Did this work in a previous version? I.e. is it a regression?
N/A
Are you pasting thousands of log lines? That's too much information.
Message from the maintainers:
Impacted by this bug? Give it a 👍. We prioritise the issues with the most 👍.