argo fails to list workflows when there is a big workflow in the list (more than ~30,000 steps/nodes) #6206

Closed
dpeer6 opened this issue Jun 24, 2021 · 12 comments · Fixed by #6420

Comments

@dpeer6
Contributor

dpeer6 commented Jun 24, 2021

Summary

When there is a big workflow, in any state, "argo list" fails with a message similar to:
rpc error: code = ResourceExhausted desc = grpc: trying to send message larger than max (238029029 vs. 104857600)

We run many workflows, some of them large, with many steps/pods. When even a single workflow has more than ~25,000 items in its progress status, argo fails to list the current workflows in the namespace; sometimes the UI also fails to list them.

I can run the following commands on other workflows, but not on the big one:
"argo get/delete/terminate {workflow}"

It does not happen when I use kubectl to get the current workflows, for example:
"kubectl get workflow -n argo"

I can also get the workflow's progress status by running:
"kubectl get workflow {workflow} -n argo -o json | jq .status.progress"

General info:
We use a podGC policy, so completed pods are deleted.
We archive and offload completed workflows immediately to a PostgreSQL DB.
We delete workflows every few hours so the workflow list is kept short.

We use S3 to store Argo workflow logs and artifacts.

What happened/what you expected to happen?
I would expect the argo CLI not to fail to list workflows.
I would expect the argo CLI not to fail to get a big workflow, by supporting paging or some other method.

We are very happy with Argo and would like to understand whether the message size limit can be adjusted to support huge workflows.

Diagnostics

👀 Yes! We need all of your diagnostics, please make sure you add it all, otherwise we'll go around in circles asking you for it:

What Kubernetes provider are you using?
EKS 1.18

What version of Argo Workflows are you running?
3.0.7

What executor are you running? Docker/K8SAPI/Kubelet/PNS/Emissary
Docker

Did this work in a previous version? I.e. is it a regression?
N/A

Are you pasting thousands of log lines? That's too much information.

# Either a workflow that reproduces the bug, or paste your whole workflow YAML, including status, something like:
kubectl get wf -o yaml ${workflow}

# Logs from the workflow controller:
kubectl logs -n argo deploy/workflow-controller | grep ${workflow}

Sorry, but the log is huge and contains only info messages that all look like normal, proper operation.

# The workflow's pods that are problematic:
kubectl get pod -o yaml -l workflows.argoproj.io/workflow=${workflow},workflow.argoproj.io/phase!=Succeeded

# Logs from in your workflow's wait container, something like:
kubectl logs -c wait -l workflows.argoproj.io/workflow=${workflow},workflow.argoproj.io/phase!=Succeeded

Message from the maintainers:

Impacted by this bug? Give it a 👍. We prioritise the issues with the most 👍.

@sarabala1979
Member

The Argo API server is configured with 104857600 bytes as the max message size. We can parameterize it as an environment variable in argo-server. Would you be interested in contributing this enhancement?

MaxGRPCMessageSize = 100 * 1024 * 1024
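
For context, a limit like this typically takes effect where the gRPC server is constructed. Here is a minimal sketch using grpc-go's standard server options (the wiring is illustrative, not the actual argo-server source):

package main

import "google.golang.org/grpc"

// MaxGRPCMessageSize mirrors the 100 MiB cap quoted above (104857600 bytes).
const MaxGRPCMessageSize = 100 * 1024 * 1024

// newServer applies the cap in both directions; a response larger than the
// send limit fails with ResourceExhausted, matching the error in this issue.
func newServer() *grpc.Server {
	return grpc.NewServer(
		grpc.MaxRecvMsgSize(MaxGRPCMessageSize),
		grpc.MaxSendMsgSize(MaxGRPCMessageSize),
	)
}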

@alexec alexec added this to the v3.0 milestone Jun 24, 2021
@alexec
Contributor

alexec commented Jun 24, 2021

We could also consider stripping managedFields from the API as per #5778.

@alexec
Contributor

alexec commented Jun 24, 2021

Oh, I think I made a performance improvement to the UI for something similar: #6004, which was ported to v3.1.

@alexec
Contributor

alexec commented Jun 24, 2021

@dpeer6 can you please test in v3.1?

@dpeer6
Contributor Author

dpeer6 commented Jun 27, 2021

Hi Guys,

Thanks for your time.

I have a few questions:

  1. Yes, I can test it with 3.1, but UI fixes won't fix the argo CLI issue, right?
  2. If I build the argo server with a larger message size, will the CLI/UI be able to handle it? Will it properly support paginated data fetching or something similar?
  3. Yes, I would love to contribute the code to support a larger message size, but I would like to implement it as an argument to the argo server at startup (using an env variable or something similar). I'm not sure that's currently possible without a massive change, given the fact that I need to learn Go :-)

@alexec
Contributor

alexec commented Jun 27, 2021

Let me know how your testing goes!

@dpeer6
Contributor Author

dpeer6 commented Jun 29, 2021

Hi,

Our testing with 3.1.1 showed improvement in the UI: it did not fail, crash, or time out when we ran 4-5 workflows at the same time with 20k nodes/steps/pods each. The CLI, however, returned the same error when running "argo list":
FATA[2021-06-29T12:07:24.513Z] rpc error: code = ResourceExhausted desc = grpc: trying to send message larger than max (214553096 vs. 104857600)

At some point later I also got this response from the CLI ("argo list"):
FATA[2021-06-29T12:19:21.808Z] Get "http:https://{out-aws-eks-load_balancer}:port/api/v1/workflows/argo?": EOF

I managed to add support in argoserver.go for setting MaxGRPCMessageSize via an env variable on the argo-server deployment: if it exists, use it; if not, use the default of 100 MB.
I will have to drop the constant definition and use a func instead to support that.
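
A minimal sketch of that approach, assuming the env variable is named GRPC_MESSAGE_SIZE (the name and exact wiring here are illustrative):

package main

import (
	"os"
	"strconv"
)

// maxGRPCMessageSize replaces the old constant with a func, as described above:
// read the limit from an env variable if set, otherwise fall back to 100 MB.
// GRPC_MESSAGE_SIZE is an assumed name for illustration.
func maxGRPCMessageSize() int {
	if v, ok := os.LookupEnv("GRPC_MESSAGE_SIZE"); ok {
		if n, err := strconv.Atoi(v); err == nil && n > 0 {
			return n
		}
	}
	return 100 * 1024 * 1024 // default: 104857600 bytes
}

The deployment would then set the variable on the argo-server container, e.g. GRPC_MESSAGE_SIZE=262144000 for a 250 MB limit.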

With this change in place, our test from above worked with no issues in both the UI and the CLI.

If you think this is the right direction, I can open a PR or share the diff, as long as our use case can be supported.
We definitely need to support big workflows, which might scale up to hundreds of instances (sometimes even more) and thousands of pods; if we can avoid building Argo from source, that would be great.

Many thanks!
David

@alexec
Contributor

alexec commented Jun 29, 2021

“argo list” is likely hydrating workflows (i.e. going to the database to get the node status) when it does not need to. Like the UI, it should pass the “fields” parameter to prevent this.

This should be a one line change here:

wfList, err := serviceClient.ListWorkflows(ctx, &workflowpkg.WorkflowListRequest{Namespace: flags.namespace, ListOptions: listOpts})

Should be like this:

@alexec
Contributor

alexec commented Jun 29, 2021

It’d be great if you want to submit a PR, and I expect we can then backport it to v3.1.2.

@dpeer6
Contributor Author

dpeer6 commented Jun 29, 2021

Cool, then!
I will work on it tomorrow; I hope you will find it suitable.

@sarabala1979 sarabala1979 modified the milestones: v3.0, v3.2 Jun 30, 2021
@dpeer6
Contributor Author

dpeer6 commented Jun 30, 2021

Hi,
I created the PR, hope it's ok.

@dpeer6
Contributor Author

dpeer6 commented Jul 21, 2021

I updated the PR using the init func.
