argo fails to list workflows when there is a big workflow in the list (more than ~30,000 steps/nodes) #6206

Closed
dpeer6 opened this issue Jun 24, 2021 · 12 comments · Fixed by #6420

Comments

@dpeer6
Contributor

dpeer6 commented Jun 24, 2021

Summary

When there is a big workflow, in any state, "argo list" fails with a message similar to:
rpc error: code = ResourceExhausted desc = grpc: trying to send message larger than max (238029029 vs. 104857600)

We run many workflows, some of them large, with many steps/pods. When even a single workflow has more than ~25,000 items in its progress status, argo fails to list the current workflows in the namespace; sometimes the UI also fails to list them.

I can run the following commands on other workflows, but not on the big one:
"argo get/delete/terminate {workflow}"

It does not happen when I use kubectl to get the current workflows, for example:
"kubectl get workflow -n argo"

I can also get the workflow's progress status by running:
"kubectl get workflow {workflow} -n argo -o json | jq .status.progress"

General info:
We use a podGC policy, so completed pods are deleted.
We archive and offload completed workflows immediately to a PostgreSQL DB.
We delete workflows every few hours so the workflow list is kept short.

We use S3 to store Argo workflow logs and artifacts.

What happened/what you expected to happen?
I would expect the argo CLI not to fail to list workflows.
I would expect the argo CLI not to fail to get a big workflow, by supporting paging or some other method.

We are very happy with Argo and would like to understand whether the message size limit can be adjusted to support huge workflows.

Diagnostics

👀 Yes! We need all of your diagnostics, please make sure you add it all, otherwise we'll go around in circles asking you for it:

What Kubernetes provider are you using?
EKS 1.18

What version of Argo Workflows are you running?
3.0.7

What executor are you running? Docker/K8SAPI/Kubelet/PNS/Emissary
Docker

Did this work in a previous version? I.e. is it a regression?
N/A

Are you pasting thousands of log lines? That's too much information.

# Either a workflow that reproduces the bug, or paste your whole workflow YAML, including status, something like:
kubectl get wf -o yaml ${workflow}

# Logs from the workflow controller:
kubectl logs -n argo deploy/workflow-controller | grep ${workflow}

Sorry, but the log is huge and contains only info messages that all look like normal, proper operation.

# The workflow's pods that are problematic:
kubectl get pod -o yaml -l workflows.argoproj.io/workflow=${workflow},workflow.argoproj.io/phase!=Succeeded

# Logs from in your workflow's wait container, something like:
kubectl logs -c wait -l workflows.argoproj.io/workflow=${workflow},workflow.argoproj.io/phase!=Succeeded

Message from the maintainers:

Impacted by this bug? Give it a 👍. We prioritise the issues with the most 👍.

@sarabala1979
Member

The Argo API server is configured with 104857600 bytes as the max message size. We can parameterize it as an environment variable in argo-server. Would you be interested in contributing this enhancement?

MaxGRPCMessageSize = 100 * 1024 * 1024
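
For context, a limit like this typically takes effect where the gRPC server is constructed. Here is a minimal sketch using grpc-go's standard server options (the wiring is illustrative, not the actual argo-server source):

package main

import "google.golang.org/grpc"

// MaxGRPCMessageSize mirrors the 100 MiB cap quoted above (104857600 bytes).
const MaxGRPCMessageSize = 100 * 1024 * 1024

// newServer applies the cap in both directions; a response larger than the
// send limit fails with ResourceExhausted, matching the error in this issue.
func newServer() *grpc.Server {
	return grpc.NewServer(
		grpc.MaxRecvMsgSize(MaxGRPCMessageSize),
		grpc.MaxSendMsgSize(MaxGRPCMessageSize),
	)
}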

@alexec alexec added this to the v3.0 milestone Jun 24, 2021
@alexec
Contributor

alexec commented Jun 24, 2021

We could also consider stripping managedFields from the API as per #5778.

@alexec
Contributor

alexec commented Jun 24, 2021

Oh, I think I made a performance improvement to the UI for something similar: #6004, which was ported to v3.1.

@alexec
Contributor

alexec commented Jun 24, 2021

@dpeer6 can you please test in v3.1?

@dpeer6
Contributor Author

dpeer6 commented Jun 27, 2021

Hi Guys,

Thanks for your time.

I have a few questions:

  1. Yes, I can test it with 3.1, but UI fixes won't fix the argo CLI issue, right?
  2. If I build the argo server with a larger message size, will the CLI/UI be able to handle it? Will it properly support paginated data fetching or something similar?
  3. Yes, I would love to contribute the code to support a larger message size, but I would like to implement it as an argument to the argo server at startup (using an env variable or something similar). I'm not sure that's currently possible without a massive change, given the fact that I need to learn Go :-)

@alexec
Contributor

alexec commented Jun 27, 2021

Let me know how your testing goes!

@dpeer6
Contributor Author

dpeer6 commented Jun 29, 2021

Hi,

Our testing with 3.1.1 showed improvement in the UI: it did not fail, crash, or time out when we ran 4-5 workflows at the same time with 20k nodes/steps/pods each. The CLI, however, returned the same error when running "argo list":
FATA[2021-06-29T12:07:24.513Z] rpc error: code = ResourceExhausted desc = grpc: trying to send message larger than max (214553096 vs. 104857600)

At some point later I also got this response from the CLI ("argo list"):
FATA[2021-06-29T12:19:21.808Z] Get "http:https://{out-aws-eks-load_balancer}:port/api/v1/workflows/argo?": EOF

I managed to add support in argoserver.go for setting MaxGRPCMessageSize via an env variable on the argo-server deployment: if it exists, use it; if not, use the default of 100 MB.
I will have to drop the constant definition and use a func instead to support that.
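
A minimal sketch of that approach, assuming the env variable is named GRPC_MESSAGE_SIZE (the name and exact wiring here are illustrative):

package main

import (
	"os"
	"strconv"
)

// maxGRPCMessageSize replaces the old constant with a func, as described above:
// read the limit from an env variable if set, otherwise fall back to 100 MB.
// GRPC_MESSAGE_SIZE is an assumed name for illustration.
func maxGRPCMessageSize() int {
	if v, ok := os.LookupEnv("GRPC_MESSAGE_SIZE"); ok {
		if n, err := strconv.Atoi(v); err == nil && n > 0 {
			return n
		}
	}
	return 100 * 1024 * 1024 // default: 104857600 bytes
}

The deployment would then set the variable on the argo-server container, e.g. GRPC_MESSAGE_SIZE=262144000 for a 250 MB limit.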

With this change in place, our test from above worked with no issues in both the UI and the CLI.

If you think this is the right direction, I can open a PR or share the diff, as long as our use case can be supported.
We definitely need to support big workflows, which might scale up to hundreds of instances (sometimes even more) and thousands of pods; if we can avoid building Argo from source, that would be great.

Many thanks!
David

@alexec
Contributor

alexec commented Jun 29, 2021

“argo list” is likely hydrating workflows (i.e. going to the database to get the node status) when it does not need to. Like the UI, it should pass the “fields” parameter to prevent this.

This should be a one line change here:

wfList, err := serviceClient.ListWorkflows(ctx, &workflowpkg.WorkflowListRequest{Namespace: flags.namespace, ListOptions: listOpts})

Should be like this:

@alexec
Contributor

alexec commented Jun 29, 2021

It’d be great if you want to submit a PR, and I expect we can then backport it to v3.1.2.

@dpeer6
Contributor Author

dpeer6 commented Jun 29, 2021

Cool, then!
I will work on it tomorrow; I hope you will find it suitable.

@sarabala1979 sarabala1979 modified the milestones: v3.0, v3.2 Jun 30, 2021
@dpeer6
Contributor Author

dpeer6 commented Jun 30, 2021

Hi,
I created the PR, hope it's ok.

@dpeer6
Contributor Author

dpeer6 commented Jul 21, 2021

I updated the PR using the init func.
