
argo-server repeatable crash due to out of memory on big workflows #6000

Closed · amitm02 opened this issue May 25, 2021 · 13 comments · Fixed by #6004 or #6298

Comments

@amitm02
Contributor

amitm02 commented May 25, 2021

Summary

I've allocated argo-server (v3.1.0-rc7) 2.2GB of RAM.
When a workflow reaches around ~6,000 nodes, the Argo server crashes with OOM.
After the crash, it keeps restarting due to OOM.
Using "watch" I was able to monitor argo-server's memory usage climb from 10MB to around 1.4GB before it crashed.
I understand that argo-server needs more RAM for bigger workflows, but 2.2GB seems like a pretty big allocation.
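
For reference, the "watch"-based monitoring described above can be reproduced with something along these lines (a rough sketch; the namespace and label come from the pod description further down, and metrics-server must be installed):

# Poll the argo-server pod's CPU/memory every ~2s with a timestamp prefix
while true; do
  echo "$(date)  $(kubectl top pod -n argo -l app=argo-server --no-headers)"
  sleep 2
done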

time="2021-05-25T10:20:50.474Z" level=info authModes="[server client]" baseHRef=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/ managedNamespace=argo namespace=argo secure=true
time="2021-05-25T10:20:50.474Z" level=info msg="config map" name=workflow-controller-configmap
time="2021-05-25T10:20:50.474Z" level=info msg="SSO disabled"
time="2021-05-25T10:20:50.486Z" level=info msg="Starting Argo Server" instanceID= version=v3.1.0-rc7
time="2021-05-25T10:20:50.486Z" level=info msg="Creating DB session"
time="2021-05-25T10:20:50.500Z" level=info msg="Node status offloading config" ttl=5m0s
time="2021-05-25T10:20:50.501Z" level=info msg="Creating event controller" operationQueueSize=16 workerCount=4
time="2021-05-25T10:20:50.507Z" level=info msg="Argo Server started successfully on https://localhost:2746"
time="2021-05-25T10:25:50.471Z" level=info msg="Alloc=8184 TotalAlloc=18322 Sys=73553 NumGC=6 Goroutines=93"
time="2021-05-25T10:25:57.166Z" level=info msg="finished unary call with code OK" grpc.code=OK grpc.method=ListWorkflowTemplates grpc.service=workflowtemplate.WorkflowTemplateService grpc.start_time="2021-05-25T10:25:57Z" grpc.time_ms=41.138 span.kind=server system=grpc
time="2021-05-25T10:25:57.168Z" level=info msg="finished unary call with code OK" grpc.code=OK grpc.method=ListCronWorkflows grpc.service=cronworkflow.CronWorkflowService grpc.start_time="2021-05-25T10:25:57Z" grpc.time_ms=21.007 span.kind=server system=grpc
time="2021-05-25T10:25:57.263Z" level=info msg="finished unary call with code OK" grpc.code=OK grpc.method=GetInfo grpc.service=info.InfoService grpc.start_time="2021-05-25T10:25:57Z" grpc.time_ms=0.081 span.kind=server system=grpc
time="2021-05-25T10:25:57.396Z" level=info msg="finished unary call with code OK" grpc.code=OK grpc.method=ListWorkflows grpc.service=workflow.WorkflowService grpc.start_time="2021-05-25T10:25:57Z" grpc.time_ms=132.886 span.kind=server system=grpc
time="2021-05-25T10:25:57.580Z" level=info msg="finished unary call with code OK" grpc.code=OK grpc.method=GetVersion grpc.service=info.InfoService grpc.start_time="2021-05-25T10:25:57Z" grpc.time_ms=0.11 span.kind=server system=grpc
time="2021-05-25T10:26:15.762Z" level=info msg="finished unary call with code OK" grpc.code=OK grpc.method=DeleteWorkflow grpc.service=workflow.WorkflowService grpc.start_time="2021-05-25T10:26:15Z" grpc.time_ms=107.927 span.kind=server system=grpc
time="2021-05-25T10:26:16.158Z" level=info msg="finished unary call with code OK" grpc.code=OK grpc.method=DeleteWorkflow grpc.service=workflow.WorkflowService grpc.start_time="2021-05-25T10:26:15Z" grpc.time_ms=504.674 span.kind=server system=grpc
time="2021-05-25T10:26:16.160Z" level=info msg="finished unary call with code OK" grpc.code=OK grpc.method=DeleteWorkflow grpc.service=workflow.WorkflowService grpc.start_time="2021-05-25T10:26:15Z" grpc.time_ms=507.07 span.kind=server system=grpc
time="2021-05-25T10:26:16.266Z" level=info msg="finished unary call with code OK" grpc.code=OK grpc.method=DeleteWorkflow grpc.service=workflow.WorkflowService grpc.start_time="2021-05-25T10:26:16Z" grpc.time_ms=92.15 span.kind=server system=grpc
time="2021-05-25T10:26:16.273Z" level=info msg="finished unary call with code OK" grpc.code=OK grpc.method=DeleteWorkflow grpc.service=workflow.WorkflowService grpc.start_time="2021-05-25T10:26:16Z" grpc.time_ms=12.042 span.kind=server system=grpc
time="2021-05-25T10:26:16.358Z" level=info msg="finished unary call with code OK" grpc.code=OK grpc.method=DeleteWorkflow grpc.service=workflow.WorkflowService grpc.start_time="2021-05-25T10:26:16Z" grpc.time_ms=90.825 span.kind=server system=grpc
time="2021-05-25T10:26:16.359Z" level=info msg="finished unary call with code OK" grpc.code=OK grpc.method=DeleteWorkflow grpc.service=workflow.WorkflowService grpc.start_time="2021-05-25T10:26:16Z" grpc.time_ms=99.319 span.kind=server system=grpc
time="2021-05-25T10:26:16.360Z" level=info msg="finished unary call with code OK" grpc.code=OK grpc.method=DeleteWorkflow grpc.service=workflow.WorkflowService grpc.start_time="2021-05-25T10:26:16Z" grpc.time_ms=100.189 span.kind=server system=grpc
time="2021-05-25T10:26:16.360Z" level=info msg="finished unary call with code OK" grpc.code=OK grpc.method=DeleteWorkflow grpc.service=workflow.WorkflowService grpc.start_time="2021-05-25T10:26:16Z" grpc.time_ms=92.885 span.kind=server system=grpc
time="2021-05-25T10:26:16.361Z" level=info msg="finished unary call with code OK" grpc.code=OK grpc.method=DeleteWorkflow grpc.service=workflow.WorkflowService grpc.start_time="2021-05-25T10:26:16Z" grpc.time_ms=94.709 span.kind=server system=grpc
time="2021-05-25T10:26:16.362Z" level=info msg="finished unary call with code OK" grpc.code=OK grpc.method=DeleteWorkflow grpc.service=workflow.WorkflowService grpc.start_time="2021-05-25T10:26:16Z" grpc.time_ms=95.387 span.kind=server system=grpc
time="2021-05-25T10:26:16.365Z" level=info msg="finished unary call with code OK" grpc.code=OK grpc.method=DeleteWorkflow grpc.service=workflow.WorkflowService grpc.start_time="2021-05-25T10:26:16Z" grpc.time_ms=92.359 span.kind=server system=grpc
time="2021-05-25T10:26:16.366Z" level=info msg="finished unary call with code OK" grpc.code=OK grpc.method=DeleteWorkflow grpc.service=workflow.WorkflowService grpc.start_time="2021-05-25T10:26:16Z" grpc.time_ms=98.918 span.kind=server system=grpc
time="2021-05-25T10:26:16.366Z" level=info msg="finished unary call with code OK" grpc.code=OK grpc.method=DeleteWorkflow grpc.service=workflow.WorkflowService grpc.start_time="2021-05-25T10:26:16Z" grpc.time_ms=93.911 span.kind=server system=grpc
time="2021-05-25T10:26:16.608Z" level=info msg="finished streaming call with code OK" grpc.code=OK grpc.method=WatchWorkflows grpc.service=workflow.WorkflowService grpc.start_time="2021-05-25T10:25:57Z" grpc.time_ms=18921.3 span.kind=server system=grpc
time="2021-05-25T10:26:16.612Z" level=info msg="finished unary call with code OK" grpc.code=OK grpc.method=ListWorkflows grpc.service=workflow.WorkflowService grpc.start_time="2021-05-25T10:26:16Z" grpc.time_ms=4.624 span.kind=server system=grpc
time="2021-05-25T10:30:50.471Z" level=info msg="Alloc=21029 TotalAlloc=118745 Sys=74193 NumGC=15 Goroutines=100"
time="2021-05-25T10:35:50.471Z" level=info msg="Alloc=53995 TotalAlloc=2189725 Sys=143124 NumGC=106 Goroutines=100"
time="2021-05-25T10:40:50.471Z" level=info msg="Alloc=89586 TotalAlloc=6012167 Sys=211815 NumGC=209 Goroutines=100"
time="2021-05-25T10:45:50.471Z" level=info msg="Alloc=111260 TotalAlloc=9272606 Sys=347821 NumGC=262 Goroutines=100"
time="2021-05-25T10:46:45.693Z" level=info msg="finished streaming call with code Canceled" error="rpc error: code = Canceled desc = context canceled" grpc.code=Canceled grpc.method=WatchWorkflows grpc.service=workflow.WorkflowService grpc.start_time="2021-05-25T10:26:16Z" grpc.time_ms=1.228909e+06 span.kind=server system=grpc
time="2021-05-25T10:46:46.265Z" level=info msg="finished unary call with code OK" grpc.code=OK grpc.method=ListWorkflowTemplates grpc.service=workflowtemplate.WorkflowTemplateService grpc.start_time="2021-05-25T10:46:46Z" grpc.time_ms=13.126 span.kind=server system=grpc
time="2021-05-25T10:46:46.360Z" level=info msg="finished unary call with code OK" grpc.code=OK grpc.method=GetInfo grpc.service=info.InfoService grpc.start_time="2021-05-25T10:46:46Z" grpc.time_ms=0.082 span.kind=server system=grpc
time="2021-05-25T10:46:46.365Z" level=info msg="finished unary call with code OK" grpc.code=OK grpc.method=ListCronWorkflows grpc.service=cronworkflow.CronWorkflowService grpc.start_time="2021-05-25T10:46:46Z" grpc.time_ms=2.369 span.kind=server system=grpc
time="2021-05-25T10:46:46.757Z" level=info msg="finished unary call with code OK" grpc.code=OK grpc.method=GetVersion grpc.service=info.InfoService grpc.start_time="2021-05-25T10:46:46Z" grpc.time_ms=0.07 span.kind=server system=grpc
time="2021-05-25T10:47:05.796Z" level=info msg="finished unary call with code OK" grpc.code=OK grpc.method=ListWorkflows grpc.service=workflow.WorkflowService grpc.start_time="2021-05-25T10:46:46Z" grpc.time_ms=19427.557 span.kind=server system=grpc
time="2021-05-25T10:50:50.560Z" level=info msg="Alloc=223668 TotalAlloc=16597930 Sys=1448338 NumGC=337 Goroutines=101"
time="2021-05-25T10:55:50.470Z" level=info msg="Alloc=170449 TotalAlloc=30753200 Sys=1448338 NumGC=504 Goroutines=101"
time="2021-05-25T11:00:50.471Z" level=info msg="Alloc=112543 TotalAlloc=45584482 Sys=1448338 NumGC=673 Goroutines=101"
time="2021-05-25T11:05:50.471Z" level=info msg="Alloc=58310 TotalAlloc=60018368 Sys=1448338 NumGC=837 Goroutines=101"
time="2021-05-25T11:10:50.471Z" level=info msg="Alloc=230839 TotalAlloc=74778864 Sys=1448338 NumGC=978 Goroutines=101"
time="2021-05-25T11:15:50.471Z" level=info msg="Alloc=272255 TotalAlloc=89222709 Sys=1448594 NumGC=1108 Goroutines=101"
time="2021-05-25T11:20:50.471Z" level=info msg="Alloc=278787 TotalAlloc=105610555 Sys=1448850 NumGC=1222 Goroutines=101"
time="2021-05-25T11:23:12.873Z" level=error msg="finished streaming call with code Unknown" error=EOF grpc.code=Unknown grpc.method=WatchWorkflows grpc.service=workflow.WorkflowService grpc.start_time="2021-05-25T10:47:06Z" grpc.time_ms=2.1667878e+06 span.kind=server system=grpc
time="2021-05-25T11:25:50.471Z" level=info msg="Alloc=11805 TotalAlloc=110634076 Sys=1448850 NumGC=1258 Goroutines=101"
time="2021-05-25T11:30:50.471Z" level=info msg="Alloc=11591 TotalAlloc=111472443 Sys=1448850 NumGC=1268 Goroutines=101"
time="2021-05-25T11:35:50.471Z" level=info msg="Alloc=12185 TotalAlloc=111475399 Sys=1448850 NumGC=1270 Goroutines=101"
time="2021-05-25T11:40:50.471Z" level=info msg="Alloc=11587 TotalAlloc=111478318 Sys=1448850 NumGC=1273 Goroutines=101"
time="2021-05-25T11:45:50.471Z" level=info msg="Alloc=25338 TotalAlloc=111556046 Sys=1448850 NumGC=1279 Goroutines=101"
time="2021-05-25T11:50:50.471Z" level=info msg="Alloc=69397 TotalAlloc=113618824 Sys=1448850 NumGC=1361 Goroutines=101"
time="2021-05-25T11:55:50.471Z" level=info msg="Alloc=123311 TotalAlloc=116972746 Sys=1448850 NumGC=1448 Goroutines=101"
time="2021-05-25T12:00:50.471Z" level=info msg="Alloc=214425 TotalAlloc=120155411 Sys=1449106 NumGC=1496 Goroutines=101"
time="2021-05-25T12:04:52.405Z" level=error msg="finished streaming call with code Internal" error="rpc error: code = Internal desc = an error on the server (\"unable to decode an event from the watch stream: context canceled\") has prevented the request from succeeding" grpc.code=Internal grpc.method=WatchWorkflows grpc.service=workflow.WorkflowService grpc.start_time="2021-05-25T11:23:18Z" grpc.time_ms=2.4941345e+06 span.kind=server system=grpc
time="2021-05-25T12:05:50.471Z" level=info msg="Alloc=274119 TotalAlloc=126366697 Sys=1449362 NumGC=1566 Goroutines=101"
time="2021-05-25T12:05:55.820Z" level=info msg="finished streaming call with code Canceled" error="rpc error: code = Canceled desc = context canceled" grpc.code=Canceled grpc.method=WatchWorkflows grpc.service=workflow.WorkflowService grpc.start_time="2021-05-25T12:04:52Z" grpc.time_ms=63396.02 span.kind=server system=grpc
time="2021-05-25T12:05:55.872Z" level=info msg="finished unary call with code OK" grpc.code=OK grpc.method=ListWorkflowTemplates grpc.service=workflowtemplate.WorkflowTemplateService grpc.start_time="2021-05-25T12:05:55Z" grpc.time_ms=11.872 span.kind=server system=grpc
time="2021-05-25T12:05:55.872Z" level=info msg="finished unary call with code OK" grpc.code=OK grpc.method=ListCronWorkflows grpc.service=cronworkflow.CronWorkflowService grpc.start_time="2021-05-25T12:05:55Z" grpc.time_ms=2.451 span.kind=server system=grpc
Pod description (kubectl describe pod):

Name:         argo-server-555db8fb54-46gtj
Namespace:    argo
Priority:     0
Node:         gke-argo-dev-pool-1-60041216-q1c6/10.128.15.197
Start Time:   Tue, 25 May 2021 13:20:46 +0300
Labels:       app=argo-server
              pod-template-hash=555db8fb54
Annotations:  <none>
Status:       Running
IP:           10.96.1.20
IPs:
  IP:           10.96.1.20
Controlled By:  ReplicaSet/argo-server-555db8fb54
Containers:
  argo-server:
    Container ID:  containerd:https://9fce6725641117e629ff7ea4bd56a3a1d4af497dce9ac6af69d03d6d1df51a76
    Image:         argoproj/argocli:v3.1.0-rc7
    Image ID:      docker.io/argoproj/argocli@sha256:2ed4251f5453a48d77efe6748776b2aed75907ee8dac65ada19738aa9a9581f7
    Port:          2746/TCP
    Host Port:     0/TCP
    Args:
      server
      --namespaced
      --auth-mode
      server
      --auth-mode
      client
    State:          Running
      Started:      Tue, 25 May 2021 15:06:31 +0300
    Last State:     Terminated
      Reason:       OOMKilled
      Exit Code:    137
      Started:      Tue, 25 May 2021 13:20:50 +0300
      Finished:     Tue, 25 May 2021 15:06:30 +0300
    Ready:          True
    Restart Count:  1
    Limits:
      cpu:     500m
      memory:  1600Mi
    Requests:
      cpu:      500m
      memory:   1600Mi
    Readiness:  http-get https://:2746/ delay=10s timeout=1s period=20s #success=1 #failure=3
    Environment:
      BASE_HREF:  /
    Mounts:
      /tmp from tmp (rw)
      /var/run/secrets/kubernetes.io/serviceaccount from argo-server-token-nfbsn (ro)
Conditions:
  Type              Status
  Initialized       True 
  Ready             True 
  ContainersReady   True 
  PodScheduled      True 
Volumes:
  tmp:
    Type:       EmptyDir (a temporary directory that shares a pod's lifetime)
    Medium:     
    SizeLimit:  <unset>
  argo-server-token-nfbsn:
    Type:        Secret (a volume populated by a Secret)
    SecretName:  argo-server-token-nfbsn
    Optional:    false
QoS Class:       Guaranteed
Node-Selectors:  kubernetes.io/os=linux
Tolerations:     node.kubernetes.io/not-ready:NoExecute op=Exists for 300s
                 node.kubernetes.io/unreachable:NoExecute op=Exists for 300s
Events:
  Type    Reason   Age                   From     Message
  ----    ------   ----                  ----     -------
  Normal  Created  2m32s (x2 over 108m)  kubelet  Created container argo-server
  Normal  Started  2m32s (x2 over 108m)  kubelet  Started container argo-server
  Normal  Pulled   2m32s                 kubelet  Container image "argoproj/argocli:v3.1.0-rc7" already present on machine

Message from the maintainers:

Impacted by this bug? Give it a 👍. We prioritise the issues with the most 👍.

@amitm02
Contributor Author

amitm02 commented May 25, 2021

UPDATE:
Refreshing the browser (URL: https://<SERVER>/workflows?limit=500) causes the memory spike and crash.
Here is the memory monitoring (when argo-server was allocated 1.6GB):

Tue May 25 16:40:26 IDT 2021        argo-server-555db8fb54-rwrzm           2m           15Mi            
Tue May 25 16:40:29 IDT 2021        argo-server-555db8fb54-rwrzm           2m           15Mi            
Tue May 25 16:40:31 IDT 2021        argo-server-555db8fb54-rwrzm           2m           15Mi            
Tue May 25 16:40:34 IDT 2021        argo-server-555db8fb54-rwrzm           2m           15Mi            
Tue May 25 16:40:36 IDT 2021        argo-server-555db8fb54-rwrzm           2m           15Mi       
-- browser refresh     
Tue May 25 16:40:39 IDT 2021        argo-server-555db8fb54-rwrzm           59m          1350Mi          
Tue May 25 16:40:41 IDT 2021        argo-server-555db8fb54-rwrzm           59m          1350Mi          
Tue May 25 16:40:44 IDT 2021        argo-server-555db8fb54-rwrzm           59m          1350Mi          
Tue May 25 16:40:46 IDT 2021        argo-server-555db8fb54-rwrzm           59m          1350Mi          
Tue May 25 16:40:48 IDT 2021        argo-server-555db8fb54-rwrzm           59m          1350Mi          
Tue May 25 16:40:51 IDT 2021        argo-server-555db8fb54-rwrzm           59m          1350Mi          
Tue May 25 16:40:53 IDT 2021        argo-server-555db8fb54-rwrzm           59m          1350Mi          
Tue May 25 16:40:56 IDT 2021        argo-server-555db8fb54-rwrzm           59m          1350Mi          
Tue May 25 16:40:58 IDT 2021        argo-server-555db8fb54-rwrzm           59m          1350Mi          
Tue May 25 16:41:00 IDT 2021        argo-server-555db8fb54-rwrzm           59m          1350Mi     
-- OOM crash        
Tue May 25 16:41:03 IDT 2021        
Tue May 25 16:41:05 IDT 2021        
Tue May 25 16:41:08 IDT 2021        
Tue May 25 16:41:10 IDT 2021        
Tue May 25 16:41:12 IDT 2021        
Tue May 25 16:41:15 IDT 2021        
Tue May 25 16:41:17 IDT 2021        
Tue May 25 16:41:20 IDT 2021        
Tue May 25 16:41:22 IDT 2021        
Tue May 25 16:41:24 IDT 2021        
Tue May 25 16:41:27 IDT 2021        
Tue May 25 16:41:29 IDT 2021        
Tue May 25 16:41:32 IDT 2021        
Tue May 25 16:41:34 IDT 2021        argo-server-555db8fb54-rwrzm           31m          16Mi            
Tue May 25 16:41:37 IDT 2021        argo-server-555db8fb54-rwrzm           31m          16Mi            
Tue May 25 16:41:39 IDT 2021        argo-server-555db8fb54-rwrzm           31m          16Mi            
Tue May 25 16:41:42 IDT 2021        argo-server-555db8fb54-rwrzm           31m          16Mi  
Tue May 25 16:36:54 IDT 2021        argo-server-555db8fb54-rwrzm           273m         158Mi           
Tue May 25 16:36:57 IDT 2021        argo-server-555db8fb54-rwrzm           273m         158Mi           
Tue May 25 16:36:59 IDT 2021        argo-server-555db8fb54-rwrzm           273m         158Mi 
-- browser refresh           
Tue May 25 16:37:03 IDT 2021        argo-server-555db8fb54-rwrzm           115m         717Mi           
Tue May 25 16:37:05 IDT 2021        argo-server-555db8fb54-rwrzm           115m         717Mi           
Tue May 25 16:37:07 IDT 2021        argo-server-555db8fb54-rwrzm           115m         717Mi           
Tue May 25 16:37:10 IDT 2021        argo-server-555db8fb54-rwrzm           115m         717Mi           
Tue May 25 16:37:13 IDT 2021        argo-server-555db8fb54-rwrzm           115m         717Mi           
Tue May 25 16:37:15 IDT 2021        argo-server-555db8fb54-rwrzm           115m         717Mi           
Tue May 25 16:37:18 IDT 2021        argo-server-555db8fb54-rwrzm           115m         717Mi           
Tue May 25 16:37:20 IDT 2021        argo-server-555db8fb54-rwrzm           115m         717Mi           
Tue May 25 16:37:23 IDT 2021        argo-server-555db8fb54-rwrzm           115m         717Mi           
Tue May 25 16:37:25 IDT 2021        argo-server-555db8fb54-rwrzm           115m         717Mi           
Tue May 25 16:37:27 IDT 2021        argo-server-555db8fb54-rwrzm           115m         717Mi           
Tue May 25 16:37:30 IDT 2021        argo-server-555db8fb54-rwrzm           115m         717Mi           
Tue May 25 16:37:32 IDT 2021        argo-server-555db8fb54-rwrzm           115m         717Mi      
-- OOM crash          
Tue May 25 16:37:36 IDT 2021        argo-server-555db8fb54-rwrzm           1m           16Mi            
Tue May 25 16:37:38 IDT 2021        argo-server-555db8fb54-rwrzm           1m           16Mi            
Tue May 25 16:37:40 IDT 2021        argo-server-555db8fb54-rwrzm           1m           16Mi            
Tue May 25 16:37:43 IDT 2021        argo-server-555db8fb54-rwrzm           1m           16Mi 

After increasing the argo-server allocation to 2.2GB:

Tue May 25 16:58:27 IDT 2021        argo-server-5d9f5f68d4-kn96d           2m           16Mi            
Tue May 25 16:58:30 IDT 2021        argo-server-5d9f5f68d4-kn96d           2m           16Mi            
Tue May 25 16:58:33 IDT 2021        argo-server-5d9f5f68d4-kn96d           2m           16Mi            
Tue May 25 16:58:35 IDT 2021        argo-server-5d9f5f68d4-kn96d           2m           16Mi            
Tue May 25 16:58:37 IDT 2021        argo-server-5d9f5f68d4-kn96d           60m          623Mi           
Tue May 25 16:58:40 IDT 2021        argo-server-5d9f5f68d4-kn96d           60m          623Mi           
Tue May 25 16:58:42 IDT 2021        argo-server-5d9f5f68d4-kn96d           60m          623Mi           
Tue May 25 16:58:44 IDT 2021        argo-server-5d9f5f68d4-kn96d           60m          623Mi           
Tue May 25 16:58:47 IDT 2021        argo-server-5d9f5f68d4-kn96d           60m          623Mi           
Tue May 25 16:58:49 IDT 2021        argo-server-5d9f5f68d4-kn96d           60m          623Mi           
Tue May 25 16:58:52 IDT 2021        argo-server-5d9f5f68d4-kn96d           60m          623Mi           
Tue May 25 16:58:54 IDT 2021        argo-server-5d9f5f68d4-kn96d           60m          623Mi           
Tue May 25 16:58:56 IDT 2021        argo-server-5d9f5f68d4-kn96d           60m          623Mi           
Tue May 25 16:58:59 IDT 2021        argo-server-5d9f5f68d4-kn96d           60m          623Mi           
Tue May 25 16:59:01 IDT 2021        argo-server-5d9f5f68d4-kn96d           60m          623Mi           
Tue May 25 16:59:04 IDT 2021        argo-server-5d9f5f68d4-kn96d           60m          623Mi           
Tue May 25 16:59:06 IDT 2021        argo-server-5d9f5f68d4-kn96d           501m         1970Mi          
Tue May 25 16:59:08 IDT 2021        argo-server-5d9f5f68d4-kn96d           501m         1970Mi          
Tue May 25 16:59:11 IDT 2021        argo-server-5d9f5f68d4-kn96d           501m         1970Mi          
Tue May 25 16:59:13 IDT 2021        argo-server-5d9f5f68d4-kn96d           501m         1970Mi          
Tue May 25 16:59:16 IDT 2021        argo-server-5d9f5f68d4-kn96d           501m         1970Mi          
Tue May 25 16:59:18 IDT 2021        argo-server-5d9f5f68d4-kn96d           501m         1970Mi          
Tue May 25 16:59:21 IDT 2021        argo-server-5d9f5f68d4-kn96d           501m         1970Mi          
Tue May 25 16:59:23 IDT 2021        argo-server-5d9f5f68d4-kn96d           501m         1970Mi          
Tue May 25 16:59:26 IDT 2021        argo-server-5d9f5f68d4-kn96d           501m         1970Mi          
Tue May 25 16:59:29 IDT 2021        argo-server-5d9f5f68d4-kn96d           501m         1970Mi          
Tue May 25 16:59:31 IDT 2021        argo-server-5d9f5f68d4-kn96d           501m         1970Mi          
Tue May 25 16:59:34 IDT 2021        argo-server-5d9f5f68d4-kn96d           27m          243Mi           
Tue May 25 16:59:36 IDT 2021        argo-server-5d9f5f68d4-kn96d           27m          243Mi           
Tue May 25 16:59:38 IDT 2021        argo-server-5d9f5f68d4-kn96d           27m          243Mi           
Tue May 25 16:59:41 IDT 2021        argo-server-5d9f5f68d4-kn96d           27m          243Mi           
Tue May 25 16:59:43 IDT 2021        argo-server-5d9f5f68d4-kn96d           27m          243Mi           
Tue May 25 16:59:46 IDT 2021        argo-server-5d9f5f68d4-kn96d           27m          243Mi           
Tue May 25 16:59:48 IDT 2021        argo-server-5d9f5f68d4-kn96d           27m          243Mi           
Tue May 25 16:59:50 IDT 2021        argo-server-5d9f5f68d4-kn96d           27m          243Mi           
Tue May 25 16:59:53 IDT 2021        argo-server-5d9f5f68d4-kn96d           27m          243Mi           
Tue May 25 16:59:55 IDT 2021        argo-server-5d9f5f68d4-kn96d           27m          243Mi           
Tue May 25 16:59:58 IDT 2021        argo-server-5d9f5f68d4-kn96d           27m          243Mi           
Tue May 25 17:00:00 IDT 2021        argo-server-5d9f5f68d4-kn96d           27m          243Mi           
Tue May 25 17:00:02 IDT 2021        argo-server-5d9f5f68d4-kn96d           27m          243Mi           
Tue May 25 17:00:05 IDT 2021        argo-server-5d9f5f68d4-kn96d           113m         1029Mi          
Tue May 25 17:00:07 IDT 2021        argo-server-5d9f5f68d4-kn96d           113m         1029Mi          
Tue May 25 17:00:10 IDT 2021        argo-server-5d9f5f68d4-kn96d           113m         1029Mi          
Tue May 25 17:00:12 IDT 2021        argo-server-5d9f5f68d4-kn96d           113m         1029Mi          
Tue May 25 17:00:14 IDT 2021        argo-server-5d9f5f68d4-kn96d           113m         1029Mi          
Tue May 25 17:00:17 IDT 2021        argo-server-5d9f5f68d4-kn96d           113m         1029Mi          
Tue May 25 17:00:19 IDT 2021        argo-server-5d9f5f68d4-kn96d           113m         1029Mi          
Tue May 25 17:00:21 IDT 2021        argo-server-5d9f5f68d4-kn96d           113m         1029Mi          
Tue May 25 17:00:24 IDT 2021        argo-server-5d9f5f68d4-kn96d           113m         1029Mi          
Tue May 25 17:00:27 IDT 2021        argo-server-5d9f5f68d4-kn96d           113m         1029Mi          
Tue May 25 17:00:30 IDT 2021        argo-server-5d9f5f68d4-kn96d           113m         1029Mi          
Tue May 25 17:00:32 IDT 2021        argo-server-5d9f5f68d4-kn96d           0m           2Mi             
Tue May 25 17:00:35 IDT 2021        argo-server-5d9f5f68d4-kn96d           0m           2Mi             
Tue May 25 17:00:37 IDT 2021        argo-server-5d9f5f68d4-kn96d           0m           2Mi             
Tue May 25 17:00:41 IDT 2021        argo-server-5d9f5f68d4-kn96d           0m           2Mi 
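
For reference, a limit bump like the 1.6GB → 2.2GB change above could be applied with a JSON patch along these lines (illustrative only; the deployment name and namespace come from the pod description earlier, and the value is approximate):

kubectl -n argo patch deployment argo-server --type json -p '[
  {"op": "replace", "path": "/spec/template/spec/containers/0/resources/requests/memory", "value": "2200Mi"},
  {"op": "replace", "path": "/spec/template/spec/containers/0/resources/limits/memory", "value": "2200Mi"}
]'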

@alexec
Contributor

alexec commented May 25, 2021

This sounds like a memory leak.

Can you confirm if you've tested this in v3.0 and found the same issue?

alexec added this to the v3.1 milestone May 25, 2021
@amitm02
Contributor Author

amitm02 commented May 25, 2021

We believe so, but I cannot confirm it 100%.
Pre-3.1 we had many other issues with big workflows; the server did crash, but I cannot confirm it was for the same reason.
Our current workflow relies heavily on 3.1 features, so I cannot simply run it on 3.0.

@alexec
Contributor

alexec commented May 25, 2021

@amitm02 if I create a debug build, could I ask you to test and capture some heap dumps?

alexec added a commit that referenced this issue May 25, 2021
@amitm02
Contributor Author

amitm02 commented May 25, 2021 via email

@alexec
Contributor

alexec commented May 25, 2021

Excellent.

Can you run argoproj/argocli:dev-pprof? You'll need to enable the profiling tool by setting ARGO_SERVER_PPROF=true, e.g. ARGO_SERVER_PPROF=6060 ./dist/argo server.

You should see this printed in the logs:

starting server for pprof or :6060, see https://golang.org/pkg/net/http/pprof/

I'd like to see the "heap" and "allocs" profiles, both when the server is using relatively little memory and when it is using a lot. Capture as many as you like.

go tool pprof -svg https://localhost:6060/debug/pprof/heap
go tool pprof -svg https://localhost:6060/debug/pprof/allocs

Change "svg" to "png" if you have any problem (pngs are less detailed, but still useful).

These two should show any leaks.
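
When the server runs in-cluster rather than locally, the same flag can be set on the Deployment and the pprof port forwarded first (a sketch, assuming the standard argo-server Deployment in the argo namespace):

# Enable pprof on the running server (this triggers a rollout)
kubectl -n argo set env deployment/argo-server ARGO_SERVER_PPROF=6060
# Forward the pprof port, then capture profiles as above
kubectl -n argo port-forward deployment/argo-server 6060:6060 &
go tool pprof -svg https://localhost:6060/debug/pprof/heap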

@DO-YardenG

DO-YardenG commented May 26, 2021

Hi,
Attaching profile SVGs for the following:

  • normal load (53MB memory usage by the server)
  • medium load (580MB memory usage)
  • high load (1288MB memory usage, right before crashing with OOM exceptions)
  • crash load (1GB+ memory, after the server pod starts getting OOM-killed and restarted, which happens over and over again)

argo-issue-6000-svgs.zip

@alexec
Contributor

alexec commented May 26, 2021

Thank you. I can tell from the dump that you are using node offloading, and this appears to be one potential cause. Can I ask:

  • Is ALWAYS_OFFLOAD_NODE_STATUS=true? This is a debug flag and should NEVER be set.
  • How many workflows do you have?
  • How many nodes does each workflow have?
  • How are you listing/watching workflows? Via the UI, or are you using the API?

@DO-YardenG

DO-YardenG commented May 26, 2021

Thanks for the swift reply!

  • We do not use the ALWAYS_OFFLOAD_NODE_STATUS flag at all
    (we do use nodeStatusOffLoad: true in our controller's configmap; see the snippet below)
  • Currently we only ever run a single workflow at a time (though 1-2 may still be present in the DB;
    I can confirm it happens even if the DB is empty)
  • Generally around 8K-25K nodes.
    The crash happens at around 4K-7K nodes with our current memory allocation of 2.2GB of RAM.
    (Each workflow we got the crash with has around 16-24 steps, looping over these steps a few thousand times)
  • We've generally only seen these crashes happen when watching through the UI (https://<SERVER>/workflows URL).
    Refreshing the URL tends to trigger the issue.
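
For reference, the offloading setting mentioned in the first bullet lives in the controller's configmap and can be checked like this (a sketch; standard install layout assumed):

kubectl -n argo get configmap workflow-controller-configmap -o yaml
# expect a fragment roughly like:
#   data:
#     persistence: |
#       nodeStatusOffLoad: true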

@alexec
Contributor

alexec commented May 26, 2021

@DO-YardenG My current hypothesis is that JSON marshaling of offloaded data is causing your problem.

I've just pushed a fix to dev-pprof. This fix only "hydrates" workflows ("hydration" is when we go to the database to get the offloaded nodes) if those nodes will actually be returned via the API.

It can take up to 1h for your new image (argoproj/argocli:dev-pprof) to be published.

Could I please ask you to test to see if it fixed your issue? Please capture pprof output again.
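
As a rough illustration of why this helps: a client that asks the list endpoint for only the fields it needs should no longer force hydration of status.nodes at all. A hedged sketch (the fields query parameter, -k for the self-signed cert, and the bearer token are assumptions, not taken from the fix itself):

curl -sk -H "Authorization: Bearer $ARGO_TOKEN" \
  "https://localhost:2746/api/v1/workflows/argo?fields=items.metadata.name,items.status.phase"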

@DO-YardenG

Of course, I'll update once I finish testing! 👍

@DO-YardenG

It does indeed fix the issue, as far as I can tell! 😃
We are not seeing any high memory peaks (nothing above 600MB), and the server no longer crashes consistently when refreshing the UI. (It does still crash once in a while, but not for the same actions; I'll try to debug that separately.)

Attached are the heap/allocs SVGs from our latest testing:

  • no load (22MB, no workflows running)
  • normal load (35MB, 1 workflow running)
  • peak load (around 600MB peak usage, while refreshing the UI)

argo-issue-6000-svgs-2705.zip

@alexec
Contributor

alexec commented May 27, 2021

Awesome. Thank you for testing.

sarabala1979 mentioned this issue Jun 10, 2021
alexec added a commit that referenced this issue Jul 9, 2021
* fix(cli): Only list needed fields
sarabala1979 mentioned this issue Jul 13, 2021
uturunku1 pushed a commit to newrelic-forks/argo-workflows that referenced this issue Jul 22, 2021
* fix(cli): Only list needed fields
alexec added a commit that referenced this issue Jul 26, 2021
* fix(cli): Only list needed fields. Fixes #6000 (#6298)