retentionPolicy.completed actually means succeeded #12135

Open · devstewart opened this issue Nov 3, 2023 · 9 comments
Labels: area/controller, area/gc, P3 Low priority, solution/suggested, type/bug

devstewart commented Nov 3, 2023

Pre-requisites

  • I have double-checked my configuration
  • I can confirm the issue exists when I tested with :latest
  • I'd like to contribute the fix myself (see contributing guide)

What happened/what you expected to happen?

I submitted a workflow that is expected to fail, and it does, but the workflow gets deleted immediately.

Within seconds, the workflow is gone.

I expected failed workflows to remain, as they did in earlier versions. I do not have the workflow-controller-configmap configured to delete them. Relevant configuration:

retentionPolicy: |
  completed: 400
workflowDefaults: |
  spec:
    activeDeadlineSeconds: 86400
    podGC:
      strategy: OnPodSuccess

Version

v3.5.0

Paste a small workflow that reproduces the issue. We must be able to run the workflow; don't enter a workflow that uses private images.

apiVersion: argoproj.io/v1alpha1
kind: Workflow
metadata:
  namespace: argo
  generateName: hello-world-
  labels:
    workflows.argoproj.io/archive-strategy: "false"
  annotations:
    workflows.argoproj.io/description: |
      This is a simple hello world example.
spec:
  entrypoint: helloworld
  templates:
  - name: helloworld
    container:
      image: ghcr.io/openzipkin/alpine:3.18.2
      command: ["/bin/sh"]
      args: ["-c", "exit 1"]

Logs from the workflow controller

kubectl logs -n argo deploy/workflow-controller | grep ${workflow}

time="2023-11-03T20:38:15.027Z" level=info msg="Processing workflow" namespace=argo workflow=hello-world-59xpp
time="2023-11-03T20:38:15.032Z" level=info msg="Updated phase  -> Running" namespace=argo workflow=hello-world-59xpp
time="2023-11-03T20:38:15.033Z" level=warning msg="Node was nil, will be initialized as type Skipped" namespace=argo workflow=hello-world-59xpp
time="2023-11-03T20:38:15.033Z" level=info msg="was unable to obtain node for , letting display name to be nodeName" namespace=argo workflow=hello-world-59xpp
time="2023-11-03T20:38:15.033Z" level=info msg="Pod node hello-world-59xpp initialized Pending" namespace=argo workflow=hello-world-59xpp
time="2023-11-03T20:38:15.057Z" level=info msg="Created pod: hello-world-59xpp (hello-world-59xpp)" namespace=argo workflow=hello-world-59xpp
time="2023-11-03T20:38:15.057Z" level=info msg="TaskSet Reconciliation" namespace=argo workflow=hello-world-59xpp
time="2023-11-03T20:38:15.057Z" level=info msg=reconcileAgentPod namespace=argo workflow=hello-world-59xpp
time="2023-11-03T20:38:15.070Z" level=info msg="Workflow update successful" namespace=argo phase=Running resourceVersion=1703572159 workflow=hello-world-59xpp
time="2023-11-03T20:38:25.061Z" level=info msg="Processing workflow" namespace=argo workflow=hello-world-59xpp
time="2023-11-03T20:38:25.062Z" level=info msg="Task-result reconciliation" namespace=argo numObjs=0 workflow=hello-world-59xpp
time="2023-11-03T20:38:25.062Z" level=info msg="Pod failed: Error (exit code 1)" displayName=hello-world-59xpp namespace=argo pod=hello-world-59xpp templateName=helloworld workflow=hello-world-59xpp
time="2023-11-03T20:38:25.063Z" level=info msg="node changed" namespace=argo new.message="Error (exit code 1)" new.phase=Failed new.progress=0/1 nodeID=hello-world-59xpp old.message= old.phase=Pending old.progress=0/1 workflow=hello-world-59xpp
time="2023-11-03T20:38:25.063Z" level=info msg="TaskSet Reconciliation" namespace=argo workflow=hello-world-59xpp
time="2023-11-03T20:38:25.063Z" level=info msg=reconcileAgentPod namespace=argo workflow=hello-world-59xpp
time="2023-11-03T20:38:25.063Z" level=info msg="Updated phase Running -> Failed" namespace=argo workflow=hello-world-59xpp
time="2023-11-03T20:38:25.063Z" level=info msg="Updated message  -> Error (exit code 1)" namespace=argo workflow=hello-world-59xpp
time="2023-11-03T20:38:25.063Z" level=info msg="Marking workflow completed" namespace=argo workflow=hello-world-59xpp
time="2023-11-03T20:38:25.068Z" level=info msg="cleaning up pod" action=deletePod key=argo/hello-world-59xpp-1340600742-agent/deletePod
time="2023-11-03T20:38:25.071Z" level=info msg="Workflow update successful" namespace=argo phase=Failed resourceVersion=1703572339 workflow=hello-world-59xpp
time="2023-11-03T20:38:25.073Z" level=info msg="Queueing Failed workflow argo/hello-world-59xpp for delete due to max rention(0 workflows)"
time="2023-11-03T20:38:25.073Z" level=info msg="Deleting garbage collected workflow 'argo/hello-world-59xpp'"
time="2023-11-03T20:38:25.081Z" level=info msg="cleaning up pod" action=labelPodCompleted key=argo/hello-world-59xpp/labelPodCompleted
time="2023-11-03T20:38:25.081Z" level=info msg="Successfully request 'argo/hello-world-59xpp' to be deleted"

Logs from in your workflow's wait container

kubectl logs -n argo -c wait -l workflows.argoproj.io/workflow=${workflow},workflow.argoproj.io/phase!=Succeeded

time="2023-11-03T20:38:19.525Z" level=info msg="Creating minio client using static credentials" endpoint=deap-api.decloud.xxx.com
time="2023-11-03T20:38:19.525Z" level=info msg="Saving file to s3" bucket=deap-argo-prod endpoint=deap-api.decloud.xxx.com key="dev\\ /2023\\ /11\\ /03\\ /hello-world-59xpp\\ /hello-world-59xpp\"/main.log" path=/tmp/argo/outputs/logs/main.log
time="2023-11-03T20:38:19.594Z" level=info msg="Save artifact" artifactName=main-logs duration=94.016454ms error="<nil>" key="dev\\ /2023\\ /11\\ /03\\ /hello-world-59xpp\\ /hello-world-59xpp\"/main.log"
time="2023-11-03T20:38:19.594Z" level=info msg="not deleting local artifact" localArtPath=/tmp/argo/outputs/logs/main.log
time="2023-11-03T20:38:19.594Z" level=info msg="Successfully saved file: /tmp/argo/outputs/logs/main.log"
time="2023-11-03T20:38:19.604Z" level=warning msg="failed to patch task set, falling back to legacy/insecure pod patch, see https://argoproj.github.io/argo-workflows/workflow-rbac/" error="workflowtaskresults.argoproj.io is forbidden: User \"system:serviceaccount:argo:default\" cannot create resource \"workflowtaskresults\" in API group \"argoproj.io\" in the namespace \"argo\""
time="2023-11-03T20:38:19.605Z" level=warning msg="Non-transient error: pods \"hello-world-59xpp\" is forbidden: User \"system:serviceaccount:argo:default\" cannot patch resource \"pods\" in API group \"\" in the namespace \"argo\""
time="2023-11-03T20:38:19.606Z" level=error msg="executor error: pods \"hello-world-59xpp\" is forbidden: User \"system:serviceaccount:argo:default\" cannot patch resource \"pods\" in API group \"\" in the namespace \"argo\""
time="2023-11-03T20:38:19.606Z" level=info msg="Alloc=12580 TotalAlloc=18686 Sys=30309 NumGC=4 Goroutines=10"
time="2023-11-03T20:38:19.606Z" level=fatal msg="pods \"hello-world-59xpp\" is forbidden: User \"system:serviceaccount:argo:default\" cannot patch resource \"pods\" in API group \"\" in the namespace \"argo\""

tooptoop4 (Contributor) commented

@devstewart do you have a retentionPolicy block configured?

agilgur5 (Member) commented Nov 4, 2023

Looks like this is a follow-up to #12082

@tooptoop4 indeed, per that discussion, they do have a retentionPolicy

agilgur5 added the area/controller and P3 labels Nov 4, 2023

devstewart (Author) commented

@devstewart do you have a retentionPolicy block configured?

Yes, our retention policy is set to 400, but we only get about 100-200 new workflows a day, and the failed workflows are deleted immediately.

devstewart (Author) commented Nov 6, 2023

Here's the YAML for our workflow-controller-configmap:

apiVersion: v1
kind: ConfigMap
metadata:
  name: workflow-controller-configmap
  namespace: argo
data:
  artifactRepository: |-
    # archiveLogs will archive the main container logs as an artifact
    archiveLogs: true

    s3:
      endpoint: deap-api.xxx.com
      bucket: deap-argo-prod
      insecure: false
      keyFormat: stage\
        /{{workflow.creationTimestamp.Y}}\
        /{{workflow.creationTimestamp.m}}\
        /{{workflow.creationTimestamp.d}}\
        /{{workflow.name}}\
        /{{pod.name}}"
      accessKeySecret:
        name: argo-minio-credentials
        key: accessKey
      secretKeySecret:
        name: argo-minio-credentials
        key: secretKey
  executor: |
    imagePullPolicy: IfNotPresent
    image: xxx.com/deap/argoexec:3.5.0.282
  retentionPolicy: |
    completed: 400
  workflowDefaults: |
    spec:
      activeDeadlineSeconds: 86400
      podGC:
        strategy: OnPodSuccess

Does completed actually mean completed, or does it actually mean succeeded?

I added:

failed: 50
errored: 50

and now it sort of works as expected.
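
So the retentionPolicy block in the configmap above now looks roughly like this:

retentionPolicy: |
  completed: 400
  failed: 50
  errored: 50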

@tooptoop4

devstewart (Author) commented

I believe I found the offending code:

func (c *Controller) runGC(phase wfv1.WorkflowPhase) {
    defer runtimeutil.HandleCrash(runtimeutil.PanicHandlers...)
    var maxWorkflows int
    switch phase {
    case wfv1.WorkflowSucceeded:
        maxWorkflows = c.retentionPolicy.Completed
    case wfv1.WorkflowFailed:
        maxWorkflows = c.retentionPolicy.Failed
    case wfv1.WorkflowError:
        maxWorkflows = c.retentionPolicy.Errored
    default:
        return
    }
    for c.orderedQueue[phase].Len() > maxWorkflows {
        key, _ := cache.MetaNamespaceKeyFunc(heap.Pop(c.orderedQueue[phase]))
        log.Infof("Queueing %v workflow %s for delete due to max rention(%d workflows)", phase, key, maxWorkflows)
        c.workqueue.Add(key)
        <-ticker.C
    }
}

I believe the meanings of completed and succeeded should not be interchangeable: the code should have a retentionPolicy.Succeeded in addition to retentionPolicy.Completed, with retentionPolicy.Completed acting as a maximum retention limit regardless of success or failure. As the code stands, completed is only applied to Succeeded workflows, so with no failed or errored keys configured those phases fall back to a limit of 0 and are deleted immediately, which matches the max rention(0 workflows) line in the controller log above.

So if the definition were:

retentionPolicy: |
    completed: 400
    succeeded: 400
    failed: 50
    errored: 50

Then at most 400 succeeded workflows would be kept, and at most 50 failed and 50 errored, while completed: 400 would also cap the combined total: if there were already 300 succeeded, 50 failed, and 50 errored workflows, the next completed workflow would push out the oldest one across all three queues. Maybe that's overly complicated, and completed should simply be replaced with succeeded, but that would be a breaking change.
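
In code terms that could just be one extra field on the retention policy config, roughly like this (a hypothetical sketch; only the three existing field names are taken from the runGC code above, everything else is guessed):

package config

// RetentionPolicy sketch: the three existing fields mirror what runGC references above;
// Succeeded is the proposed addition, and Completed would become the cap across all phases.
type RetentionPolicy struct {
    Completed int `json:"completed,omitempty"` // proposed: max total of Succeeded + Failed + Errored workflows
    Succeeded int `json:"succeeded,omitempty"` // proposed new field: max Succeeded workflows
    Failed    int `json:"failed,omitempty"`    // max Failed workflows
    Errored   int `json:"errored,omitempty"`   // max Errored workflows
}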

agilgur5 (Member) commented Nov 7, 2023

Nice find @devstewart. Your interpretation sounds correct to me, and the spec refers to "Completed" as "Succeeded, Failed, or Errored" in many places.

This misnomer seems to have been missed in #6854. The original issue, #5369, does not mention "Succeeded" either, but does seem to correctly allude to "Completed" as a superset.

I'm open to either renaming or making "Completed" actually work as seemingly originally intended (and thereby adding a new "Succeeded" as well).

If we rename, we can make it backward-compatible for the time being by accepting both names and warning when it is the incorrect "Completed" name.
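
For illustration, roughly this kind of fallback (hypothetical helper and signature, not actual Argo code):

package config

import log "github.com/sirupsen/logrus"

// succeededLimit illustrates the backward-compat idea: prefer the new "succeeded" key and
// fall back to the old "completed" key with a deprecation warning. A missing key falls
// back to 0, mirroring the current behaviour.
func succeededLimit(policy map[string]int) int {
    if v, ok := policy["succeeded"]; ok {
        return v
    }
    if v, ok := policy["completed"]; ok {
        log.Warn("retentionPolicy.completed currently only applies to Succeeded workflows; prefer retentionPolicy.succeeded")
        return v
    }
    return 0
}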

Making "Completed" work can, I think, be done without any new data or indexes: take the total of all 3 queues and, when it exceeds the limit, compare the oldest entry of each queue and GC the oldest of the three. That may need some careful locking though if there are concurrent reads/writes to the queues.
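
A standalone sketch of that comparison (made-up types and queue representation, not the controller's actual data structures):

package main

import (
    "fmt"
    "time"
)

// entry is a stand-in for a GC queue item: a workflow key plus its finish time.
type entry struct {
    key      string
    finished time.Time
}

// enforceCompletedTotal trims the combined per-phase queues (each ordered oldest-first)
// down to max by repeatedly removing the globally oldest entry, and returns the keys
// that would be queued for deletion.
func enforceCompletedTotal(queues map[string][]entry, max int) []string {
    var deleted []string
    for {
        total := 0
        for _, q := range queues {
            total += len(q)
        }
        if total <= max {
            return deleted
        }
        // compare the oldest (front) entry of each queue and pick the oldest overall
        oldest := ""
        for phase, q := range queues {
            if len(q) == 0 {
                continue
            }
            if oldest == "" || q[0].finished.Before(queues[oldest][0].finished) {
                oldest = phase
            }
        }
        deleted = append(deleted, queues[oldest][0].key)
        queues[oldest] = queues[oldest][1:]
    }
}

func main() {
    now := time.Now()
    queues := map[string][]entry{
        "Succeeded": {{"wf-a", now.Add(-3 * time.Hour)}, {"wf-b", now.Add(-1 * time.Hour)}},
        "Failed":    {{"wf-c", now.Add(-2 * time.Hour)}},
    }
    fmt.Println(enforceCompletedTotal(queues, 2)) // prints [wf-a], the oldest across both queues
}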

@devstewart would you like to fix this/implement that?

devstewart (Author) commented

@agilgur5

@devstewart would you like to fix this/implement that?

I wouldn't mind attempting either. I haven't contributed to Argo before, so I'll get set up and see how far I can get. It's a fairly small and seemingly confined piece of the code base, so it seems like an okay first contribution, and it would be ideal if "completed" functioned as a superset here.

agilgur5 (Member) commented Nov 7, 2023

It's a fairly small and seemingly confined piece of the code base, so it seems like an okay first contribution

yup, my thoughts exactly! and you already found the root cause, which is often (but not always) the more difficult half of solving the problem. if you've got the root cause down, coding the solution can sometimes be pretty straightforward, which I think it is in this case.

agilgur5 added the solution/suggested label Dec 29, 2023

agilgur5 (Member) commented Jan 7, 2024

Hi @devstewart, have you made any progress on this?
I'd like to fix this bug, so if you don't have plans to work on it soon, I'll take it on myself.

agilgur5 self-assigned this Jan 7, 2024
agilgur5 added the area/gc label Feb 18, 2024
agilgur5 changed the title from "Failing workflows get deleted following pod failure." to "retentionPolicy.completed actually means succeeded" Mar 2, 2024