retentionPolicy.completed actually means succeeded #12135

Open · devstewart opened this issue Nov 3, 2023 · 9 comments
Labels: area/controller, area/gc, P3 Low priority, solution/suggested, type/bug

devstewart commented Nov 3, 2023

Pre-requisites

  • I have double-checked my configuration
  • I can confirm the issue exists when I tested with :latest
  • I'd like to contribute the fix myself (see contributing guide)

What happened/what you expected to happen?

I submitted a workflow that is expected to fail, and it does, but the workflow gets deleted immediately.

Within seconds, the workflow is gone.

I expected failed workflows to remain, as they did in earlier versions. I do not have the workflow-controller-configmap configured to delete them. Relevant configuration:

retentionPolicy: |
  completed: 400
workflowDefaults: |
  spec:
    activeDeadlineSeconds: 86400
    podGC:
      strategy: OnPodSuccess

Version

v3.5.0

Paste a small workflow that reproduces the issue. We must be able to run the workflow; don't enter a workflow that uses private images.

apiVersion: argoproj.io/v1alpha1
kind: Workflow
metadata:
  namespace: argo
  generateName: hello-world-
  labels:
    workflows.argoproj.io/archive-strategy: "false"
  annotations:
    workflows.argoproj.io/description: |
      This is a simple hello world example.
spec:
  entrypoint: helloworld
  templates:
  - name: helloworld
    container:
      image: ghcr.io/openzipkin/alpine:3.18.2
      command: ["/bin/sh"]
      args: ["-c", "exit 1"]

Logs from the workflow controller

kubectl logs -n argo deploy/workflow-controller | grep ${workflow}

time="2023-11-03T20:38:15.027Z" level=info msg="Processing workflow" namespace=argo workflow=hello-world-59xpp
time="2023-11-03T20:38:15.032Z" level=info msg="Updated phase  -> Running" namespace=argo workflow=hello-world-59xpp
time="2023-11-03T20:38:15.033Z" level=warning msg="Node was nil, will be initialized as type Skipped" namespace=argo workflow=hello-world-59xpp
time="2023-11-03T20:38:15.033Z" level=info msg="was unable to obtain node for , letting display name to be nodeName" namespace=argo workflow=hello-world-59xpp
time="2023-11-03T20:38:15.033Z" level=info msg="Pod node hello-world-59xpp initialized Pending" namespace=argo workflow=hello-world-59xpp
time="2023-11-03T20:38:15.057Z" level=info msg="Created pod: hello-world-59xpp (hello-world-59xpp)" namespace=argo workflow=hello-world-59xpp
time="2023-11-03T20:38:15.057Z" level=info msg="TaskSet Reconciliation" namespace=argo workflow=hello-world-59xpp
time="2023-11-03T20:38:15.057Z" level=info msg=reconcileAgentPod namespace=argo workflow=hello-world-59xpp
time="2023-11-03T20:38:15.070Z" level=info msg="Workflow update successful" namespace=argo phase=Running resourceVersion=1703572159 workflow=hello-world-59xpp
time="2023-11-03T20:38:25.061Z" level=info msg="Processing workflow" namespace=argo workflow=hello-world-59xpp
time="2023-11-03T20:38:25.062Z" level=info msg="Task-result reconciliation" namespace=argo numObjs=0 workflow=hello-world-59xpp
time="2023-11-03T20:38:25.062Z" level=info msg="Pod failed: Error (exit code 1)" displayName=hello-world-59xpp namespace=argo pod=hello-world-59xpp templateName=helloworld workflow=hello-world-59xpp
time="2023-11-03T20:38:25.063Z" level=info msg="node changed" namespace=argo new.message="Error (exit code 1)" new.phase=Failed new.progress=0/1 nodeID=hello-world-59xpp old.message= old.phase=Pending old.progress=0/1 workflow=hello-world-59xpp
time="2023-11-03T20:38:25.063Z" level=info msg="TaskSet Reconciliation" namespace=argo workflow=hello-world-59xpp
time="2023-11-03T20:38:25.063Z" level=info msg=reconcileAgentPod namespace=argo workflow=hello-world-59xpp
time="2023-11-03T20:38:25.063Z" level=info msg="Updated phase Running -> Failed" namespace=argo workflow=hello-world-59xpp
time="2023-11-03T20:38:25.063Z" level=info msg="Updated message  -> Error (exit code 1)" namespace=argo workflow=hello-world-59xpp
time="2023-11-03T20:38:25.063Z" level=info msg="Marking workflow completed" namespace=argo workflow=hello-world-59xpp
time="2023-11-03T20:38:25.068Z" level=info msg="cleaning up pod" action=deletePod key=argo/hello-world-59xpp-1340600742-agent/deletePod
time="2023-11-03T20:38:25.071Z" level=info msg="Workflow update successful" namespace=argo phase=Failed resourceVersion=1703572339 workflow=hello-world-59xpp
time="2023-11-03T20:38:25.073Z" level=info msg="Queueing Failed workflow argo/hello-world-59xpp for delete due to max rention(0 workflows)"
time="2023-11-03T20:38:25.073Z" level=info msg="Deleting garbage collected workflow 'argo/hello-world-59xpp'"
time="2023-11-03T20:38:25.081Z" level=info msg="cleaning up pod" action=labelPodCompleted key=argo/hello-world-59xpp/labelPodCompleted
time="2023-11-03T20:38:25.081Z" level=info msg="Successfully request 'argo/hello-world-59xpp' to be deleted"

Logs from in your workflow's wait container

kubectl logs -n argo -c wait -l workflows.argoproj.io/workflow=${workflow},workflow.argoproj.io/phase!=Succeeded

time="2023-11-03T20:38:19.525Z" level=info msg="Creating minio client using static credentials" endpoint=deap-api.decloud.xxx.com
time="2023-11-03T20:38:19.525Z" level=info msg="Saving file to s3" bucket=deap-argo-prod endpoint=deap-api.decloud.xxx.com key="dev\\ /2023\\ /11\\ /03\\ /hello-world-59xpp\\ /hello-world-59xpp\"/main.log" path=/tmp/argo/outputs/logs/main.log
time="2023-11-03T20:38:19.594Z" level=info msg="Save artifact" artifactName=main-logs duration=94.016454ms error="<nil>" key="dev\\ /2023\\ /11\\ /03\\ /hello-world-59xpp\\ /hello-world-59xpp\"/main.log"
time="2023-11-03T20:38:19.594Z" level=info msg="not deleting local artifact" localArtPath=/tmp/argo/outputs/logs/main.log
time="2023-11-03T20:38:19.594Z" level=info msg="Successfully saved file: /tmp/argo/outputs/logs/main.log"
time="2023-11-03T20:38:19.604Z" level=warning msg="failed to patch task set, falling back to legacy/insecure pod patch, see https://argoproj.github.io/argo-workflows/workflow-rbac/" error="workflowtaskresults.argoproj.io is forbidden: User \"system:serviceaccount:argo:default\" cannot create resource \"workflowtaskresults\" in API group \"argoproj.io\" in the namespace \"argo\""
time="2023-11-03T20:38:19.605Z" level=warning msg="Non-transient error: pods \"hello-world-59xpp\" is forbidden: User \"system:serviceaccount:argo:default\" cannot patch resource \"pods\" in API group \"\" in the namespace \"argo\""
time="2023-11-03T20:38:19.606Z" level=error msg="executor error: pods \"hello-world-59xpp\" is forbidden: User \"system:serviceaccount:argo:default\" cannot patch resource \"pods\" in API group \"\" in the namespace \"argo\""
time="2023-11-03T20:38:19.606Z" level=info msg="Alloc=12580 TotalAlloc=18686 Sys=30309 NumGC=4 Goroutines=10"
time="2023-11-03T20:38:19.606Z" level=fatal msg="pods \"hello-world-59xpp\" is forbidden: User \"system:serviceaccount:argo:default\" cannot patch resource \"pods\" in API group \"\" in the namespace \"argo\""

tooptoop4 (Contributor) commented

@devstewart do you have a retentionPolicy block configured?

agilgur5 (Member) commented Nov 4, 2023

Looks like this is a follow-up to #12082

@tooptoop4 indeed, per that discussion, they do have a retentionPolicy

agilgur5 added the area/controller and P3 labels Nov 4, 2023

devstewart (Author) commented

@devstewart do you have a retentionPolicy block configured?

Yes, our retention policy is set to 400, but we only get about 100-200 new workflows a day, and the failed workflows are deleted immediately.

devstewart (Author) commented Nov 6, 2023

Here's the YAML for our workflow-controller-configmap:

apiVersion: v1
kind: ConfigMap
metadata:
  name: workflow-controller-configmap
  namespace: argo
data:
  artifactRepository: |-
    # archiveLogs will archive the main container logs as an artifact
    archiveLogs: true

    s3:
      endpoint: deap-api.xxx.com
      bucket: deap-argo-prod
      insecure: false
      keyFormat: stage\
        /{{workflow.creationTimestamp.Y}}\
        /{{workflow.creationTimestamp.m}}\
        /{{workflow.creationTimestamp.d}}\
        /{{workflow.name}}\
        /{{pod.name}}"
      accessKeySecret:
        name: argo-minio-credentials
        key: accessKey
      secretKeySecret:
        name: argo-minio-credentials
        key: secretKey
  executor: |
    imagePullPolicy: IfNotPresent
    image: xxx.com/deap/argoexec:3.5.0.282
  retentionPolicy: |
    completed: 400
  workflowDefaults: |
    spec:
      activeDeadlineSeconds: 86400
      podGC:
        strategy: OnPodSuccess

Does completed actually mean completed, or does it actually mean succeeded?

I added:

failed: 50
errored: 50

and now it sort of works as expected.
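
So the retentionPolicy block in the configmap above now looks roughly like this:

retentionPolicy: |
  completed: 400
  failed: 50
  errored: 50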

@tooptoop4

devstewart (Author) commented

I believe I found the offending code:

func (c *Controller) runGC(phase wfv1.WorkflowPhase) {
    defer runtimeutil.HandleCrash(runtimeutil.PanicHandlers...)
    var maxWorkflows int
    switch phase {
    case wfv1.WorkflowSucceeded:
        maxWorkflows = c.retentionPolicy.Completed
    case wfv1.WorkflowFailed:
        maxWorkflows = c.retentionPolicy.Failed
    case wfv1.WorkflowError:
        maxWorkflows = c.retentionPolicy.Errored
    default:
        return
    }
    for c.orderedQueue[phase].Len() > maxWorkflows {
        key, _ := cache.MetaNamespaceKeyFunc(heap.Pop(c.orderedQueue[phase]))
        log.Infof("Queueing %v workflow %s for delete due to max rention(%d workflows)", phase, key, maxWorkflows)
        c.workqueue.Add(key)
        <-ticker.C
    }
}

I believe the meanings of completed and succeeded should not be interchangeable: the code should have a retentionPolicy.Succeeded in addition to retentionPolicy.Completed, with retentionPolicy.Completed acting as a maximum retention limit regardless of success or failure. As the code stands, completed is only applied to Succeeded workflows, so with no failed or errored keys configured those phases fall back to a limit of 0 and are deleted immediately, which matches the max rention(0 workflows) line in the controller log above.

So if the definition were:

retentionPolicy: |
    completed: 400
    succeeded: 400
    failed: 50
    errored: 50

Then at most 400 succeeded workflows would be kept, and at most 50 failed and 50 errored, while completed: 400 would also cap the combined total: if there were already 300 succeeded, 50 failed, and 50 errored workflows, the next completed workflow would push out the oldest one across all three queues. Maybe that's overly complicated, and completed should simply be replaced with succeeded, but that would be a breaking change.
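
In code terms that could just be one extra field on the retention policy config, roughly like this (a hypothetical sketch; only the three existing field names are taken from the runGC code above, everything else is guessed):

package config

// RetentionPolicy sketch: the three existing fields mirror what runGC references above;
// Succeeded is the proposed addition, and Completed would become the cap across all phases.
type RetentionPolicy struct {
    Completed int `json:"completed,omitempty"` // proposed: max total of Succeeded + Failed + Errored workflows
    Succeeded int `json:"succeeded,omitempty"` // proposed new field: max Succeeded workflows
    Failed    int `json:"failed,omitempty"`    // max Failed workflows
    Errored   int `json:"errored,omitempty"`   // max Errored workflows
}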

agilgur5 (Member) commented Nov 7, 2023

Nice find @devstewart. Your interpretation sounds correct to me, and the spec refers to "Completed" as "Succeeded, Failed, or Errored" in many places.

This misnomer seems to have been missed in #6854. The original issue, #5369, does not mention "Succeeded" either, but does seem to correctly allude to "Completed" as a superset.

I'm open to either renaming or making "Completed" actually work as seemingly originally intended (and thereby adding a new "Succeeded" as well).

If we rename, we can make it backward-compatible for the time being by accepting both names and warning when it is the incorrect "Completed" name.
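
For illustration, roughly this kind of fallback (hypothetical helper and signature, not actual Argo code):

package config

import log "github.com/sirupsen/logrus"

// succeededLimit illustrates the backward-compat idea: prefer the new "succeeded" key and
// fall back to the old "completed" key with a deprecation warning. A missing key falls
// back to 0, mirroring the current behaviour.
func succeededLimit(policy map[string]int) int {
    if v, ok := policy["succeeded"]; ok {
        return v
    }
    if v, ok := policy["completed"]; ok {
        log.Warn("retentionPolicy.completed currently only applies to Succeeded workflows; prefer retentionPolicy.succeeded")
        return v
    }
    return 0
}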

Making "Completed" work can, I think, be done without any new data or indexes: take the total of all 3 queues and, when it exceeds the limit, compare the oldest entry of each queue and GC the oldest of the three. That may need some careful locking though if there are concurrent reads/writes to the queues.
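
A standalone sketch of that comparison (made-up types and queue representation, not the controller's actual data structures):

package main

import (
    "fmt"
    "time"
)

// entry is a stand-in for a GC queue item: a workflow key plus its finish time.
type entry struct {
    key      string
    finished time.Time
}

// enforceCompletedTotal trims the combined per-phase queues (each ordered oldest-first)
// down to max by repeatedly removing the globally oldest entry, and returns the keys
// that would be queued for deletion.
func enforceCompletedTotal(queues map[string][]entry, max int) []string {
    var deleted []string
    for {
        total := 0
        for _, q := range queues {
            total += len(q)
        }
        if total <= max {
            return deleted
        }
        // compare the oldest (front) entry of each queue and pick the oldest overall
        oldest := ""
        for phase, q := range queues {
            if len(q) == 0 {
                continue
            }
            if oldest == "" || q[0].finished.Before(queues[oldest][0].finished) {
                oldest = phase
            }
        }
        deleted = append(deleted, queues[oldest][0].key)
        queues[oldest] = queues[oldest][1:]
    }
}

func main() {
    now := time.Now()
    queues := map[string][]entry{
        "Succeeded": {{"wf-a", now.Add(-3 * time.Hour)}, {"wf-b", now.Add(-1 * time.Hour)}},
        "Failed":    {{"wf-c", now.Add(-2 * time.Hour)}},
    }
    fmt.Println(enforceCompletedTotal(queues, 2)) // prints [wf-a], the oldest across both queues
}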

@devstewart would you like to fix this/implement that?

devstewart (Author) commented

@agilgur5

@devstewart would you like to fix this/implement that?

I wouldn't mind attempting either. I haven't contributed to Argo before, so I'll get set up and see how far I can get. It's a fairly small and seemingly confined piece of the code base, so it seems like an okay first contribution, and it would be ideal if "completed" functioned as a superset here.

agilgur5 (Member) commented Nov 7, 2023

It's a fairly small and seemingly confined piece of the code base, so it seems like an okay first contribution

yup, my thoughts exactly! and you already found the root cause, which is often (but not always) the more difficult half of solving the problem. if you've got the root cause down, coding the solution can sometimes be pretty straightforward, which I think it is in this case.

agilgur5 added the solution/suggested label Dec 29, 2023

agilgur5 (Member) commented Jan 7, 2024

Hi @devstewart, have you made any progress on this?
I'd like to fix this bug, so if you don't have plans to work on it soon, I'll take it on myself.

agilgur5 self-assigned this Jan 7, 2024
agilgur5 added the area/gc label Feb 18, 2024
agilgur5 changed the title from "Failing workflows get deleted following pod failure." to "retentionPolicy.completed actually means succeeded" Mar 2, 2024