-
Notifications
You must be signed in to change notification settings - Fork 3.2k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
retentionPolicy.completed
actually means succeeded
#12135
Comments
@devstewart do u have retentionpolicy block like argo-workflows/manifests/quick-start-minimal.yaml Line 1448 in f88ae88
|
Looks like this is a follow-up to #12082 @tooptoop4 indeed, per that discussion, they do have a |
Yes, our retention policy is set to 400, but we only get about 100 - 200 new workflows a day and the failed workflows are deleted immediately. |
Here's the apiVersion: v1
kind: ConfigMap
metadata:
name: workflow-controller-configmap
namespace: argo
data:
artifactRepository: |-
# archiveLogs will archive the main container logs as an artifact
archiveLogs: true
s3:
endpoint: deap-api.xxx.com
bucket: deap-argo-prod
insecure: false
keyFormat: stage\
/{{workflow.creationTimestamp.Y}}\
/{{workflow.creationTimestamp.m}}\
/{{workflow.creationTimestamp.d}}\
/{{workflow.name}}\
/{{pod.name}}"
accessKeySecret:
name: argo-minio-credentials
key: accessKey
secretKeySecret:
name: argo-minio-credentials
key: secretKey
executor: |
imagePullPolicy: IfNotPresent
image: xxx.com/deap/argoexec:3.5.0.282
retentionPolicy: |
completed: 400
workflowDefaults: |
spec:
activeDeadlineSeconds: 86400
podGC:
strategy: OnPodSuccess
Does I added: failed: 50
errored: 50 and now it sort of works as expected. |
I believe I found the offending code: argo-workflows/workflow/gccontroller/gc_controller.go Lines 140 to 160 in 80cd2ec
I believe the meaning of completed and successful, should not be interchangeable, and the code should be a So if this is the definition was: retentionPolicy: |
completed: 400
succeeded: 400
failed: 50
errored: 50 Then a max of 400 successful workflows would be kept, or 50 failed or errored or the oldest from all 3 queues in the event that there's 300 successful, 50 failed, and 50 errored workflows. Maybe that's overly complicated, and |
Nice find @devstewart. Your interpretation sounds correct to me, and the spec refers to "Completed" as "Succeeded, Failed, or Errored" in many places. This misnomer seems to have been missed in #6854. The original issue, #5369, does not mention "Succeeded" either, but does seem to correctly allude to "Completed" as a superset. I'm open to either renaming or making "Completed" actually work as seemingly originally intended (and thereby adding a new "Succeeded" as well). If we rename, we can make it backward-compatible for the time being by accepting both names and warning when it is the incorrect "Completed" name. Making "Completed" work I think can be done without any new data or indexes. Just take the total of all 3 queues, and if needed, compare the oldest 3 to each other and GC the oldest of the 3. That may need some careful locking though if there are concurrent reads/writes to the queues. @devstewart would you like to fix this/implement that? |
I wouldn't mind attempting to do either. I haven't contributed to Argo before, so I'll get setup and see how far I can get. It's a fairly small and seemingly confined piece of part of the code base so it seems like an okay first contribution, and it would be ideal if "completed" functioned as a superset here. |
yup, my thoughts exactly! and you already found the root cause, which is often (but not always) the more difficult half of solving the problem. if you've got the root cause down, coding the solution can sometimes be pretty straightforward, which I think it is in this case. |
Hi @devstewart have you made any progress on this? |
retentionPolicy.completed
actually means succeeded
Pre-requisites
:latest
What happened/what you expected to happen?
Submitted workflow which is expected to fail and it does, but the workflow gets deleted immediately.
Within seconds I get workflow gone.
I expected the failed workflows to remain as they did in earlier versions. I do not have the
workflow-controller-configmap
configured to do this. Relevant configuration:Version
v3.5.0
Paste a small workflow that reproduces the issue. We must be able to run the workflow; don't enter a workflows that uses private images.
Logs from the workflow controller
Logs from in your workflow's wait container
The text was updated successfully, but these errors were encountered: