fix: onExit step gets stuck if the workflow exceeds activeDeadlineSeconds. Fixes #2603 #2605
Conversation
@simster7 do you want to take a look?
Codecov Report

```
@@            Coverage Diff             @@
##           master    #2605      +/-   ##
==========================================
- Coverage   11.16%   11.16%    -0.01%
==========================================
  Files          83       83
  Lines       32673    32675        +2
==========================================
- Hits         3649     3648        -1
- Misses      28525    28530        +5
+ Partials      499      497        -2
```

Continue to review full report at Codecov.
Yes, please don't merge until I've had a chance to look.
I'm also unable to reproduce this behavior. I used the Workflow you provided in the issue, the Workflow you provided in this test case, and even this Workflow to attempt to reproduce:
```yaml
apiVersion: argoproj.io/v1alpha1
kind: Workflow
metadata:
  name: exit-handler-sleep
spec:
  entrypoint: intentional-fail
  activeDeadlineSeconds: 10
  onExit: exit-handler
  templates:
  - name: intentional-fail
    container:
      image: alpine:latest
      command: [sh, -c]
      args: ["echo intentional failure; sleep 20; exit 1"]
  - name: exit-handler
    container:
      image: alpine:latest
      command: [sh, -c]
      args: ["echo send e-mail: {{workflow.name}} {{workflow.status}}."]
```
They all finish.
workflow/controller/workflowpod.go (Outdated)

```diff
@@ -107,7 +107,7 @@ func (woc *wfOperationCtx) createWorkflowPod(nodeName string, mainCtr apiv1.Cont
 } else {
 	wfActiveDeadlineSeconds := int64((*wfDeadline).Sub(time.Now().UTC()).Seconds())
 	if wfActiveDeadlineSeconds < 0 {
-		return nil, nil
+		return nil, fmt.Errorf("Scheduling pod after workflow deadline %s", wfDeadline)
```
Hmmm, I'm not sure this should be an error.

For one, this would make it such that no Pods (even those in `onExit` nodes) could be scheduled after the deadline has passed. I'm not entirely convinced, but I think the default behavior should be that `onExit` nodes still run after the deadline has passed. I'm thinking of teardown code that needs to be run regardless of timeouts.

Secondly, returning an error here would guarantee that the Workflow finishes with an `Error` phase. I don't think this should be the case. If a workflow fails because of `activeDeadlineSeconds`, it should finish with a `Failed` phase, as the cause was internal to the Workflow executing and was not an error of executing the Workflow itself.
I had a play, and if the exit handler runs quickly enough then `wfActiveDeadlineSeconds` is zero at workflowpod.go:109, in which case I get an error like:

If at least a second passes before it reaches workflowpod.go:109 then `wfActiveDeadlineSeconds` is negative and workflowpod.go:110 is triggered. If we're fine with the deadline not applying to `onExit`, how about we modify it to:
@mark9white Ah yes, that would be the correct fix if we wanted to go that route. Let me bring it up with the team today and get feedback on whether
Great. My 2 cents is that it should still run, but applying deadlines set within it.
I'll definitely bring this up!
Thinking about the:
So this never creates a pod and we end up with a pending node. What would clean up that node? Normally it would happen when the pod is iterated by execution control, but since we never created one, that doesn't happen ...
Hey @mark9white, we are going with the approach where the
Happens here:
Force-pushed from ed5d3ec to 6803e6c.
Great, I've just adjusted this PR so it doesn't time out onExit pods. The `nil, nil` results in a Pending node, which I'm afraid the code you pasted skips over in line 790.
I had to trigger a build because of a failure in TestCLISuite/TestLogProblems, but as far as I can see it was a flake and has no relation to this PR. So from my POV, this is ready to squash and merge.
Oh, you're right... it seems to be a fairly recent change that I'd forgotten about: https://github.com/argoproj/argo/pull/2385/files#diff-fcb04129f1d32e69cca32631c6586587R757. Good to keep in mind in case any issues arise.
Back-ported to 2.7 |
We had feedback from our users that they're surprised that onExit handlers do not run when a Workflow is terminated. onExit handlers usually have cleanup routines (cluster deletion, etc.).
FYI: Argo 2.6+ now has a |