
fix: onExit step gets stuck if the workflow exceeds activeDeadlineSeconds. Fixes #2603 #2605

Merged
5 commits merged into argoproj:master on Apr 9, 2020

Conversation

@markterm (Contributor) commented Apr 6, 2020

Checklist:

  • Either (a) I've created an enhancement proposal and discussed it with the community, (b) this is a bug fix, or (c) this is a chore.
  • The title of the PR is (a) conventional, (b) states what changed, and (c) suffixes the related issue number. E.g. "fix(controller): Updates such and such. Fixes #1234".
  • I have written unit and/or e2e tests for my change. PRs without these are unlikely to be merged.
  • Optional. My organization is added to USERS.md.
  • I've signed the CLA and required builds are green.

@alexec changed the title from "onExit step gets stuck if the workflow exceeds activeDeadlineSeconds fixes #2603" to "fix: onExit step gets stuck if the workflow exceeds activeDeadlineSeconds. Fixes #2603" on Apr 6, 2020
@alexec alexec linked an issue Apr 6, 2020 that may be closed by this pull request
@alexec alexec requested a review from simster7 April 6, 2020 16:13
@alexec (Contributor) commented Apr 6, 2020

@simster7 do you want to take a look?

codecov bot commented Apr 6, 2020

Codecov Report

Merging #2605 into master will decrease coverage by 0.00%.
The diff coverage is 14.28%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master    #2605      +/-   ##
==========================================
- Coverage   11.16%   11.16%   -0.01%     
==========================================
  Files          83       83              
  Lines       32673    32675       +2     
==========================================
- Hits         3649     3648       -1     
- Misses      28525    28530       +5     
+ Partials      499      497       -2     
Impacted Files                         Coverage           Δ
persist/sqldb/workflow_archive.go      0.00% <0.00%>      (ø)
workflow/controller/exec_control.go    22.10% <0.00%>     (-2.63%) ⬇️
workflow/controller/workflowpod.go     70.87% <100.00%>   (+0.13%) ⬆️

Continue to review the full report at Codecov.
Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update ffc43ce...c6f8f48.

@simster7 (Member) commented Apr 6, 2020

@simster7 do you want to take a look?

Yes, please don't merge until I've had a chance to look

@simster7 (Member) left a review comment

I'm also unable to reproduce this behavior. I used the Workflow you provided in the issue, the Workflow you provided in this test case, and even this Workflow to attempt to reproduce it:

apiVersion: argoproj.io/v1alpha1
kind: Workflow
metadata:
  name: exit-handler-sleep
spec:
  entrypoint: intentional-fail
  activeDeadlineSeconds: 10
  onExit: exit-handler
  templates:
  - name: intentional-fail
    container:
      image: alpine:latest
      command: [sh, -c]
      args: ["echo intentional failure; sleep 20; exit 1"]
  - name: exit-handler
    container:
      image: alpine:latest
      command: [sh, -c]
      args: ["echo send e-mail: {{workflow.name}} {{workflow.status}}."]

They all finish.

@@ -107,7 +107,7 @@ func (woc *wfOperationCtx) createWorkflowPod(nodeName string, mainCtr apiv1.Cont
 	} else {
 		wfActiveDeadlineSeconds := int64((*wfDeadline).Sub(time.Now().UTC()).Seconds())
 		if wfActiveDeadlineSeconds < 0 {
-			return nil, nil
+			return nil, fmt.Errorf("Scheduling pod after workflow deadline %s", wfDeadline)
@simster7 (Member) commented on this change:

Hmmm, I'm not sure this should be an error.

For one, this would make it such that no Pods (even those in onExit nodes) could be scheduled after the deadline has passed. I'm not entirely convinced, but I think the default behavior should be that onExit nodes still run after the deadline has passed. I'm thinking of teardown code that needs to be run regardless of timeouts.

Secondly, returning an error here would guarantee that the Workflow finishes with an Error phase. I don't think that should be the case: if a workflow fails because of activeDeadlineSeconds, it should finish with a Failed phase, since the cause was internal to the Workflow's execution rather than an error in executing the Workflow itself.

@markterm (Contributor, Author) commented Apr 7, 2020

I had a play, and if the exit handler runs quickly enough then wfActiveDeadlineSeconds is zero at workflowpod.go:109, in which case I get an error like:

Pod "exit-handler-sleep-2733314521" is invalid: spec.activeDeadlineSeconds: Invalid value: 0: must be between 1 and 2147483647, inclusive

If at least a second passes before it reaches workflowpod.go:109 then wfActiveDeadlineSeconds is negative and workflowpod.go:110 is triggered.
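
For illustration, a minimal standalone sketch of that truncation (names are illustrative, not the controller's): with less than a second of the deadline remaining, converting the remaining duration to whole seconds yields 0, which the old "< 0" guard does not catch, and Kubernetes rejects spec.activeDeadlineSeconds values below 1.

	package main

	import (
		"fmt"
		"time"
	)

	func main() {
		wfDeadline := time.Now().UTC().Add(500 * time.Millisecond) // ~0.5s of the deadline remains
		wfActiveDeadlineSeconds := int64(wfDeadline.Sub(time.Now().UTC()).Seconds())
		fmt.Println(wfActiveDeadlineSeconds) // 0 -> invalid pod spec rather than a skipped pod
	}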

If we're fine with the deadline not applying to onExit, how about we modify it to:

	wfDeadline := woc.getWorkflowDeadline()
	if wfDeadline == nil || opts.onExitPod { // ignore the workflow deadline for exit-handler pods so they still run if the deadline has passed
		activeDeadlineSeconds = tmpl.ActiveDeadlineSeconds
	} else {
		wfActiveDeadlineSeconds := int64((*wfDeadline).Sub(time.Now().UTC()).Seconds())
		if wfActiveDeadlineSeconds <= 0 {
			return nil, nil
		} else if tmpl.ActiveDeadlineSeconds == nil || wfActiveDeadlineSeconds < *tmpl.ActiveDeadlineSeconds {
			activeDeadlineSeconds = &wfActiveDeadlineSeconds
		} else {
			activeDeadlineSeconds = tmpl.ActiveDeadlineSeconds
		}
	}
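
As a standalone illustration of the decision table that snippet encodes (a hypothetical pure function for clarity, not the controller's actual API):

	package main

	import (
		"fmt"
		"time"
	)

	// effectiveDeadline returns the pod's activeDeadlineSeconds and whether a pod
	// should be created at all.
	func effectiveDeadline(wfDeadline *time.Time, tmplDeadline *int64, onExitPod bool, now time.Time) (*int64, bool) {
		if wfDeadline == nil || onExitPod {
			// exit-handler pods ignore the workflow deadline so they can still run
			return tmplDeadline, true
		}
		remaining := int64(wfDeadline.Sub(now).Seconds())
		if remaining <= 0 {
			return nil, false // past the workflow deadline: don't create a pod
		}
		if tmplDeadline == nil || remaining < *tmplDeadline {
			return &remaining, true // the workflow deadline is the tighter bound
		}
		return tmplDeadline, true
	}

	func main() {
		now := time.Now().UTC()
		deadline := now.Add(30 * time.Second)
		tmpl := int64(60)

		if d, ok := effectiveDeadline(&deadline, &tmpl, false, now); ok {
			fmt.Println(*d) // 30: the workflow deadline wins for a normal pod
		}
		if d, ok := effectiveDeadline(&deadline, &tmpl, true, now); ok {
			fmt.Println(*d) // 60: an onExit pod keeps only its template deadline
		}
	}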

@simster7 (Member) commented Apr 7, 2020

@mark9white Ah yes, that would be the correct fix if we wanted to go that route.

Let me bring it up with the team today and get feedback on whether onExit should run after activeDeadlineSeconds has passed. I'll let you know what they think and we can go with the approach that is applicable.

@markterm (Contributor, Author) commented Apr 7, 2020 via email

@simster7 (Member) commented Apr 7, 2020

Great. My 2 cents is that it should still run, but applying deadlines set
within it.

I'll definitely bring this up!

@markterm (Contributor, Author) commented Apr 7, 2020

Thinking about the:

return nil, nil

So this never creates a pod and we end up with a pending node; what would clean up that node? Normally it would be cleaned up when the pod is iterated over by execution control, but as we never created one, that doesn't happen...

@simster7 (Member) commented Apr 9, 2020

Hey @mark9white, we are going with the approach where the onExit handler still runs after activeDeadlineSeconds has expired.

Thinking about the:
nil, nil

Happens here:

https://github.com/argoproj/argo/blob/8c29e05cb5befe5f9f0263ff138eab66f75c54d0/workflow/controller/operator.go#L783-L799

@markterm (Contributor, Author) commented Apr 9, 2020

Great, I've just adjusted this PR so it doesn't time out onExit pods.

The nil, nil results in a Pending node, which I'm afraid the code you pasted skips over in line 790.

@markterm (Contributor, Author) commented Apr 9, 2020

I had to re-trigger the build because of a failure in TestCLISuite/TestLogProblems, but as far as I can see it was a flake and has no relation to this PR.

So from my pov, this is ready to squash and merge.

@simster7 (Member) commented Apr 9, 2020

The nil, nil results in a Pending node, which I'm afraid the code you pasted skips over in line 790.

Oh, you're right... it seems to be a fairly recent change (https://github.com/argoproj/argo/pull/2385/files#diff-fcb04129f1d32e69cca32631c6586587R757) that I'd forgotten about. Good to have in mind in case any issues arise.

@simster7 simster7 merged commit 6c685c5 into argoproj:master Apr 9, 2020
@simster7 (Member) commented

Back-ported to 2.7

@Ark-kun (Member) commented Apr 30, 2020

get feedback on whether onExit should run after activeDeadlineSeconds has passed.

We had feedback from our users that they're surprised that onExit handlers do not run when the Workflow is terminated. onExit handlers usually have cleanup routines (cluster deletion, etc.).

@simster7 (Member) commented May 2, 2020

We had feedback from our users that they're surprised that onExit handlers do not run when the Workflow is terminated. onExit handlers usually have cleanup routines (cluster deletion, etc.).

FYI: Argo 2.6+ now has a Stop command that terminates the Workflow and runs all onExit handlers.
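
For reference, stopping a workflow that way looks roughly like this (the workflow name is illustrative):

	# `argo stop` shuts the workflow down but still runs its onExit handlers
	argo stop exit-handler-sleep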

Successfully merging this pull request may close these issues.

onExit step gets stuck if the workflow exceeds activeDeadlineSeconds
4 participants