
fix: onExit step gets stuck if the workflow exceeds activeDeadlineSeconds. Fixes #2603 #2605

Merged
5 commits merged into argoproj:master on Apr 9, 2020

Conversation

@markterm (Contributor) commented Apr 6, 2020

Checklist:

  • Either (a) I've created an enhancement proposal and discussed it with the community, (b) this is a bug fix, or (c) this is a chore.
  • The title of the PR is (a) conventional, (b) states what changed, and (c) suffixes the related issue number. E.g. "fix(controller): Updates such and such. Fixes #1234".
  • I have written unit and/or e2e tests for my change. PRs without these are unlikely to be merged.
  • Optional. My organization is added to USERS.md.
  • I've signed the CLA and required builds are green.

@alexec changed the title from "onExit step gets stuck if the workflow exceeds activeDeadlineSeconds fixes #2603" to "fix: onExit step gets stuck if the workflow exceeds activeDeadlineSeconds. Fixes #2603" on Apr 6, 2020
@alexec alexec linked an issue Apr 6, 2020 that may be closed by this pull request
@alexec alexec requested a review from simster7 April 6, 2020 16:13
@alexec (Contributor) commented Apr 6, 2020

@simster7 do you want to take a look?

codecov bot commented Apr 6, 2020

Codecov Report

Merging #2605 into master will decrease coverage by 0.00%.
The diff coverage is 14.28%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master    #2605      +/-   ##
==========================================
- Coverage   11.16%   11.16%   -0.01%     
==========================================
  Files          83       83              
  Lines       32673    32675       +2     
==========================================
- Hits         3649     3648       -1     
- Misses      28525    28530       +5     
+ Partials      499      497       -2     
Impacted Files                         Coverage           Δ
persist/sqldb/workflow_archive.go      0.00% <0.00%>      (ø)
workflow/controller/exec_control.go    22.10% <0.00%>     (-2.63%) ⬇️
workflow/controller/workflowpod.go     70.87% <100.00%>   (+0.13%) ⬆️

Continue to review the full report at Codecov.
Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update ffc43ce...c6f8f48.

@simster7 (Member) commented Apr 6, 2020

@simster7 do you want to take a look?

Yes, please don't merge until I've had a chance to look

@simster7 (Member) left a review comment

I'm also unable to reproduce this behavior. I used the Workflow you provided in the issue, the Workflow you provided in this test case, and even this Workflow to attempt to reproduce it:

apiVersion: argoproj.io/v1alpha1
kind: Workflow
metadata:
  name: exit-handler-sleep
spec:
  entrypoint: intentional-fail
  activeDeadlineSeconds: 10
  onExit: exit-handler
  templates:
  - name: intentional-fail
    container:
      image: alpine:latest
      command: [sh, -c]
      args: ["echo intentional failure; sleep 20; exit 1"]
  - name: exit-handler
    container:
      image: alpine:latest
      command: [sh, -c]
      args: ["echo send e-mail: {{workflow.name}} {{workflow.status}}."]

They all finish.

@@ -107,7 +107,7 @@ func (woc *wfOperationCtx) createWorkflowPod(nodeName string, mainCtr apiv1.Cont
 	} else {
 		wfActiveDeadlineSeconds := int64((*wfDeadline).Sub(time.Now().UTC()).Seconds())
 		if wfActiveDeadlineSeconds < 0 {
-			return nil, nil
+			return nil, fmt.Errorf("Scheduling pod after workflow deadline %s", wfDeadline)
@simster7 (Member) commented on this change:

Hmmm, I'm not sure this should be an error.

For one, this would make it such that no Pods (even those in onExit nodes) could be scheduled after the deadline has passed. I'm not entirely convinced, but I think the default behavior should be that onExit nodes still run after the deadline has passed. I'm thinking of teardown code that needs to be run regardless of timeouts.

Secondly, returning an error here would guarantee that the Workflow finishes with an Error phase. I don't think that should be the case: if a workflow fails because of activeDeadlineSeconds, it should finish with a Failed phase, since the cause was internal to the Workflow's execution rather than an error in executing the Workflow itself.

@markterm (Contributor, Author) commented Apr 7, 2020

I had a play, and if the exit handler runs quickly enough then wfActiveDeadlineSeconds is zero at workflowpod.go:109, in which case I get an error like:

Pod "exit-handler-sleep-2733314521" is invalid: spec.activeDeadlineSeconds: Invalid value: 0: must be between 1 and 2147483647, inclusive

If at least a second passes before it reaches workflowpod.go:109 then wfActiveDeadlineSeconds is negative and workflowpod.go:110 is triggered.
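
For illustration, a minimal standalone sketch of that truncation (names are illustrative, not the controller's): with less than a second of the deadline remaining, converting the remaining duration to whole seconds yields 0, which the old "< 0" guard does not catch, and Kubernetes rejects spec.activeDeadlineSeconds values below 1.

	package main

	import (
		"fmt"
		"time"
	)

	func main() {
		wfDeadline := time.Now().UTC().Add(500 * time.Millisecond) // ~0.5s of the deadline remains
		wfActiveDeadlineSeconds := int64(wfDeadline.Sub(time.Now().UTC()).Seconds())
		fmt.Println(wfActiveDeadlineSeconds) // 0 -> invalid pod spec rather than a skipped pod
	}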

If we're fine with the deadline not applying to onExit, how about we modify it to:

	wfDeadline := woc.getWorkflowDeadline()
	if wfDeadline == nil || opts.onExitPod { // ignore the workflow deadline for exit-handler pods so they still run if the deadline has passed
		activeDeadlineSeconds = tmpl.ActiveDeadlineSeconds
	} else {
		wfActiveDeadlineSeconds := int64((*wfDeadline).Sub(time.Now().UTC()).Seconds())
		if wfActiveDeadlineSeconds <= 0 {
			return nil, nil
		} else if tmpl.ActiveDeadlineSeconds == nil || wfActiveDeadlineSeconds < *tmpl.ActiveDeadlineSeconds {
			activeDeadlineSeconds = &wfActiveDeadlineSeconds
		} else {
			activeDeadlineSeconds = tmpl.ActiveDeadlineSeconds
		}
	}
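
As a standalone illustration of the decision table that snippet encodes (a hypothetical pure function for clarity, not the controller's actual API):

	package main

	import (
		"fmt"
		"time"
	)

	// effectiveDeadline returns the pod's activeDeadlineSeconds and whether a pod
	// should be created at all.
	func effectiveDeadline(wfDeadline *time.Time, tmplDeadline *int64, onExitPod bool, now time.Time) (*int64, bool) {
		if wfDeadline == nil || onExitPod {
			// exit-handler pods ignore the workflow deadline so they can still run
			return tmplDeadline, true
		}
		remaining := int64(wfDeadline.Sub(now).Seconds())
		if remaining <= 0 {
			return nil, false // past the workflow deadline: don't create a pod
		}
		if tmplDeadline == nil || remaining < *tmplDeadline {
			return &remaining, true // the workflow deadline is the tighter bound
		}
		return tmplDeadline, true
	}

	func main() {
		now := time.Now().UTC()
		deadline := now.Add(30 * time.Second)
		tmpl := int64(60)

		if d, ok := effectiveDeadline(&deadline, &tmpl, false, now); ok {
			fmt.Println(*d) // 30: the workflow deadline wins for a normal pod
		}
		if d, ok := effectiveDeadline(&deadline, &tmpl, true, now); ok {
			fmt.Println(*d) // 60: an onExit pod keeps only its template deadline
		}
	}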

@simster7 (Member) commented Apr 7, 2020

@mark9white Ah yes, that would be the correct fix if we wanted to go that route.

Let me bring it up with the team today and get feedback on whether onExit should run after activeDeadlineSeconds has passed. I'll let you know what they think and we can go with the approach that is applicable.

@markterm (Contributor, Author) commented Apr 7, 2020 via email

@simster7 (Member) commented Apr 7, 2020

Great. My 2 cents is that it should still run, but applying deadlines set
within it.

I'll definitely bring this up!

@markterm (Contributor, Author) commented Apr 7, 2020

Thinking about the:

return nil, nil

So this never creates a pod and we end up with a pending node; what would clean up that node? Normally it would be cleaned up when the pod is iterated over by execution control, but as we never created one, that doesn't happen...

@simster7 (Member) commented Apr 9, 2020

Hey @mark9white, we are going with the approach where the onExit handler still runs after activeDeadlineSeconds has expired.

Thinking about the:
nil, nil

Happens here:

https://github.com/argoproj/argo/blob/8c29e05cb5befe5f9f0263ff138eab66f75c54d0/workflow/controller/operator.go#L783-L799

@markterm (Contributor, Author) commented Apr 9, 2020

Great, I've just adjusted this PR so it doesn't time out onExit pods.

The nil, nil results in a Pending node, which I'm afraid the code you pasted skips over in line 790.

@markterm (Contributor, Author) commented Apr 9, 2020

I had to re-trigger the build because of a failure in TestCLISuite/TestLogProblems, but as far as I can see it was a flake and has no relation to this PR.

So from my pov, this is ready to squash and merge.

@simster7 (Member) commented Apr 9, 2020

The nil, nil results in a Pending node, which I'm afraid the code you pasted skips over in line 790.

Oh, you're right... it seems to be a fairly recent change (https://github.com/argoproj/argo/pull/2385/files#diff-fcb04129f1d32e69cca32631c6586587R757) that I'd forgotten about. Good to have in mind in case any issues arise.

@simster7 simster7 merged commit 6c685c5 into argoproj:master Apr 9, 2020
@simster7 (Member) commented

Back-ported to 2.7

@Ark-kun (Member) commented Apr 30, 2020

get feedback on whether onExit should run after activeDeadlineSeconds has passed.

We had feedback from our users that they're surprised that onExit handlers do not run when the Workflow is terminated. onExit handlers usually have cleanup routines (cluster deletion, etc.).

@simster7 (Member) commented May 2, 2020

We had feedback from our users that they're surprised that onExit handlers do not run when the Workflow is terminated. onExit handlers usually have cleanup routines (cluster deletion, etc.).

FYI: Argo 2.6+ now has a Stop command that terminates the Workflow and runs all onExit handlers.
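
For reference, stopping a workflow that way looks roughly like this (the workflow name is illustrative):

	# `argo stop` shuts the workflow down but still runs its onExit handlers
	argo stop exit-handler-sleep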

Successfully merging this pull request may close these issues.

onExit step gets stuck if the workflow exceeds activeDeadlineSeconds
4 participants