Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat: self-reporting workflow progress. Fixes #1658, #4245 #6714

Merged
merged 1 commit into from
Nov 3, 2021

Conversation

mweibel
Copy link
Contributor

@mweibel mweibel commented Sep 13, 2021

port of argo-workflow pod progress code to current codebase and extended to handle EOF streams.

most of the code is by @alexec from: https://github.com/argoproj/argo-workflows/pull/4015/files
closes #1658, #4245

I just started to test this code in my cluster, therefore I'd like to wait with considering this PR ready until I'm more confident.
An initial review would be great though!

I didn't implement the part with reporting also a progress message as the whole pod status reporting got changed meanwhile and I'm not sure if we need it anyway.

Thanks :)

Copy link
Contributor

@alexec alexec left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think my original PoC has some issues:

  • It requires tailing the pod logs, and that is expensive. Instead we should build on the pattern that the Emissary uses. The process should write a file to a shared volume (e.g. /var/run/argo/progress) to indicate its progress. I know this is not as user friendly as printing to log files, but I think logs files is too expensive.
  • We want to move away from requiring pod patch permission, to some other mechanism. However, that's a big piece of work, not something to do in this PR. I think we can still run progress on a ticker, but the polling period should be configurable (using env var) maybe starting with 1m rather than 30s. It should only patch if the progress has changed.

@mweibel
Copy link
Contributor Author

mweibel commented Sep 14, 2021

Thanks for the comment. I did wonder about what your thoughts were so that clears things up!
Happy to look into doing the progress monitoring in another way and might come back with questions once I get to it.

@alexec
Copy link
Contributor

alexec commented Sep 28, 2021

Some additional thoughts on this:

  • We want to minimise reads and writes to the Kubernetes API. These create churn on etc and can impact the entire cluster's stability. A workflow that has 10k pods that is suddenly doing 10k reads and write every 30s due to this feature would be a problem.
  • I can see that it is possible to only do writes, and do them as soon as needed. Poll /var/run/argo/progress for changes every 3s, if the file has changed since the last patch, and it is at least 30s since the last pod patch, do the patch. Otherwise, do nothing.

@mweibel
Copy link
Contributor Author

mweibel commented Sep 29, 2021

when doing file based progress watching the question is a bit how to do that for other executors than emissary. Should I mount an emptyDir similar to what emissary does when creating the pod (if not emissary executor) and specify using e.g. an env variable what file to watch for progress?

@alexec
Copy link
Contributor

alexec commented Sep 29, 2021

Should I mount an emptyDir similar to what emissary does when creating the pod

Yes. It only gets mounted for the emissary today, but we can mount it for all pods.

@mweibel
Copy link
Contributor Author

mweibel commented Sep 30, 2021

as I'm continuing working on this I came accross the podReconciliation() function which retrieves all pods and assesses the status for each one.

Would it make sense to move the updating of progress to this function (within performAssessment most likely)?

Reason why I came across this: due to #6647 I need to change the way I access the pods via the informer because they have different keys in the index. Just querying the podInformer using the namespace/nodeID combination doesn't work anymore and I wondered how I could best query the pods. Passing the return value of woc.getAllWorkflowPods() instead of the informer to the updateProgress function is an idea but has some other issues. Maybe adjusting the podReconciliation function makes more sense.

@mweibel mweibel force-pushed the custom-workflow-progress branch 2 times, most recently from e708e5e to c31bbcd Compare September 30, 2021 14:48
@mweibel mweibel requested a review from alexec September 30, 2021 14:48
@mweibel
Copy link
Contributor Author

mweibel commented Sep 30, 2021

re-requested a review to see if that goes into the right direction.

Several things still need to be done:

  • make tickers/timeouts configurable. ConfigMap? Env variable? needs to be discussed, not sure what's best for this case
  • make watching for progress configurable too. It probably adds some load to a system in any case, so I think it should be an optional feature
  • decide about progress message. Do we want to support that or not? If so, how?
  • go over variable naming in monitorProgress again, podPatchTickerTime, progressPollDuration are names which I dislike currently
  • check if there's a better way to discover if pods are running than the approach in monitorProgress currently
  • unit test fails, check if that's related to progress feature
  • e2e test fails, will need to fix it. I haven't been able to run the e2e tests locally so that needs to be done first

@mweibel
Copy link
Contributor Author

mweibel commented Oct 1, 2021

another TODO (mostly for me, but if anyone has inputs on those todos, please chime in!)

  • should we stat the file every 3s and check if modified instead of reading it every 3s?

@mweibel mweibel force-pushed the custom-workflow-progress branch 2 times, most recently from 9b6f526 to 64987e3 Compare October 1, 2021 15:23
workflow/executor/executor.go Outdated Show resolved Hide resolved
workflow/progress/updater.go Outdated Show resolved Hide resolved
workflow/controller/operator.go Outdated Show resolved Hide resolved
@mweibel mweibel force-pushed the custom-workflow-progress branch 3 times, most recently from 634bbad to 2accf4c Compare October 5, 2021 14:30
config/config.go Outdated Show resolved Hide resolved
Copy link
Contributor

@alexec alexec left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this is looking great! I'm going to keep the comments coming

cmd/argoexec/commands/root.go Outdated Show resolved Hide resolved
workflow/controller/operator.go Show resolved Hide resolved
workflow/controller/operator.go Outdated Show resolved Hide resolved
workflow/controller/operator_test.go Show resolved Hide resolved
workflow/controller/workflowpod.go Outdated Show resolved Hide resolved
workflow/executor/executor.go Outdated Show resolved Hide resolved
workflow/executor/executor.go Outdated Show resolved Hide resolved
workflow/executor/executor.go Outdated Show resolved Hide resolved
workflow/executor/executor.go Show resolved Hide resolved
workflow/executor/executor.go Outdated Show resolved Hide resolved
@mweibel mweibel force-pushed the custom-workflow-progress branch 4 times, most recently from 784bd4c to 63a2da0 Compare October 7, 2021 13:34
@codecov
Copy link

codecov bot commented Oct 7, 2021

Codecov Report

Merging #6714 (950f234) into master (02165aa) will increase coverage by 0.16%.
The diff coverage is 59.55%.

❗ Current head 950f234 differs from pull request most recent head 98b8d56. Consider uploading reports for the commit 98b8d56 to get more accurate results
Impacted file tree graph

@@            Coverage Diff             @@
##           master    #6714      +/-   ##
==========================================
+ Coverage   48.58%   48.75%   +0.16%     
==========================================
  Files         267      265       -2     
  Lines       19457    19395      -62     
==========================================
+ Hits         9454     9456       +2     
+ Misses       8946     8881      -65     
- Partials     1057     1058       +1     
Impacted Files Coverage Δ
cmd/argoexec/commands/root.go 2.63% <0.00%> (-0.08%) ⬇️
workflow/common/common.go 80.00% <ø> (ø)
workflow/controller/controller.go 22.53% <0.00%> (ø)
workflow/progress/updater.go 65.78% <21.42%> (-27.76%) ⬇️
workflow/executor/executor.go 30.93% <51.25%> (+6.82%) ⬆️
pkg/apis/workflow/v1alpha1/progress.go 90.00% <100.00%> (+1.11%) ⬆️
workflow/controller/operator.go 71.53% <100.00%> (+0.14%) ⬆️
workflow/controller/workflowpod.go 74.45% <100.00%> (+0.55%) ⬆️
server/auth/gatekeeper.go 50.00% <0.00%> (-8.38%) ⬇️
workflow/util/pod_name.go 46.66% <0.00%> (-1.34%) ⬇️
... and 15 more

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 02165aa...98b8d56. Read the comment docs.

@mweibel
Copy link
Contributor Author

mweibel commented Oct 7, 2021

To be able to test the progress feature in an e2e test I need to adjust the workflow-controller-configmap for that specific test. I couldn't find an example how to do that, is this not yet a case covered?

docs/progress.md Outdated Show resolved Hide resolved
workflow/controller/workflowpod.go Outdated Show resolved Hide resolved
docs/progress.md Outdated Show resolved Hide resolved
workflow/executor/executor.go Show resolved Hide resolved
workflow/executor/executor_test.go Show resolved Hide resolved
@mweibel mweibel force-pushed the custom-workflow-progress branch 2 times, most recently from 94a412a to 11a7e34 Compare October 26, 2021 14:58
@alexec
Copy link
Contributor

alexec commented Oct 26, 2021

the e2e failure is due to the fact that I had to adjust argosay to be able to specify a custom sleep duration when doing an echo. This would need a new build of argosay before doing the e2e test I believe. Could anyone do that or recommend a different approach?

I would just use a script to do this rather then force a new argosay image. Argosay does contain a shell.

Copy link
Contributor

@alexec alexec left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm going to push you to simplify.

Try syncing with master to fix your builds.

workflow/controller/workflowpod_test.go Outdated Show resolved Hide resolved
workflow/executor/executor.go Show resolved Hide resolved
workflow/executor/executor.go Outdated Show resolved Hide resolved
workflow/executor/executor.go Show resolved Hide resolved
workflow/executor/executor.go Outdated Show resolved Hide resolved
workflow/executor/executor.go Outdated Show resolved Hide resolved
Copy link
Contributor

@alexec alexec left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

look great btw!

@alexec alexec changed the title feat: self reporting workflow progress feat: self-reporting workflow progress. Fixes #1658, #4245 Oct 26, 2021
@mweibel mweibel requested a review from alexec October 28, 2021 10:12
@mweibel mweibel force-pushed the custom-workflow-progress branch 2 times, most recently from 51fee1b to be986ab Compare October 28, 2021 12:26
Copy link
Contributor

@alexec alexec left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

  • One comment on what looks like a bug.
  • Please sync with master.
  • Please dismiss this review when done.

workflow/controller/workflowpod.go Outdated Show resolved Hide resolved
@mweibel mweibel force-pushed the custom-workflow-progress branch 2 times, most recently from 83dc731 to 966e677 Compare November 1, 2021 07:41
@mweibel mweibel requested a review from alexec November 1, 2021 08:47
@mweibel mweibel force-pushed the custom-workflow-progress branch 4 times, most recently from 2ccc723 to 747e2e8 Compare November 3, 2021 10:27
@@ -77,6 +79,12 @@ type WorkflowExecutor struct {
// list of errors that occurred during execution.
// the first of these is used as the overall message of the node
errors []error

// current progress which is synced every `annotationPatchTickDuration` to the pods annotations.
progress string
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

minor - could be Progress type?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I can make a PR to change this, probably tomorrow 👍

@alexec alexec merged commit 7886a2b into argoproj:master Nov 3, 2021
@simster7
Copy link
Member

simster7 commented Nov 3, 2021

This is pretty cool! 🚀

@alexec
Copy link
Contributor

alexec commented Nov 3, 2021

I think this is a neat feature, as esp useful for CI use cases, or long running tasks.

@mweibel
Copy link
Contributor Author

mweibel commented Nov 3, 2021

amazing, thanks for merging and the great reviews @alexec! We started using this already and it works well so far (though we used a to version 3.2 backported branch of it). Looking forward to the release :)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Workflows self-reporting N/M progress Support progress and/or rate indication
4 participants