Issue #896 - Prometheus metrics and telemetry #935
Merged
Implements #896
This PR instruments Argo with basic workflow metrics and Go runtime telemetry. It is heavily influenced by https://github.com/kubernetes/kube-state-metrics and is disabled by default. It has to be explicitly configured in the ConfigMap, e.g.:
I made it reuse the same informer and let a user override the workflow resync period to get metrics more often (I think 1 minute would be optimal). The other option I am considering is running it in a sidecar, but that makes it harder to ensure the configs do not diverge and that both use the same label selectors. It also seems reasonable for the controller itself to report the metrics, and I can see this being extended into more detailed node transition metrics at some point.
I am looking for feedback and still working on some doc updates.
Example metrics reported:
Example telemetry reported: