feat(controller): Workflow-level retryStrategy/resubmit pending pods by default. Closes argoproj#3918 (argoproj#3965)
alexec committed Sep 21, 2020
1 parent d7a297c commit fdf0b05
Showing 24 changed files with 892 additions and 614 deletions.
12 changes: 8 additions & 4 deletions api/openapi-spec/swagger.json
@@ -3769,10 +3769,6 @@
"description": "Resource template subtype which can run k8s resources",
"$ref": "#/definitions/io.argoproj.workflow.v1alpha1.ResourceTemplate"
},
"resubmitPendingPods": {
"description": "ResubmitPendingPods is a flag to enable resubmitting pods that remain Pending after initial submission",
"type": "boolean"
},
"retryStrategy": {
"description": "RetryStrategy describes how to retry a template when it fails",
"$ref": "#/definitions/io.argoproj.workflow.v1alpha1.RetryStrategy"
@@ -4412,6 +4408,10 @@
"type": "integer",
"format": "int32"
},
"retryStrategy": {
"description": "RetryStrategy for all templates in the io.argoproj.workflow.v1alpha1.",
"$ref": "#/definitions/io.argoproj.workflow.v1alpha1.RetryStrategy"
},
"schedulerName": {
"description": "Set scheduler name for all pods. Will be overridden if container/script template's scheduler name is set. Default scheduler will be used if neither specified.",
"type": "string"
@@ -4867,6 +4867,10 @@
"type": "integer",
"format": "int32"
},
"retryStrategy": {
"description": "RetryStrategy for all templates in the io.argoproj.workflow.v1alpha1.",
"$ref": "#/definitions/io.argoproj.workflow.v1alpha1.RetryStrategy"
},
"schedulerName": {
"description": "Set scheduler name for all pods. Will be overridden if container/script template's scheduler name is set. Default scheduler will be used if neither specified.",
"type": "string"
123 changes: 68 additions & 55 deletions docs/fields.md
@@ -635,6 +635,7 @@ WorkflowSpec is the specification of a Workflow.
|`podPriorityClassName`|`string`|PriorityClassName to apply to workflow pods.|
|`podSpecPatch`|`string`|PodSpecPatch holds strategic merge patch to apply against the pod spec. Allows parameterization of container fields which are not strings (e.g. resource limits).|
|`priority`|`int32`|Priority is used if controller is configured to process limited number of workflows in parallel. Workflows with higher priority are processed first.|
|`retryStrategy`|[`RetryStrategy`](#retrystrategy)|RetryStrategy for all templates in the io.argoproj.workflow.v1alpha1.|
|`schedulerName`|`string`|Set scheduler name for all pods. Will be overridden if container/script template's scheduler name is set. Default scheduler will be used if neither specified.|
|`securityContext`|[`PodSecurityContext`](#podsecuritycontext)|SecurityContext holds pod-level security attributes and common container settings. Optional: Defaults to empty. See type description for default values of each field.|
|`serviceAccountName`|`string`|ServiceAccountName is the name of the ServiceAccount to run all pods of the workflow as.|
@@ -1287,6 +1288,7 @@ WorkflowTemplateSpec is a spec of WorkflowTemplate.
|`podPriorityClassName`|`string`|PriorityClassName to apply to workflow pods.|
|`podSpecPatch`|`string`|PodSpecPatch holds strategic merge patch to apply against the pod spec. Allows parameterization of container fields which are not strings (e.g. resource limits).|
|`priority`|`int32`|Priority is used if controller is configured to process limited number of workflows in parallel. Workflows with higher priority are processed first.|
|`retryStrategy`|[`RetryStrategy`](#retrystrategy)|RetryStrategy for all templates in the io.argoproj.workflow.v1alpha1.|
|`schedulerName`|`string`|Set scheduler name for all pods. Will be overridden if container/script template's scheduler name is set. Default scheduler will be used if neither specified.|
|`securityContext`|[`PodSecurityContext`](#podsecuritycontext)|SecurityContext holds pod-level security attributes and common container settings. Optional: Defaults to empty. See type description for default values of each field.|
|`serviceAccountName`|`string`|ServiceAccountName is the name of the ServiceAccount to run all pods of the workflow as.|
@@ -1522,6 +1524,40 @@ PodGC describes how to delete completed pods as they complete
|:----------:|:----------:|---------------|
|`strategy`|`string`|Strategy is the strategy to use. One of "OnPodCompletion", "OnPodSuccess", "OnWorkflowCompletion", "OnWorkflowSuccess"|

## RetryStrategy

RetryStrategy provides controls on how to retry a workflow step

<details>
<summary>Examples with this field (click to open)</summary>
<br>

- [`clustertemplates.yaml`](https://github.com/argoproj/argo/blob/master/examples/cluster-workflow-template/clustertemplates.yaml)

- [`dag-disable-failFast.yaml`](https://github.com/argoproj/argo/blob/master/examples/dag-disable-failFast.yaml)

- [`retry-backoff.yaml`](https://github.com/argoproj/argo/blob/master/examples/retry-backoff.yaml)

- [`retry-container-to-completion.yaml`](https://github.com/argoproj/argo/blob/master/examples/retry-container-to-completion.yaml)

- [`retry-container.yaml`](https://github.com/argoproj/argo/blob/master/examples/retry-container.yaml)

- [`retry-on-error.yaml`](https://github.com/argoproj/argo/blob/master/examples/retry-on-error.yaml)

- [`retry-script.yaml`](https://github.com/argoproj/argo/blob/master/examples/retry-script.yaml)

- [`retry-with-steps.yaml`](https://github.com/argoproj/argo/blob/master/examples/retry-with-steps.yaml)

- [`templates.yaml`](https://github.com/argoproj/argo/blob/master/examples/workflow-template/templates.yaml)
</details>

### Fields
| Field Name | Field Type | Description |
|:----------:|:----------:|---------------|
|`backoff`|[`Backoff`](#backoff)|Backoff is a backoff strategy|
|`limit`|[`IntOrString`](#intorstring)|Limit is the maximum number of attempts when retrying a container|
|`retryPolicy`|`string`|RetryPolicy is a policy of NodePhase statuses that will be retried|
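
To illustrate how the fields in the table above fit together, here is a minimal sketch of a `retryStrategy` block. The values are illustrative assumptions, not settings taken from this commit, and the sketch assumes `retryPolicy` accepts `Always`, `OnFailure`, or `OnError`.

```yaml
# Illustrative values only; not part of this commit.
retryStrategy:
  limit: 3               # maximum number of retry attempts (int or string)
  retryPolicy: OnError   # assumed accepted values: Always, OnFailure, OnError
```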

## Synchronization

Synchronization holds synchronization lock configuration
@@ -1844,7 +1880,6 @@ Template is a reusable and composable unit of execution in a workflow
|`priority`|`int32`|Priority to apply to workflow pods.|
|`priorityClassName`|`string`|PriorityClassName to apply to workflow pods.|
|`resource`|[`ResourceTemplate`](#resourcetemplate)|Resource template subtype which can run k8s resources|
|`resubmitPendingPods`|`boolean`|ResubmitPendingPods is a flag to enable resubmitting pods that remain Pending after initial submission|
|`retryStrategy`|[`RetryStrategy`](#retrystrategy)|RetryStrategy describes how to retry a template when it fails|
|`schedulerName`|`string`|If specified, the pod will be dispatched by specified scheduler. Or it will be dispatched by workflow scope scheduler if specified. If neither specified, the pod will be dispatched by default scheduler.|
|`script`|[`ScriptTemplate`](#scripttemplate)|Script runs a portion of code against an interpreter|
@@ -2305,6 +2340,24 @@ Prometheus is a prometheus metric to be emitted
|`name`|`string`|Name is the name of the metric|
|`when`|`string`|When is a conditional statement that decides when to emit the metric|

## Backoff

Backoff is a backoff strategy to use within retryStrategy

<details>
<summary>Examples with this field (click to open)</summary>
<br>

- [`retry-backoff.yaml`](https://github.com/argoproj/argo/blob/master/examples/retry-backoff.yaml)
</details>

### Fields
| Field Name | Field Type | Description |
|:----------:|:----------:|---------------|
|`duration`|`string`|Duration is the amount to back off. Default unit is seconds, but could also be a duration (e.g. "2m", "1h")|
|`factor`|[`IntOrString`](#intorstring)|Factor is a factor to multiply the base duration after each failed retry|
|`maxDuration`|`string`|MaxDuration is the maximum amount of time allowed for the backoff strategy|
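
As a rough sketch of how the three `backoff` fields above combine inside a `retryStrategy`, consider the fragment below. The numbers are assumptions chosen for illustration; only the field names come from the table.

```yaml
# Illustrative values only; not part of this commit.
retryStrategy:
  limit: 5
  backoff:
    duration: "10"      # base delay; a bare number is read as seconds
    factor: 2           # multiplier applied to the delay after each failed retry
    maxDuration: "5m"   # maximum time allowed for the backoff strategy
```

With these assumed values the delay between attempts would grow as 10s, 20s, 40s, and so on, until either `limit` or `maxDuration` is reached.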

## Mutex

Mutex holds Mutex configuration
@@ -2956,40 +3009,6 @@ ResourceTemplate is a template subtype to manipulate kubernetes resources
|`setOwnerReference`|`boolean`|SetOwnerReference sets the reference to the workflow on the OwnerReference of generated resource.|
|`successCondition`|`string`|SuccessCondition is a label selector expression which describes the conditions of the k8s resource in which it is acceptable to proceed to the following step|

## RetryStrategy

RetryStrategy provides controls on how to retry a workflow step

<details>
<summary>Examples with this field (click to open)</summary>
<br>

- [`clustertemplates.yaml`](https://github.com/argoproj/argo/blob/master/examples/cluster-workflow-template/clustertemplates.yaml)

- [`dag-disable-failFast.yaml`](https://github.com/argoproj/argo/blob/master/examples/dag-disable-failFast.yaml)

- [`retry-backoff.yaml`](https://github.com/argoproj/argo/blob/master/examples/retry-backoff.yaml)

- [`retry-container-to-completion.yaml`](https://github.com/argoproj/argo/blob/master/examples/retry-container-to-completion.yaml)

- [`retry-container.yaml`](https://github.com/argoproj/argo/blob/master/examples/retry-container.yaml)

- [`retry-on-error.yaml`](https://github.com/argoproj/argo/blob/master/examples/retry-on-error.yaml)

- [`retry-script.yaml`](https://github.com/argoproj/argo/blob/master/examples/retry-script.yaml)

- [`retry-with-steps.yaml`](https://github.com/argoproj/argo/blob/master/examples/retry-with-steps.yaml)

- [`templates.yaml`](https://github.com/argoproj/argo/blob/master/examples/workflow-template/templates.yaml)
</details>

### Fields
| Field Name | Field Type | Description |
|:----------:|:----------:|---------------|
|`backoff`|[`Backoff`](#backoff)|Backoff is a backoff strategy|
|`limit`|[`IntOrString`](#intorstring)|Limit is the maximum number of attempts when retrying a container|
|`retryPolicy`|`string`|RetryPolicy is a policy of NodePhase statuses that will be retried|

## ScriptTemplate

ScriptTemplate is a template subtype to enable scripting through code steps
@@ -3739,24 +3758,6 @@ _No description available_
|:----------:|:----------:|---------------|
|`configMap`|[`ConfigMapKeySelector`](#configmapkeyselector)|_No description available_|

## Backoff

Backoff is a backoff strategy to use within retryStrategy

<details>
<summary>Examples with this field (click to open)</summary>
<br>

- [`retry-backoff.yaml`](https://github.com/argoproj/argo/blob/master/examples/retry-backoff.yaml)
</details>

### Fields
| Field Name | Field Type | Description |
|:----------:|:----------:|---------------|
|`duration`|`string`|Duration is the amount to back off. Default unit is seconds, but could also be a duration (e.g. "2m", "1h")|
|`factor`|[`IntOrString`](#intorstring)|Factor is a factor to multiply the base duration after each failed retry|
|`maxDuration`|`string`|MaxDuration is the maximum amount of time allowed for the backoff strategy|

## ContinueOn

ContinueOn defines if a workflow should continue even if a task or step fails/errors. It can be specified if the workflow should continue when the pod errors, fails or both.
@@ -4447,9 +4448,21 @@ IntOrString is a type that can hold an int32 or a string. When used in JSON or
<summary>Examples with this field (click to open)</summary>
<br>

- [`timeouts-step.yaml`](https://github.com/argoproj/argo/blob/master/examples/timeouts-step.yaml)
- [`clustertemplates.yaml`](https://github.com/argoproj/argo/blob/master/examples/cluster-workflow-template/clustertemplates.yaml)

- [`timeouts-workflow.yaml`](https://github.com/argoproj/argo/blob/master/examples/timeouts-workflow.yaml)
- [`dag-disable-failFast.yaml`](https://github.com/argoproj/argo/blob/master/examples/dag-disable-failFast.yaml)

- [`retry-backoff.yaml`](https://github.com/argoproj/argo/blob/master/examples/retry-backoff.yaml)

- [`retry-container.yaml`](https://github.com/argoproj/argo/blob/master/examples/retry-container.yaml)

- [`retry-on-error.yaml`](https://github.com/argoproj/argo/blob/master/examples/retry-on-error.yaml)

- [`retry-script.yaml`](https://github.com/argoproj/argo/blob/master/examples/retry-script.yaml)

- [`retry-with-steps.yaml`](https://github.com/argoproj/argo/blob/master/examples/retry-with-steps.yaml)

- [`templates.yaml`](https://github.com/argoproj/argo/blob/master/examples/workflow-template/templates.yaml)
</details>

## Container
3 changes: 2 additions & 1 deletion docs/swagger.md
@@ -1541,7 +1541,6 @@ Template is a reusable and composable unit of execution in a workflow
| priority | integer | Priority to apply to workflow pods. | No |
| priorityClassName | string | PriorityClassName to apply to workflow pods. | No |
| resource | [io.argoproj.workflow.v1alpha1.ResourceTemplate](#io.argoproj.workflow.v1alpha1.resourcetemplate) | Resource template subtype which can run k8s resources | No |
| resubmitPendingPods | boolean | ResubmitPendingPods is a flag to enable resubmitting pods that remain Pending after initial submission | No |
| retryStrategy | [io.argoproj.workflow.v1alpha1.RetryStrategy](#io.argoproj.workflow.v1alpha1.retrystrategy) | RetryStrategy describes how to retry a template when it fails | No |
| schedulerName | string | If specified, the pod will be dispatched by specified scheduler. Or it will be dispatched by workflow scope scheduler if specified. If neither specified, the pod will be dispatched by default scheduler. | No |
| script | [io.argoproj.workflow.v1alpha1.ScriptTemplate](#io.argoproj.workflow.v1alpha1.scripttemplate) | Script runs a portion of code against an interpreter | No |
@@ -1769,6 +1768,7 @@ WorkflowSpec is the specification of a Workflow.
| podPriorityClassName | string | PriorityClassName to apply to workflow pods. | No |
| podSpecPatch | string | PodSpecPatch holds strategic merge patch to apply against the pod spec. Allows parameterization of container fields which are not strings (e.g. resource limits). | No |
| priority | integer | Priority is used if controller is configured to process limited number of workflows in parallel. Workflows with higher priority are processed first. | No |
| retryStrategy | [io.argoproj.workflow.v1alpha1.RetryStrategy](#io.argoproj.workflow.v1alpha1.retrystrategy) | RetryStrategy for all templates in the io.argoproj.workflow.v1alpha1. | No |
| schedulerName | string | Set scheduler name for all pods. Will be overridden if container/script template's scheduler name is set. Default scheduler will be used if neither specified. | No |
| securityContext | [io.k8s.api.core.v1.PodSecurityContext](#io.k8s.api.core.v1.podsecuritycontext) | SecurityContext holds pod-level security attributes and common container settings. Optional: Defaults to empty. See type description for default values of each field. | No |
| serviceAccountName | string | ServiceAccountName is the name of the ServiceAccount to run all pods of the workflow as. | No |
@@ -1928,6 +1928,7 @@ WorkflowTemplateSpec is a spec of WorkflowTemplate.
| podPriorityClassName | string | PriorityClassName to apply to workflow pods. | No |
| podSpecPatch | string | PodSpecPatch holds strategic merge patch to apply against the pod spec. Allows parameterization of container fields which are not strings (e.g. resource limits). | No |
| priority | integer | Priority is used if controller is configured to process limited number of workflows in parallel. Workflows with higher priority are processed first. | No |
| retryStrategy | [io.argoproj.workflow.v1alpha1.RetryStrategy](#io.argoproj.workflow.v1alpha1.retrystrategy) | RetryStrategy for all templates in the io.argoproj.workflow.v1alpha1. | No |
| schedulerName | string | Set scheduler name for all pods. Will be overridden if container/script template's scheduler name is set. Default scheduler will be used if neither specified. | No |
| securityContext | [io.k8s.api.core.v1.PodSecurityContext](#io.k8s.api.core.v1.podsecuritycontext) | SecurityContext holds pod-level security attributes and common container settings. Optional: Defaults to empty. See type description for default values of each field. | No |
| serviceAccountName | string | ServiceAccountName is the name of the ServiceAccount to run all pods of the workflow as. | No |
38 changes: 38 additions & 0 deletions docs/tolerating-pod-deletion.md
@@ -0,0 +1,38 @@
# Tolerating Pod Deletion

> v2.12 and after

In Kubernetes, pods are cattle and can be deleted at any time. Deletion can happen manually via `kubectl delete pod`, during a node drain, or for other reasons.

This can be very inconvenient: your workflow will error for reasons outside of your control.

A [pod disruption budget](examples/default-pdb-support.yaml) can reduce the likelihood of this happening, but it cannot entirely prevent it.

To retry pods that were deleted, set `retryStrategy.retryPolicy: OnError`.

This can be set at the workflow level, at the template level, or globally (using [workflow defaults](default-workflow-specs.md)).
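
As a sketch of the first two levels, the workflow below sets a workflow-level `retryStrategy` and a different one on a single template. The workflow name and the retry values are hypothetical, and the sketch assumes that a template-level `retryStrategy` takes precedence over the workflow-level one, which the text here does not state explicitly.

```yaml
apiVersion: argoproj.io/v1alpha1
kind: Workflow
metadata:
  generateName: retry-levels-   # hypothetical name
spec:
  entrypoint: main
  retryStrategy:                # workflow-level: applies to templates without their own
    retryPolicy: OnError
    limit: 1
  templates:
    - name: main
      retryStrategy:            # template-level: assumed to override the workflow-level value
        retryPolicy: Always
        limit: 3
      container:
        image: docker/whalesay:latest
        command:
          - sleep
          - 30s
```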

## Example

Run the following workflow (which will sleep for 30s):

```yaml
apiVersion: argoproj.io/v1alpha1
kind: Workflow
metadata:
  name: example
spec:
  retryStrategy:
    retryPolicy: OnError
    limit: 1
  entrypoint: main
  templates:
    - name: main
      container:
        image: docker/whalesay:latest
        command:
          - sleep
          - 30s
```

Then execute `kubectl delete pod example`. You'll see that the errored node is automatically retried.
24 changes: 22 additions & 2 deletions manifests/base/crds/full/argoproj.io_clusterworkflowtemplates.yaml
@@ -792,6 +792,28 @@ spec:
priority:
  format: int32
  type: integer
retryStrategy:
  properties:
    backoff:
      properties:
        duration:
          type: string
        factor:
          anyOf:
          - type: integer
          - type: string
          x-kubernetes-int-or-string: true
        maxDuration:
          type: string
      type: object
    limit:
      anyOf:
      - type: integer
      - type: string
      x-kubernetes-int-or-string: true
    retryPolicy:
      type: string
  type: object
schedulerName:
  type: string
securityContext:
@@ -3999,8 +4021,6 @@ spec:
required:
- action
type: object
resubmitPendingPods:
type: boolean
retryStrategy:
properties:
backoff:
24 changes: 22 additions & 2 deletions manifests/base/crds/full/argoproj.io_cronworkflows.yaml
@@ -813,6 +813,28 @@ spec:
priority:
  format: int32
  type: integer
retryStrategy:
  properties:
    backoff:
      properties:
        duration:
          type: string
        factor:
          anyOf:
          - type: integer
          - type: string
          x-kubernetes-int-or-string: true
        maxDuration:
          type: string
      type: object
    limit:
      anyOf:
      - type: integer
      - type: string
      x-kubernetes-int-or-string: true
    retryPolicy:
      type: string
  type: object
schedulerName:
  type: string
securityContext:
@@ -4020,8 +4042,6 @@ spec:
required:
- action
type: object
resubmitPendingPods:
type: boolean
retryStrategy:
properties:
backoff: