New Feature: provide failFast flag, allow a DAG to run all branches of the DAG (either success or failure) #1442

xianlubird · 2019-06-20T11:32:09Z

I have a workflow yaml like this.

dag test case

apiVersion: argoproj.io/v1alpha1
kind: Workflow
metadata:
  generateName: dag-primay-branch-
spec:
  entrypoint: statis
  templates:
  - name: a
    container:
      image:  docker/whalesay:latest
      command: [cowsay]
      args: ["hello world"]
  - name: b
    retryStrategy:
      limit: 2
    container:
      image: alpine:latest
      command: [sh, -c]
      args: ["sleep 30; echo haha"]
  - name: c
    retryStrategy:
      limit: 3
    container:
      image: alpine:latest
      command: [sh, -c]
      args: ["echo intentional failure; exit 2"]
  - name: d
    container:
      image: docker/whalesay:latest
      command: [cowsay]
      args: ["hello world"]
  - name: statis
    dag:
      tasks:
      - name: A
        template: a
      - name: B
        dependencies: [A]
        template: b
      - name: C
        dependencies: [A]
        template: c
      - name: D
        dependencies: [B]
        template: d
      - name: E
        dependencies: [D]
        template: d

The dependencies are as follows :

step1：      A
          /     \
step2：  B       C
         |
step3：  D
         |
step4：  E

When C node is failed, the workflow will stop at B and won't process on. The output like this:

Name:                dag-primay-branch-b2l5l
Namespace:           default
ServiceAccount:      default
Status:              Failed
Created:             Thu Jun 20 19:21:39 +0800 (6 minutes ago)
Started:             Thu Jun 20 19:21:39 +0800 (6 minutes ago)
Finished:            Thu Jun 20 19:22:17 +0800 (6 minutes ago)
Duration:            38 seconds

STEP                        PODNAME                             DURATION  MESSAGE
 ✖ dag-primay-branch-b2l5l
 ├-✔ A                      dag-primay-branch-b2l5l-430930318   3s
 ├-✔ B(0)                   dag-primay-branch-b2l5l-1687888246  33s
 └-✖ C                                                                    No more retries left
   ├-✖ C(0)                 dag-primay-branch-b2l5l-2802686203  5s        failed with exit code 2
   ├-✖ C(1)                 dag-primay-branch-b2l5l-2333059966  4s        failed with exit code 2
   ├-✖ C(2)                 dag-primay-branch-b2l5l-3004311821  4s        failed with exit code 2
   └-✖ C(3)                 dag-primay-branch-b2l5l-118811616   3s        failed with exit code 2

But If I remove B template retryStrategy label the yaml like this

new test case

apiVersion: argoproj.io/v1alpha1
kind: Workflow
metadata:
  generateName: dag-primay-branch-
spec:
  entrypoint: statis
  templates:
  - name: a
    container:
      image:  docker/whalesay:latest
      command: [cowsay]
      args: ["hello world"]
  - name: b
    container:
      image: alpine:latest
      command: [sh, -c]
      args: ["sleep 30; echo haha"]
  - name: c
    retryStrategy:
      limit: 3
    container:
      image: alpine:latest
      command: [sh, -c]
      args: ["echo intentional failure; exit 2"]
  - name: d
    container:
      image: docker/whalesay:latest
      command: [cowsay]
      args: ["hello world"]
  - name: statis
    dag:
      tasks:
      - name: A
        template: a
      - name: B
        dependencies: [A]
        template: b
      - name: C
        dependencies: [A]
        template: c
      - name: D
        dependencies: [B]
        template: d
      - name: E
        dependencies: [D]
        template: d

The D E step will go on when C is failed. The output like this

Name:                dag-primay-branch-ksbcv
Namespace:           default
ServiceAccount:      default
Status:              Failed
Created:             Thu Jun 20 19:30:08 +0800 (48 seconds ago)
Started:             Thu Jun 20 19:30:08 +0800 (48 seconds ago)
Finished:            Thu Jun 20 19:30:56 +0800 (now)
Duration:            48 seconds

STEP                        PODNAME                             DURATION  MESSAGE
 ✖ dag-primay-branch-ksbcv
 ├-✔ A                      dag-primay-branch-ksbcv-4005668674  3s
 ├-✔ B                      dag-primay-branch-ksbcv-3988891055  35s
 ├-✖ C                                                                    No more retries left
 | ├-✖ C(0)                 dag-primay-branch-ksbcv-1785421399  3s        failed with exit code 2
 | ├-✖ C(1)                 dag-primay-branch-ksbcv-2389562778  3s        failed with exit code 2
 | ├-✖ C(2)                 dag-primay-branch-ksbcv-1718605113  3s        failed with exit code 2
 | └-✖ C(3)                 dag-primay-branch-ksbcv-2322746492  4s        failed with exit code 2
 ├-✔ D                      dag-primay-branch-ksbcv-3955335817  3s
 └-✔ E                      dag-primay-branch-ksbcv-3938558198  4s

This seems relate to retryStrategy on parent node on B node.

Could help me to solve this problem? Thank you @sarabala1979 @jessesuen

The text was updated successfully, but these errors were encountered:

jessesuen · 2019-06-25T00:21:06Z

@xianlubird this behavior is actually the intended behavior. Essentially, the DAG logic has a built-in "fail fast" feature to stop scheduling new steps, as soon as it detects that one of the DAG nodes is failed. Then it waits until all DAG nodes are completed before failing the DAG itself.

The reason that you see difference in behavior affected by the retryStrategy of your workflow, is timing related, more than than the retryStrategy.

I believe what you are trying to achieve, is to allow a DAG to run all branches of the DAG to completion (either success or failure), regardless of the failed outcomes of branches in the DAG. For this, we need a new feature to disable the "fail fast" behavior, which would be a new feature.

We need a new flag for this, because the fail fast behavior of DAGs is desirable behavior for many use cases, and we do not want to break backwards compatibility.

saiyidi · 2020-05-07T14:57:54Z

Hi Argo experts, I wish to add cron schedule to a child job.
Like- A runs at 1 AM and B runs at 2 AM and then C runs as a dependant job of A and B at say 4 AM.
A (1AM) B (2AM)
\ OR /
C (4 AM)

Would highly appreciate any help on this

xianlubird mentioned this issue Jun 21, 2019

New Feature: provide failFast flag, allow a DAG to run all branches of the DAG (either success or failure) #1443

Merged

xianlubird changed the title ~~BUG: dag will missing some nodes when another branch node fails.~~ New Feature: provide failFast flag, allow a DAG to run all branches of the DAG (either success or failure) Jun 25, 2019

xianlubird closed this as completed Jul 1, 2019

epa095 mentioned this issue Dec 10, 2019

Set workflow failfast equinor/gordo#782

Merged

saiyidi mentioned this issue May 7, 2020

Hi Argo experts, I wish to add cron worflow schedule to a child job. argoproj/argo-events#645

Closed

agilgur5 added area/templates/dag type/feature Feature request labels Oct 1, 2023

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

New Feature: provide failFast flag, allow a DAG to run all branches of the DAG (either success or failure) #1442

New Feature: provide failFast flag, allow a DAG to run all branches of the DAG (either success or failure) #1442

xianlubird commented Jun 20, 2019 •

edited

Loading

jessesuen commented Jun 25, 2019 •

edited

Loading

saiyidi commented May 7, 2020

New Feature: provide failFast flag, allow a DAG to run all branches of the DAG (either success or failure) #1442

New Feature: provide failFast flag, allow a DAG to run all branches of the DAG (either success or failure) #1442

Comments

xianlubird commented Jun 20, 2019 • edited Loading

jessesuen commented Jun 25, 2019 • edited Loading

saiyidi commented May 7, 2020

xianlubird commented Jun 20, 2019 •

edited

Loading

jessesuen commented Jun 25, 2019 •

edited

Loading