
Enable SHiELD runs in argo workflows #2377

Merged · 9 commits · Dec 21, 2023
Conversation

spencerkclark (Member) commented Dec 2, 2023

This PR builds on #2376 and splits out from #2350 what is necessary to run SHiELD-wrapper-based prognostic simulations through our standard prognostic run argo workflow. No changes to the frontend API are needed; the prognostic run workflow is modified to infer which template (run-fv3gfs or run-shield) to run based on the input config.

For convenience this also adds a starter base config for SHiELD, based on the configuration used in the PIRE simulations (but, for simplicity, with the mixed layer ocean turned off). I have tested the prognostic-run and restart-prognostic-run workflows offline using a SHiELD-based config. I'm not sure whether we want to add an integration test yet.

Significant internal changes:

  • Refactored the prognostic-run workflow to infer whether to use FV3GFS or SHiELD based on the config.
  • Refactored the restart-prognostic-run workflow to infer whether to use FV3GFS or SHiELD based on the config at the provided URL.
  • Refactored the directory structure of the base config YAMLs in fv3kube to better accommodate SHiELD configs. No user-facing changes to the FV3GFS configs are made.

Note this PR makes use of YAML anchors and aliases to reduce the amount of duplicate configuration code. Some illustration of how these work can be found here. Use of this concept was already introduced in #2103 within the training.yaml template, though this is the first time using it in the prognostic run.
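For readers unfamiliar with the syntax, here is a minimal standalone illustration of anchors and aliases; the key names below are invented for illustration and are not taken from the fv3net templates:

```yaml
# An anchor (&base_resources) names a node for reuse.
base_resources: &base_resources
  cpu: 6
  memory: 8Gi

# An alias (*base_resources) references it; the merge key (<<)
# splices the mapping in, with local keys overriding.
run_example:
  resources:
    <<: *base_resources
    memory: 12Gi
```

This lets shared configuration be written once and referenced from multiple places in the same YAML document.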


To illustrate the updated workflows I have included some example step outputs from argo get below (we ran the prognostic-run workflow for two segments and then ran one more segment via the restart-prognostic-run workflow).

prognostic-run

STEP                                     TEMPLATE                                PODNAME                                        DURATION  MESSAGE
 ✔ 2023-12-21-baseline-shield-example    prognostic-run
 ├───✔ resolve-output-url                resolve-output-url/resolve-output-url   2023-12-21-baseline-shield-example-681218977   3s
 ├───✔ convert-input-config-to-artifact  convert-input-config-to-artifact        2023-12-21-baseline-shield-example-117187868   3s
 ├───✔ infer-wrapper                     infer-wrapper                           2023-12-21-baseline-shield-example-3326365873  3s
 ├───○ prepare-config-fv3gfs             prepare-config-fv3gfs                                                                            when ''shield.wrapper' == 'fv3gfs.wrapper'' evaluated false
 ├───✔ prepare-config-shield             prepare-config-shield                   2023-12-21-baseline-shield-example-1872594915  3m
 ├───○ run-model-fv3gfs                  run-simulation/run-fv3gfs                                                                        when ''shield.wrapper' == 'fv3gfs.wrapper'' evaluated false
 ├───✔ run-model-shield                  run-simulation/run-shield
 │   ├─┬─✔ choose-node-pool              choose-node-pool                        2023-12-21-baseline-shield-example-3672275837  4s
 │   │ └─✔ create-run                    create-run-shield                       2023-12-21-baseline-shield-example-1572199652  3m
 │   └───✔ run-first-segment             run-all-segments-shield
 │       ├───✔ append-segment            append-segment-shield                   2023-12-21-baseline-shield-example-2948602435  6m
 │       ├───✔ increment-segment         increment-count                         2023-12-21-baseline-shield-example-2887248493  5s
 │       └───✔ run-next-segment          run-all-segments-shield
 │           ├───✔ append-segment        append-segment-shield                   2023-12-21-baseline-shield-example-256387028   5m
 │           ├───✔ increment-segment     increment-count                         2023-12-21-baseline-shield-example-2030482824  3s
 │           └───○ run-next-segment      run-all-segments-shield                                                                          when '2 < 2' evaluated false
 ├───○ online-diags                      prognostic-run-diags/diagnostics                                                                 when 'false == true' evaluated false
 ├───○ online-diags-report               prognostic-run-diags/report-single-run                                                           when 'false == true' evaluated false
 └───○ exit                              exit                                                                                             when 'Skipped == Failed || Succeeded == Failed' evaluated false

restart-prognostic-run

STEP                                           TEMPLATE                                PODNAME                                                DURATION  MESSAGE
 ✔ 2023-12-21-restart-baseline-shield-example  restart-prognostic-run
 ├───✔ choose-node-pool                        run-simulation/choose-node-pool         2023-12-21-restart-baseline-shield-example-1554740548  3s
 ├───✔ infer-wrapper                           infer-wrapper                           2023-12-21-restart-baseline-shield-example-3018678792  4s
 ├───○ restart-run-fv3gfs                      run-simulation/run-all-segments                                                                          when ''shield.wrapper' == 'fv3gfs.wrapper'' evaluated false
 └───✔ restart-run-shield                      run-simulation/run-all-segments-shield
     ├───✔ append-segment                      append-segment-shield                   2023-12-21-restart-baseline-shield-example-26558483    5m
     ├───✔ increment-segment                   increment-count                         2023-12-21-restart-baseline-shield-example-4132074205  4s
     └───○ run-next-segment                    run-all-segments-shield                                                                                  when '1 < 1' evaluated false

@spencerkclark force-pushed the shield-argo-integration branch 4 times, most recently from ba681f3 to e743580 (December 5, 2023)
@spencerkclark marked this pull request as ready for review December 5, 2023
@spencerkclark force-pushed the SHiELD-wrapper-regression-tests branch 2 times, most recently from 1ba95a9 to bad6d92 (December 6, 2023)
Base automatically changed from SHiELD-wrapper-regression-tests to master December 8, 2023
This references forcing data in the vcm-fv3config bucket, and makes parameters
controlling whether we use data from initial conditions or the climatology
consistent with our v0.7 FV3GFS base config.
@spencerkclark spencerkclark changed the title Enable running SHiELD-wrapper-based prognostic runs through argo Enable SHiELD runs in argo workflows Dec 8, 2023
This bumps SHiELD-wrapper to include a Q-flux bug fix, and a couple other user
experience improvements with the SOM.  Note this fix has not been merged to
SHiELD yet, so I will update this PR later once we can point to main branches
of SHiELD-wrapper and SHiELD_physics.
oliverwm1 (Contributor) left a comment:
A bit tough to review, but I think it looks good!

I would support moving the DGLBACKEND env var definition to the Dockerfile.

(Review comment on workflows/argo/prognostic-run.yaml: outdated, resolved)
spencerkclark (Member, Author) commented:
Thanks @oliverwm1! I went ahead and set DGLBACKEND=pytorch in the Dockerfile, and bumped the SHiELD-wrapper version now that ai2cm/SHiELD-wrapper#23 has been merged. I also added some example step outputs from argo get for a prognostic-run and restart-prognostic-run workflow to illustrate the updated steps. Enabling auto-merge now.

@spencerkclark spencerkclark enabled auto-merge (squash) December 21, 2023 22:01
@spencerkclark spencerkclark merged commit b8e2f83 into master Dec 21, 2023
13 of 14 checks passed
@spencerkclark spencerkclark deleted the shield-argo-integration branch December 21, 2023 22:26
spencerkclark added a commit that referenced this pull request Jan 5, 2024
#2377 reorganized the `base_yamls` directory in fv3kube to make room for
SHiELD reference configurations, but neglected to update the
`package_data` parameter in `setup.py` accordingly. Without this change,
installing `fv3kube` via something like:

```
$ pip install "git+https://github.com/ai2cm/fv3net.git@b8e2f83b5206539724a4d096d0433ceeb3bc805a#egg=fv3kube&subdirectory=external/fv3kube"
```

does not include the `base_yamls` files, which are an important
component of the library. I've tested this locally and it fixes the
issue, e.g. use:

```
$ pip install "git+https://github.com/ai2cm/fv3net.git@8fe01cd6c49ea635a1b07afd4ee4615db7555ce6#egg=fv3kube&subdirectory=external/fv3kube"
```

and check for the existence of the `base_yamls` directory (note the
different SHA from the original example).
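The fix described above amounts to updating `package_data` to match the new directory layout. A hypothetical sketch of what such a `setup.py` fragment could look like follows; the exact glob patterns and nesting depth are assumptions, not the actual contents of external/fv3kube/setup.py:

```python
from setuptools import setup, find_packages

setup(
    name="fv3kube",
    packages=find_packages(),
    # Ship the reorganized base-config YAMLs with pip installs.
    # Assumption: configs now live one directory deeper, e.g.
    # fv3kube/base_yamls/<model>/<version>/*.yaml.
    package_data={
        "fv3kube": [
            "base_yamls/*/*.yaml",
            "base_yamls/*/*/*.yaml",
        ]
    },
)
```

Without matching `package_data` globs, setuptools silently omits non-Python files from wheels and sdists, which is exactly the failure mode the commit message describes.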