
Enable SHiELD runs in argo workflows #2377

Merged · 9 commits · Dec 21, 2023
Conversation

spencerkclark (Member) commented Dec 2, 2023

This PR builds on #2376 and splits out from #2350 what is necessary to run SHiELD-wrapper-based prognostic simulations through our standard prognostic run argo workflow. No changes to the frontend API are needed; the prognostic run workflow is modified to infer which template (run-fv3gfs or run-shield) to run based on the input config.

For convenience this also adds a starter base config for SHiELD, based on the configuration used in the PIRE simulations (but, for simplicity, with the mixed layer ocean turned off). I have tested the prognostic-run and restart-prognostic-run workflows offline using a SHiELD-based config. I'm not sure whether we want to add an integration test yet.

Significant internal changes:

  • Refactored the prognostic-run workflow to infer whether to use FV3GFS or SHiELD based on the config.
  • Refactored the restart-prognostic-run workflow to infer whether to use FV3GFS or SHiELD based on the config at the provided URL.
  • Refactored the directory structure of the base config YAMLs in fv3kube to better accommodate SHiELD configs. No user-facing changes to the FV3GFS configs are made.

Note this PR makes use of YAML anchors and aliases to reduce the amount of duplicate configuration code. Some illustration of how these work can be found here. Use of this concept was already introduced in #2103 within the training.yaml template, though this is the first time using it in the prognostic run.
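For readers unfamiliar with the syntax, here is a minimal standalone illustration of anchors and aliases; the key names below are invented for illustration and are not taken from the fv3net templates:

```yaml
# An anchor (&base_resources) names a node for reuse.
base_resources: &base_resources
  cpu: 6
  memory: 8Gi

# An alias (*base_resources) references it; the merge key (<<)
# splices the mapping in, with local keys overriding.
run_example:
  resources:
    <<: *base_resources
    memory: 12Gi
```

This lets shared configuration be written once and referenced from multiple places in the same YAML document.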


To illustrate the updated workflows I have included some example step outputs from argo get below (we ran the prognostic-run workflow for two segments and then ran one more segment via the restart-prognostic-run workflow).

prognostic-run

STEP                                     TEMPLATE                                PODNAME                                        DURATION  MESSAGE
 ✔ 2023-12-21-baseline-shield-example    prognostic-run
 ├───✔ resolve-output-url                resolve-output-url/resolve-output-url   2023-12-21-baseline-shield-example-681218977   3s
 ├───✔ convert-input-config-to-artifact  convert-input-config-to-artifact        2023-12-21-baseline-shield-example-117187868   3s
 ├───✔ infer-wrapper                     infer-wrapper                           2023-12-21-baseline-shield-example-3326365873  3s
 ├───○ prepare-config-fv3gfs             prepare-config-fv3gfs                                                                            when ''shield.wrapper' == 'fv3gfs.wrapper'' evaluated false
 ├───✔ prepare-config-shield             prepare-config-shield                   2023-12-21-baseline-shield-example-1872594915  3m
 ├───○ run-model-fv3gfs                  run-simulation/run-fv3gfs                                                                        when ''shield.wrapper' == 'fv3gfs.wrapper'' evaluated false
 ├───✔ run-model-shield                  run-simulation/run-shield
 │   ├─┬─✔ choose-node-pool              choose-node-pool                        2023-12-21-baseline-shield-example-3672275837  4s
 │   │ └─✔ create-run                    create-run-shield                       2023-12-21-baseline-shield-example-1572199652  3m
 │   └───✔ run-first-segment             run-all-segments-shield
 │       ├───✔ append-segment            append-segment-shield                   2023-12-21-baseline-shield-example-2948602435  6m
 │       ├───✔ increment-segment         increment-count                         2023-12-21-baseline-shield-example-2887248493  5s
 │       └───✔ run-next-segment          run-all-segments-shield
 │           ├───✔ append-segment        append-segment-shield                   2023-12-21-baseline-shield-example-256387028   5m
 │           ├───✔ increment-segment     increment-count                         2023-12-21-baseline-shield-example-2030482824  3s
 │           └───○ run-next-segment      run-all-segments-shield                                                                          when '2 < 2' evaluated false
 ├───○ online-diags                      prognostic-run-diags/diagnostics                                                                 when 'false == true' evaluated false
 ├───○ online-diags-report               prognostic-run-diags/report-single-run                                                           when 'false == true' evaluated false
 └───○ exit                              exit                                                                                             when 'Skipped == Failed || Succeeded == Failed' evaluated false

restart-prognostic-run

STEP                                           TEMPLATE                                PODNAME                                                DURATION  MESSAGE
 ✔ 2023-12-21-restart-baseline-shield-example  restart-prognostic-run
 ├───✔ choose-node-pool                        run-simulation/choose-node-pool         2023-12-21-restart-baseline-shield-example-1554740548  3s
 ├───✔ infer-wrapper                           infer-wrapper                           2023-12-21-restart-baseline-shield-example-3018678792  4s
 ├───○ restart-run-fv3gfs                      run-simulation/run-all-segments                                                                          when ''shield.wrapper' == 'fv3gfs.wrapper'' evaluated false
 └───✔ restart-run-shield                      run-simulation/run-all-segments-shield
     ├───✔ append-segment                      append-segment-shield                   2023-12-21-restart-baseline-shield-example-26558483    5m
     ├───✔ increment-segment                   increment-count                         2023-12-21-restart-baseline-shield-example-4132074205  4s
     └───○ run-next-segment                    run-all-segments-shield                                                                                  when '1 < 1' evaluated false

@spencerkclark force-pushed the shield-argo-integration branch 4 times, most recently from ba681f3 to e743580 (December 5, 2023)
@spencerkclark marked this pull request as ready for review December 5, 2023
@spencerkclark force-pushed the SHiELD-wrapper-regression-tests branch 2 times, most recently from 1ba95a9 to bad6d92 (December 6, 2023)
Base automatically changed from SHiELD-wrapper-regression-tests to master December 8, 2023
This references forcing data in the vcm-fv3config bucket, and makes parameters
controlling whether we use data from initial conditions or the climatology
consistent with our v0.7 FV3GFS base config.
@spencerkclark spencerkclark changed the title Enable running SHiELD-wrapper-based prognostic runs through argo Enable SHiELD runs in argo workflows Dec 8, 2023
This bumps SHiELD-wrapper to include a Q-flux bug fix, and a couple other user
experience improvements with the SOM.  Note this fix has not been merged to
SHiELD yet, so I will update this PR later once we can point to main branches
of SHiELD-wrapper and SHiELD_physics.
oliverwm1 (Contributor) left a comment:
A bit tough to review, but I think it looks good!

I would support moving the DGLBACKEND env var definition to the Dockerfile.

(Review comment on workflows/argo/prognostic-run.yaml: outdated, resolved)
spencerkclark (Member, Author) commented:
Thanks @oliverwm1! I went ahead and set DGLBACKEND=pytorch in the Dockerfile, and bumped the SHiELD-wrapper version now that ai2cm/SHiELD-wrapper#23 has been merged. I also added some example step outputs from argo get for a prognostic-run and restart-prognostic-run workflow to illustrate the updated steps. Enabling auto-merge now.

@spencerkclark spencerkclark enabled auto-merge (squash) December 21, 2023 22:01
@spencerkclark spencerkclark merged commit b8e2f83 into master Dec 21, 2023
13 of 14 checks passed
@spencerkclark spencerkclark deleted the shield-argo-integration branch December 21, 2023 22:26
spencerkclark added a commit that referenced this pull request Jan 5, 2024
#2377 reorganized the `base_yamls` directory in fv3kube to make room for
SHiELD reference configurations, but neglected to update the
`package_data` parameter in `setup.py` accordingly. Without this change,
installing `fv3kube` via something like:

```
$ pip install "git+https://github.com/ai2cm/fv3net.git@b8e2f83b5206539724a4d096d0433ceeb3bc805a#egg=fv3kube&subdirectory=external/fv3kube"
```

does not include the `base_yamls` files, which are an important
component of the library. I've tested this locally and it fixes the
issue, e.g. use:

```
$ pip install "git+https://github.com/ai2cm/fv3net.git@8fe01cd6c49ea635a1b07afd4ee4615db7555ce6#egg=fv3kube&subdirectory=external/fv3kube"
```

and check for the existence of the `base_yamls` directory (note the
different SHA from the original example).
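The fix described above amounts to updating `package_data` to match the new directory layout. A hypothetical sketch of what such a `setup.py` fragment could look like follows; the exact glob patterns and nesting depth are assumptions, not the actual contents of external/fv3kube/setup.py:

```python
from setuptools import setup, find_packages

setup(
    name="fv3kube",
    packages=find_packages(),
    # Ship the reorganized base-config YAMLs with pip installs.
    # Assumption: configs now live one directory deeper, e.g.
    # fv3kube/base_yamls/<model>/<version>/*.yaml.
    package_data={
        "fv3kube": [
            "base_yamls/*/*.yaml",
            "base_yamls/*/*/*.yaml",
        ]
    },
)
```

Without matching `package_data` globs, setuptools silently omits non-Python files from wheels and sdists, which is exactly the failure mode the commit message describes.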