WIP: SHiELD-wrapper prognostic run #2350
Force-pushed from 2374ec2 to 57476cf.
The puzzling finding after today is that the mere act of loading an ML model into memory (here) leads to non-reproducible results. The notable thing is that a baseline simulation (i.e. one that does not use Python at all for loading in data) produces consistently reproducible results with the same base model configuration and prognostic run time loop. Tomorrow I will try to put together a minimal example that illustrates this with a simpler Python run file, which may make iteration a little easier and faster.
I was able to put together a minimal example, but on its own it was not much more illuminating than the full test cases. Given that answers were sensitive to ancillary runtime actions, my hypothesis was that there were possibly uninitialized variables in SHiELD. A way to test this is to compile with flags that force real, integer, and logical fields to be initialized to specific values. We did this in two ways. The first was setting uninitialized fields to values that we would expect to cause problems in the model, i.e. NaNs for reals, large values for integers, and true values for logicals:
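With GNU Fortran, this kind of forced initialization can be done with flags along these lines (a representative sketch; the exact flag values used in the actual builds are not shown in this thread):

```shell
# Poison uninitialized variables so that any use of them is likely to crash
# the model or visibly corrupt the answers: signaling NaNs for reals, a
# large magic number for integers, and .true. for logicals.
FFLAGS="${FFLAGS} -finit-real=snan -finit-integer=99999999 -finit-logical=true"
```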
This indeed caused the model to crash, suggesting that uninitialized values existed and were influencing the simulation with the configuration I am using. The second was setting uninitialized fields to the values we might implicitly assume they take, i.e. zeros for reals and integers, and false values for logicals:
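The benign variant corresponds to flags along these lines (again a sketch, not the exact build configuration):

```shell
# Benign initialization: zeros for reals and integers and .false. for
# logicals, i.e. the values one might implicitly assume uninitialized
# memory contains.
FFLAGS="${FFLAGS} -finit-real=zero -finit-integer=0 -finit-logical=false"
```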
With these compiler flags, all the prognostic run tests run reproducibly. The conclusion is therefore that uninitialized values in SHiELD are the culprit; the next step is tracking down where they are. This strategy was inspired by the debug-mode flags in the configure files of the fv3gfs-fortran repo; there, initializing with nonsense values does not crash the model, suggesting that uninitialized fields are not an issue in that code base. These flags are notably not used in debug mode in the SHiELD build system, so it is possible some uninitialized fields have crept in unnoticed. This issue may also have played a role in NOAA-GFDL/SHiELD_build#25.
It appears that the uninitialized values crop up only under a specific configuration. This provides a way forward for these tests, but it would still be good to get to the bottom of it. I'm not sure if this is just some flakiness introduced when running at C12 resolution with 63 vertical levels, or if something else is going on.
I've confirmed locally that switching the configuration as described above works.
I was bored while charging my car this morning and decided to work carefully through the code in question.
Great find, @spencerkclark. This could also tie into some other mysterious crashes we have seen. Will check the revised posting ASAP.
Thanks @lharris4! For anyone else curious, further discussion of the non-reproducibility issue will take place here: NOAA-GFDL/GFDL_atmos_cubed_sphere#301.
Force-pushed from 6876b61 to 3cf6ce7.
Otherwise mypy complains about duplicate modules named `tests`:

```
external/emulation/tests/__init__.py: error: Duplicate module named 'tests' (also at 'workflows/prognostic_c48_run/tests/__init__.py')
```

Also disable type checking on `mod.__version__` in `_importorskip`.
Force-pushed from 22e15b8 to e205826.
I initially implemented this by parametrizing the image, but I came to realize that this prevents using kustomize to select a specific image associated with a commit. To address this I have switched to using conditional workflows. The revised approach works, though it could do with some de-duplication of definitions.
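The conditional-workflow pattern might look roughly like this Argo steps fragment (step and template names here are illustrative, not the repo's exact definitions):

```yaml
# One of the two steps is skipped at runtime based on the inferred wrapper.
- - name: run-model-fv3gfs
    template: run-fv3gfs
    when: "'{{steps.infer-wrapper.outputs.result}}' == 'fv3gfs.wrapper'"
  - name: run-model-shield
    template: run-shield
    when: "'{{steps.infer-wrapper.outputs.result}}' == 'shield.wrapper'"
```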
Force-pushed from e205826 to e51fac0.
… config (#2362)

This PR is an initial piece broken out of #2350. It adds a `wrapper` attribute to the `UserConfig` class that is used to determine which Python wrapper to import within the `main.py` run file. As of this PR, the only supported option is the default `"fv3gfs.wrapper"`, but in the abstract this adds a hook for us to ultimately also support `"shield.wrapper"`.

Added public API:
- A `wrapper` can now be specified within a prognostic run `UserConfig`. The only currently supported option is the default of `"fv3gfs.wrapper"`. Because of this default, this PR is fully backwards compatible (no need to change already-defined configs).

Significant internal changes:
- The Python wrapper is now imported within the `main` function defined in `main.py` within the prognostic run instead of at the top of the module.
- The wrapper is initialized within the `main` function, and thus after all top-level imports. Initial tests (including with a 2x2 layout in the cloud) suggest the odd issue that motivated the previous approach may be resolved. Perhaps this is related to the fact that we no longer use OpenMPI in the prognostic run image (we use MPICH). This logic had originally been added in #706.
- The garbage collection step added in #1440 has also been removed, since it appears it is no longer needed.

- [x] Tests added
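The import hook described above can be sketched roughly as follows (a simplified illustration: the real `UserConfig` has many more fields, and `import_wrapper` is a hypothetical helper name used here only to show the lazy-import pattern):

```python
import importlib
from dataclasses import dataclass


@dataclass
class UserConfig:
    # Simplified stand-in for the prognostic run's UserConfig.
    # "fv3gfs.wrapper" is the backwards-compatible default, so
    # already-defined configs need no changes.
    wrapper: str = "fv3gfs.wrapper"


def import_wrapper(config: UserConfig):
    # Import the wrapper lazily (inside main(), in the actual PR) rather
    # than at module top level, so all top-level imports complete before
    # the wrapper is loaded and initialized.
    return importlib.import_module(config.wrapper)
```

Deferring the import this way is also what makes the wrapper swappable at runtime: the dotted module path comes from configuration rather than a hard-coded `import` statement.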
…#18)

In FV3GFS, I included the diagnostics-manager-controlled physics diagnostics within the same data structure as the non-diagnostics-manager-controlled physics diagnostics, while in SHiELD, I created a separate data structure for clarity. To enable getting diagnostics-manager-controlled diagnostics in the SHiELD-wrapper, some backend modifications are needed to work around this split. The frontend API remains identical, however.

In the fv3net prognostic run, [we happen to always get a diagnostics-manager-controlled diagnostic within the time loop](https://github.com/ai2cm/fv3net/blob/92c808d13748f165d9fdaa8c0c7714f82f024eea/workflows/prognostic_c48_run/runtime/loop.py#L413-L415). This is not strictly necessary (we *could* introduce model-specific logic in the time loop to work around this, as we did in ai2cm/fv3net#2350), but since this diagnostic does exist in SHiELD, we might as well enable getting it so that the fv3net prognostic run time loop (and testing logic) can remain as clean as possible. Note that this does require a minor upstream change in the SHiELD_physics repository to make some additional objects public: NOAA-GFDL/SHiELD_physics#33, which has now been merged.

### Backend changes
- Subroutines `get_metadata_diagnostics`, `get_diagnostic_2d`, and `get_diagnostic_3d` within `coupler_lib.F90` now take an additional boolean argument denoting whether one would like to get data from a `diag_manager_controlled` diagnostic or not.
- The dictionary returned by `_get_diagnostic_info` now uses tuple-valued keys `(module_name, name)` instead of integer keys, since the index no longer uniquely defines a diagnostic. `diag_manager_controlled` and `index` are now added as attributes to the `DiagnosticInfo` named tuple. Not only is this more general than what was implemented previously, it also makes for simpler querying of whether the requested diagnostic exists within `get_diagnostic_metadata_by_name`.
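The keying change can be sketched like so (a simplified illustration: the attributes follow the description above, but the class body and the diagnostic names are not the wrapper's exact code):

```python
from typing import Dict, NamedTuple, Tuple


class DiagnosticInfo(NamedTuple):
    # Simplified stand-in for the wrapper's DiagnosticInfo named tuple,
    # with the two newly added attributes included.
    module_name: str
    name: str
    diag_manager_controlled: bool
    index: int


def build_diagnostic_info() -> Dict[Tuple[str, str], DiagnosticInfo]:
    # Keys are (module_name, name) tuples rather than bare integer indices,
    # since an index alone no longer uniquely identifies a diagnostic once
    # two separate data structures are in play.  Diagnostic names here are
    # illustrative only.
    diagnostics = [
        DiagnosticInfo("gfs_phys", "totprcp_ave", True, 1),
        DiagnosticInfo("gfs_phys", "tsfc", False, 1),  # same index, other source
    ]
    return {(d.module_name, d.name): d for d in diagnostics}
```

With tuple keys, checking whether a requested diagnostic exists becomes a plain dictionary membership test, which is the simplification noted above for `get_diagnostic_metadata_by_name`.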
This PR adds a Dockerfile for building a `prognostic_run_shield` image, analogous to the `prognostic_run` image, but with SHiELD-wrapper instead of fv3gfs-wrapper installed. There are a few differences from the `prognostic_run` image:
- GNU 10 compilers are used instead of GNU 9, since the latest FMS / SHiELD require GNU 10+ compilers to build.
- SHiELD-wrapper (and its dependencies) are installed instead of fv3gfs-wrapper. In terms of compiled dependencies, this happens under the hood within the `COMPILE` script in the SHiELD_build repo.
- call_py_fort is omitted, since we have not made an effort to port the microphysics-emulation-specific call_py_fort hooks into SHiELD, which are somewhat orthogonal to the Python wrapper.

To make things easier to review, for now I have just skipped any tests that require running prognostic simulations. A preview of the changes required for doing that with SHiELD-wrapper can be found in #2350 (nothing major within the time loop; most changes are needed to parametrize / configure tests). In a followup PR I will add the changes needed to actually test `shield.wrapper` simulations in the same way we test `fv3gfs.wrapper`, which will depend on this PR as well as the changes in #2362.

Added public API:
- Added a `prognostic_run_shield` image, which will eventually be used for prognostic runs using SHiELD-wrapper.

Significant internal changes:
- Added logic for skipping specific regression tests if `fv3gfs.wrapper` is not installed in the image.
- Added an `authenticate` call to the `test_read_last_segment_gcs` test such that it can be run without relying on previous regression tests being run before it; thanks @frodre for help debugging this.

Requirement changes:
- Added a SHiELD-wrapper submodule, pointing to latest main.
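The skip-if-not-installed logic can be sketched as follows (modeled loosely on the `_importorskip` helper mentioned earlier in this thread; the exact signature in the repo is an assumption):

```python
import importlib
from typing import Optional, Tuple


def importorskip_helper(modname: str) -> Tuple[Optional[object], bool]:
    # Returns (module, installed). Tests that need the module can be
    # decorated to skip when the flag is False, so the same test suite
    # runs in both the prognostic_run and prognostic_run_shield images.
    try:
        module = importlib.import_module(modname)
        return module, True
    except ImportError:
        return None, False
```

A typical usage (hypothetical marker name) would be `requires_fv3gfs = pytest.mark.skipif(not has_fv3gfs, reason="fv3gfs.wrapper is not installed")` applied to the wrapper-specific regression tests.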
This PR builds on #2365 and makes the changes necessary to add `predictor` and `nudging` regression tests for SHiELD-wrapper-based prognostic runs. In so doing it refactors `test_regression.py` to parametrize over the wrapper type and moves the base fortran configs into YAML files in their own subdirectory. Note that unlike in #2350, no changes are required to the time loop of the prognostic run, thanks to ai2cm/SHiELD-wrapper#18, which has now been merged upstream. I updated the SHiELD-wrapper submodule accordingly in this PR.

Significant internal changes:
- Added a `wrapper` parameter to `file_configs_to_namelist_settings`, since one of the namelist parameters used to control the output frequency of physics diagnostics in SHiELD is different from that in FV3GFS.
- Moved the base fortran config for the FV3GFS prognostic run regression tests into a YAML file in a subdirectory alongside the base fortran config for the SHiELD prognostic run regression tests.

- [x] Tests added
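The wrapper-dependent namelist handling might look roughly like this (function name, namelist section, and key names here are all placeholders, not the repo's actual identifiers; only the branching-on-wrapper pattern is taken from the description above):

```python
def output_frequency_setting(wrapper: str, hours: int) -> dict:
    # Placeholder namelist key names: the real parameter controlling physics
    # diagnostics output frequency differs between FV3GFS and SHiELD, which
    # is why the wrapper type must be passed into the namelist helper.
    if wrapper == "fv3gfs.wrapper":
        return {"physics_output_nml": {"fv3gfs_frequency_key": hours}}
    elif wrapper == "shield.wrapper":
        return {"physics_output_nml": {"shield_frequency_key": hours}}
    raise ValueError(f"unknown wrapper: {wrapper}")
```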
This PR builds on #2376 and splits out from #2350 what is necessary to run SHiELD-wrapper-based prognostic simulations through our standard prognostic run argo workflow. No changes to the frontend API are needed; the prognostic run workflow is modified to infer which template (`run-fv3gfs` or `run-shield`) to run based on the input config. For convenience this also adds a starter base config for SHiELD, which is based on the configuration used in the PIRE simulations (but for simplicity with the mixed layer ocean turned off). I have tested the `prognostic-run` and `restart-prognostic-run` workflows using a SHiELD-based config offline. I'm not sure if we want to add an integration test yet or not.

Significant internal changes:
- Refactored the `prognostic-run` workflow to infer whether to use FV3GFS or SHiELD based on the config.
- Refactored the `restart-prognostic-run` workflow to infer whether to use FV3GFS or SHiELD based on the config at the provided URL.
- Refactored the directory structure of the base config YAMLs in `fv3kube` to better accommodate SHiELD configs. No user-facing changes to the FV3GFS configs are made.

Note this PR makes use of YAML anchors and aliases to reduce the amount of duplicate configuration code. Some illustration of how these work can be found [here](https://support.atlassian.com/bitbucket-cloud/docs/yaml-anchors/). Use of this concept was already introduced in #2103 within the `training.yaml` template, though this is the first time using it in the prognostic run.

-----

To illustrate the updated workflows I have included some example step outputs from `argo get` below (we ran the `prognostic-run` workflow for two segments and then ran one more segment via the `restart-prognostic-run` workflow).
### prognostic-run

```
STEP                                        TEMPLATE                                PODNAME                                        DURATION  MESSAGE
 ✔ 2023-12-21-baseline-shield-example       prognostic-run
 ├───✔ resolve-output-url                   resolve-output-url/resolve-output-url   2023-12-21-baseline-shield-example-681218977   3s
 ├───✔ convert-input-config-to-artifact     convert-input-config-to-artifact        2023-12-21-baseline-shield-example-117187868   3s
 ├───✔ infer-wrapper                        infer-wrapper                           2023-12-21-baseline-shield-example-3326365873  3s
 ├───○ prepare-config-fv3gfs                prepare-config-fv3gfs                                                                            when ''shield.wrapper' == 'fv3gfs.wrapper'' evaluated false
 ├───✔ prepare-config-shield                prepare-config-shield                   2023-12-21-baseline-shield-example-1872594915  3m
 ├───○ run-model-fv3gfs                     run-simulation/run-fv3gfs                                                                        when ''shield.wrapper' == 'fv3gfs.wrapper'' evaluated false
 ├───✔ run-model-shield                     run-simulation/run-shield
 │   ├─┬─✔ choose-node-pool                 choose-node-pool                        2023-12-21-baseline-shield-example-3672275837  4s
 │   │ └─✔ create-run                       create-run-shield                       2023-12-21-baseline-shield-example-1572199652  3m
 │   └───✔ run-first-segment                run-all-segments-shield
 │       ├───✔ append-segment               append-segment-shield                   2023-12-21-baseline-shield-example-2948602435  6m
 │       ├───✔ increment-segment            increment-count                         2023-12-21-baseline-shield-example-2887248493  5s
 │       └───✔ run-next-segment             run-all-segments-shield
 │           ├───✔ append-segment           append-segment-shield                   2023-12-21-baseline-shield-example-256387028   5m
 │           ├───✔ increment-segment        increment-count                         2023-12-21-baseline-shield-example-2030482824  3s
 │           └───○ run-next-segment         run-all-segments-shield                                                                          when '2 < 2' evaluated false
 ├───○ online-diags                         prognostic-run-diags/diagnostics                                                                 when 'false == true' evaluated false
 ├───○ online-diags-report                  prognostic-run-diags/report-single-run                                                           when 'false == true' evaluated false
 └───○ exit                                 exit                                                                                             when 'Skipped == Failed || Succeeded == Failed' evaluated false
```

### restart-prognostic-run

```
STEP                                            TEMPLATE                                 PODNAME                                               DURATION  MESSAGE
 ✔ 2023-12-21-restart-baseline-shield-example   restart-prognostic-run
 ├───✔ choose-node-pool                         run-simulation/choose-node-pool          2023-12-21-restart-baseline-shield-example-1554740548  3s
 ├───✔ infer-wrapper                            infer-wrapper                            2023-12-21-restart-baseline-shield-example-3018678792  4s
 ├───○ restart-run-fv3gfs                       run-simulation/run-all-segments                                                                           when ''shield.wrapper' == 'fv3gfs.wrapper'' evaluated false
 └───✔ restart-run-shield                       run-simulation/run-all-segments-shield
     ├───✔ append-segment                       append-segment-shield                    2023-12-21-restart-baseline-shield-example-26558483    5m
     ├───✔ increment-segment                    increment-count                          2023-12-21-restart-baseline-shield-example-4132074205  4s
     └───○ run-next-segment                     run-all-segments-shield                                                                                   when '1 < 1' evaluated false
```
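The YAML anchors and aliases mentioned above work roughly like this (a minimal made-up example, not one of the actual fv3kube base configs):

```yaml
fv3gfs_base: &shared_settings   # anchor: name this mapping for reuse
  layout: [1, 1]
  diagnostics_frequency: 3h

shield_base:
  <<: *shared_settings          # alias + merge key: pull in the shared block
  wrapper: shield.wrapper       # then add or override model-specific settings
```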
This is a work-in-progress PR adding a SHiELD-wrapper-based prognostic run. I will add more details as I clean it up and flesh it out. Prognostic run tests run locally in the `prognostic_run_shield` image, but I have not hooked them into CI just yet. Some debugging still needs to be done to get all the tests passing reproducibly.