
WIP: SHiELD-wrapper prognostic run #2350

Closed · wants to merge 8 commits

Conversation

@spencerkclark (Member) commented Oct 17, 2023

This is a work-in-progress PR adding a SHiELD-wrapper-based prognostic run. I will add more details as I clean it up and flesh it out. The prognostic run tests run locally in the prognostic_run_shield image, but I have not hooked them into CI just yet. Some debugging still needs to be done to get all the tests passing reproducibly.

@spencerkclark (Member Author) commented:

> Some debugging still needs to be done to get all the tests passing reproducibly.

The puzzling finding after today is that the mere act of loading an ML model into memory (here) leads to non-reproducible results in the predictor test case with the SHiELD wrapper (here we have made sure to comment out everything related to the model that happens downstream of loading it, i.e. making predictions and setting state, so those do not appear to be the issue). I'm assuming something similar is going on in the nudging case with loading data from reference netCDF files, though I have not had a chance to verify that.

The notable thing is that a baseline simulation (i.e. one that does not use Python at all for loading in data) produces consistently reproducible results with the same base model configuration and prognostic run time loop.

Tomorrow I will try to work on putting together a minimal example that illustrates this with a simpler Python run file, which may make iteration a little easier and faster.

@spencerkclark (Member Author) commented Oct 20, 2023

I was able to put together a minimal example, but on its own it was not much more illuminating than the full test cases. Given that answers were sensitive to ancillary runtime actions, my hypothesis was that there might be uninitialized variables in SHiELD. One way to test this is to compile with flags that force real, integer, and logical fields to be initialized with specific values.

We did this in two ways. The first was setting uninitialized fields to values that we would expect to cause problems in the model, i.e. NaNs for reals, large values for integers, and True for logicals:

FFLAGS += -finit-real=nan -finit-integer=9999999 -finit-logical=true

This indeed caused the model to crash, suggesting that uninitialized values existed and were influencing the simulation with the configuration I am using.

The second was setting uninitialized fields to the values one might implicitly assume they take, i.e. zeros for reals and integers, and False for logicals:

FFLAGS += -finit-real=zero -finit-integer=0 -finit-logical=false

With these compiler flags, all the prognostic run tests run in a reproducible fashion. The conclusion is therefore that uninitialized values in SHiELD are the culprit. The next step is tracking down where those uninitialized values are.

This strategy was inspired by the debug mode flags in the configure files of the fv3gfs-fortran repo; there initializing with nonsense values does not crash the model, suggesting that uninitialized fields are not an issue in that code base. These flags are notably not used in debug mode in the SHiELD build system, so it is possible some uninitialized fields have entered unnoticed. It is possible this issue was also playing a role in NOAA-GFDL/SHiELD_build#25.

@spencerkclark (Member Author) commented:

It appears that the uninitialized values crop up specifically when fv_core_nml.kord_wz = -9. If I switch this to fv_core_nml.kord_wz = 9, then simulations complete even when I initialize all real fields with NaNs. The sign of kord_wz controls whether the mode is set to -2 or -3 in the cs_profile subroutine when remapping the vertical velocity (see here); apparently the -3 mode is the problematic one.

This provides a way forward for these tests, but it would still be good to get to the bottom of the issue. I'm not sure if this is just some flakiness introduced when running at C12 resolution with 63 vertical levels, or if there is possibly something else going on. The motivation for using fv_core_nml.kord_wz = -9 is that it was the value used in the PIRE simulations, which I was hoping to codify in test and reference configurations here in fv3net.
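
For concreteness, here is a minimal sketch of flipping this parameter in an existing namelist using the f90nml package; the file path and the use of f90nml are illustrative, not how the fv3net test configurations are actually defined:

```python
import f90nml

# Read the model namelist (the path is illustrative).
namelist = f90nml.read("input.nml")

# The PIRE simulations used kord_wz = -9; flipping the sign selects the
# cs_profile mode that does not hit the uninitialized values.
namelist["fv_core_nml"]["kord_wz"] = 9

# Write the patched namelist back out, overwriting the original file.
namelist.write("input.nml", force=True)
```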

@spencerkclark (Member Author) commented:

I've confirmed locally that switching to fv_core_nml.kord_wz = 9 allows the prognostic run tests with the SHiELD wrapper to pass reproducibly with the PIRE-like (modulo this one parameter) configuration. I will proceed with cleaning up this PR next week, and dig into the fv_core_nml.kord_wz issue on the side, since it seems unrelated to any code in fv3net or SHiELD-wrapper.

@spencerkclark (Member Author) commented:

I was bored while charging my car this morning and decided to carefully work through the code in cs_profile. The issue is that when iv == -3, the array gam is not assigned a value for the lowermost model level (see here), prior to being used here. I've confirmed this is the source of the issue with some print debugging, and then by manually setting gam(:,km) to an arbitrary value and re-running a test case when compiling with -finit-real=nan; the test case completes when it would otherwise crash. I will post a more descriptive issue in the NOAA-GFDL/GFDL_atmos_cubed_sphere repository on Monday.

@lharris4 commented:

Great find, @spencerkclark. This could also tie into some other mysterious crashes we have seen. Will check the revised posting ASAP.

@spencerkclark (Member Author) commented:

Thanks @lharris4!

For anyone else curious, further discussion of the non-reproducibility issue will take place here: NOAA-GFDL/GFDL_atmos_cubed_sphere#301.

spencerkclark force-pushed the SHiELD-prognostic-run branch 4 times, most recently from 6876b61 to 3cf6ce7 on October 24, 2023
spencerkclark force-pushed the SHiELD-prognostic-run branch 6 times, most recently from 22e15b8 to e205826 on October 28, 2023
I initially implemented this by parametrizing the image, but I've come to
realize that this prevents using kustomize to select a specific image
associated with a commit.  To address this I have switched to using conditional
workflows.  While this revised approach works, it could do with some
de-duplication of definitions.
spencerkclark added a commit that referenced this pull request Nov 8, 2023
… config (#2362)

This PR is an initial piece broken out of #2350. It adds a `wrapper`
attribute to the `UserConfig` class that is used to determine which
Python wrapper to import within the `main.py` run file. As of this PR,
the only supported option is the default `"fv3gfs.wrapper"`, but in the
abstract this adds a hook for us to ultimately also support
`"shield.wrapper"`.

Added public API:
- A `wrapper` can now be specified within a prognostic run `UserConfig`.
The only currently supported option is the default of
`"fv3gfs.wrapper"`. Because of this default, this PR is fully backwards
compatible (no need to change already-defined configs).

Significant internal changes:
- The Python wrapper is now imported within the `main` function defined
in `main.py` within the prognostic run instead of the top of the module.
- The wrapper is initialized within the `main` function, and thus after
all top-level imports. Initial tests (including with a 2x2 layout in the
cloud) suggest the odd issue that motivated the previous approach may be
resolved. Perhaps this is related to the fact that we no longer use
OpenMPI in the prognostic run image (we use MPICH). This logic had
originally been added in #706.
- The garbage collection step added in
#1440 has also been removed, since
it appears it is no longer needed.

- [x] Tests added

spencerkclark added a commit that referenced this pull request Dec 2, 2023
spencerkclark added a commit to ai2cm/SHiELD-wrapper that referenced this pull request Dec 6, 2023
…#18)

In FV3GFS, I included the diagnostics-manager-controlled physics diagnostics in the same data structure as the non-diagnostics-manager-controlled physics diagnostics, while in SHiELD I created a separate data structure for clarity. To enable getting diagnostics-manager-controlled diagnostics in the SHiELD-wrapper, some backend modifications are needed to work around this split. The frontend API remains identical, however.

In the fv3net prognostic run, [we happen to always get a diagnostics manager controlled diagnostic within the time loop](https://github.com/ai2cm/fv3net/blob/92c808d13748f165d9fdaa8c0c7714f82f024eea/workflows/prognostic_c48_run/runtime/loop.py#L413-L415).  This is not strictly necessary—we *could* introduce model specific logic in the time loop to work around this, as we did in ai2cm/fv3net#2350—but since this diagnostic does exist in SHiELD, we might as well enable getting it so that the fv3net prognostic run time loop (and testing logic) can remain as clean as possible.

Note that this does require a minor upstream change in the SHiELD_physics repository to make some additional objects public: NOAA-GFDL/SHiELD_physics#33, which has now been merged.

### Backend changes

- Subroutines `get_metadata_diagnostics`, `get_diagnostic_2d`, and `get_diagnostic_3d` within `coupler_lib.F90` now take an additional boolean argument denoting whether one would like to get data from a `diag_manager_controlled` diagnostic or not.
- The dictionary returned by `_get_diagnostic_info` now uses tuple-valued keys `(module_name, name)` instead of integer keys, since the index no longer uniquely defines a diagnostic.  `diag_manager_controlled` and `index` are now added as attributes to the `DiagnosticInfo` named tuple.  Not only is this more general than what was implemented previously, it also makes for simpler querying of whether the requested diagnostic exists within `get_diagnostic_metadata_by_name`.
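
A rough Python sketch of the revised lookup structure described above (fields beyond those mentioned are omitted, and the function body is an approximation, not the exact SHiELD-wrapper code):

```python
from typing import Dict, NamedTuple, Tuple


class DiagnosticInfo(NamedTuple):
    module_name: str
    name: str
    # Newly added attributes: which backend data structure the diagnostic
    # lives in, and its index within that structure.
    diag_manager_controlled: bool
    index: int


# Keys are (module_name, name) tuples rather than bare integer indices,
# since an index alone no longer uniquely identifies a diagnostic.
DiagnosticTable = Dict[Tuple[str, str], DiagnosticInfo]


def get_diagnostic_metadata_by_name(
    table: DiagnosticTable, module_name: str, name: str
) -> DiagnosticInfo:
    # Tuple-valued keys make the existence check a plain dictionary lookup.
    key = (module_name, name)
    if key not in table:
        raise KeyError(f"Diagnostic {name!r} from module {module_name!r} not found")
    return table[key]
```
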
spencerkclark added a commit that referenced this pull request Dec 6, 2023
This PR adds a Dockerfile for building a `prognostic_run_shield` image,
analogous to the `prognostic_run` image, but with SHiELD-wrapper instead
of fv3gfs-wrapper installed. There are a few differences from the
`prognostic_run` image:

- GNU 10 compilers are used instead of GNU 9, since latest FMS / SHiELD
require GNU 10+ compilers to build.
- SHiELD-wrapper (and its dependencies) are installed instead of
fv3gfs-wrapper. In terms of compiled dependencies, this happens under
the hood within the `COMPILE` script in the SHiELD_build repo.
- call_py_fort is omitted, since we have not made an effort to port the
microphysics-emulation-specific call_py_fort hooks into SHiELD, which
are somewhat orthogonal to the Python wrapper.

To make things easier to review, for now I have just skipped any tests
that require running prognostic simulations. A preview of the changes
required for doing that with SHiELD-wrapper can be found in #2350
(nothing major within the time loop; most changes are needed to
parametrize / configure tests). In a followup PR I will add the changes
needed to actually test `shield.wrapper` simulations in the same way we
test `fv3gfs.wrapper`, which will depend on this PR as well as the
changes in #2362.

Added public API:
- Added a `prognostic_run_shield` image, which will eventually be used
for prognostic runs using SHiELD-wrapper.

Significant internal changes:
- Added logic for skipping specific regression tests if `fv3gfs.wrapper`
is not installed in the image (see the sketch at the end of this message).
- Added an `authenticate` call to the `test_read_last_segment_gcs` test
such that it can be run without relying on previous regression tests
being run before it; thanks @frodre for help debugging this.

Requirement changes:
- Added a SHiELD-wrapper submodule, pointing to latest main.
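
A minimal sketch of the kind of skip logic described above (the marker name is illustrative, not the exact fv3net test code):

```python
import pytest

try:
    import fv3gfs.wrapper  # noqa: F401

    HAS_FV3GFS_WRAPPER = True
except ImportError:
    HAS_FV3GFS_WRAPPER = False

# Skip FV3GFS-specific regression tests when fv3gfs.wrapper is not installed,
# e.g. when running inside the prognostic_run_shield image.
requires_fv3gfs_wrapper = pytest.mark.skipif(
    not HAS_FV3GFS_WRAPPER,
    reason="fv3gfs.wrapper is not installed in this image",
)


@requires_fv3gfs_wrapper
def test_fv3gfs_only_regression_case():
    ...
```
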
spencerkclark added a commit that referenced this pull request Dec 8, 2023
This PR builds on #2365 and makes the changes necessary to add
`predictor` and `nudging` regression tests for SHiELD-wrapper-based
prognostic runs. In so doing it refactors `test_regression.py` to
parametrize over the wrapper type and moves the base fortran configs
into YAML files in their own subdirectory.
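
As a rough sketch of what parametrizing over the wrapper looks like (names and paths here are illustrative, not the exact contents of `test_regression.py`):

```python
import pytest

# Base fortran configs now live in YAML files in their own subdirectory,
# one per wrapper (paths are illustrative).
BASE_FORTRAN_CONFIGS = {
    "fv3gfs.wrapper": "base_fortran_configs/fv3gfs.yaml",
    "shield.wrapper": "base_fortran_configs/shield.yaml",
}


@pytest.mark.parametrize("wrapper", sorted(BASE_FORTRAN_CONFIGS))
def test_prognostic_run_regression(wrapper):
    config_path = BASE_FORTRAN_CONFIGS[wrapper]
    # ... build the run configuration from config_path and run the
    # predictor/nudging regression case for the given wrapper ...
```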

Note that unlike in #2350 no changes are required to the time loop of
the prognostic run, thanks to ai2cm/SHiELD-wrapper#18, which has now
been merged upstream. I updated the SHiELD-wrapper submodule accordingly
in this PR.

Significant internal changes:
- Added a `wrapper` parameter to `file_configs_to_namelist_settings`,
since one of the namelist parameters used to control the output
frequency of physics diagnostics in SHiELD is different than that in
FV3GFS.
- Moved the base fortran config for the FV3GFS prognostic run regression
tests into a YAML file in a subdirectory alongside the base fortran
config for the SHiELD prognostic run regression tests.

- [x] Tests added
spencerkclark added a commit that referenced this pull request Dec 8, 2023
@spencerkclark (Member Author) commented:

This PR has been split / cleaned up into #2365, #2376, and #2377. I think it is safe to close at this point.

spencerkclark deleted the SHiELD-prognostic-run branch on December 8, 2023
spencerkclark added a commit that referenced this pull request Dec 21, 2023
This PR builds on #2376 and splits out from #2350 what is necessary to
run SHiELD-wrapper-based prognostic simulations through our standard
prognostic run argo workflow. No changes to the frontend API are needed;
the prognostic run workflow is modified to infer which template
(`run-fv3gfs` or `run-shield`) to run based on the input config.
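
A rough sketch of that inference (the function names are illustrative; the actual workflow implements this as an `infer-wrapper` step rather than Python called directly):

```python
import yaml


def infer_wrapper(config_text: str) -> str:
    """Return the wrapper named in a prognostic run config, defaulting to FV3GFS."""
    config = yaml.safe_load(config_text)
    return config.get("wrapper", "fv3gfs.wrapper")


def choose_run_template(config_text: str) -> str:
    # The workflow selects run-shield for SHiELD-wrapper configs and
    # run-fv3gfs otherwise.
    if infer_wrapper(config_text) == "shield.wrapper":
        return "run-shield"
    return "run-fv3gfs"
```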

For convenience this also adds a starter base config for SHiELD, which
is based on the configuration used in the PIRE simulations (but for
simplicity with the mixed layer ocean turned off). I have tested the
`prognostic-run` and `restart-prognostic-run` workflows using a
SHiELD-based config offline. I'm not sure if we want to add an
integration test yet or not.

Significant internal changes:
- Refactored the `prognostic-run` workflow to infer whether to use
FV3GFS or SHiELD based on the config.
- Refactored the `restart-prognostic-run` workflow to infer whether to
use FV3GFS or SHiELD based on the config at the provided URL.
- Refactored the directory structure of the base config YAMLs in
`fv3kube` to better accommodate SHiELD configs. No user-facing changes
to the FV3GFS configs are made.

Note this PR makes use of YAML anchors and aliases to reduce the amount
of duplicate configuration code. Some illustration of how these work can
be found
[here](https://support.atlassian.com/bitbucket-cloud/docs/yaml-anchors/).
Use of this concept was already introduced in
#2103 within the `training.yaml`
template, though this is the first time using it in the prognostic run.
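
For anyone unfamiliar with anchors and aliases, here is a small self-contained illustration of the concept (the keys in this document are made up for the example and are not from the actual templates):

```python
import yaml

# "&base" defines an anchor; "<<: *base" merges that mapping into another
# mapping, so shared settings only need to be written once.
document = """
base_segment: &base
  segment_length: 3h
  layout: [2, 2]

fv3gfs_segment:
  <<: *base
  wrapper: fv3gfs.wrapper

shield_segment:
  <<: *base
  wrapper: shield.wrapper
"""

config = yaml.safe_load(document)
assert config["shield_segment"]["segment_length"] == "3h"
assert config["shield_segment"]["wrapper"] == "shield.wrapper"
```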

-----

To illustrate the updated workflows I have included some example step
outputs from `argo get` below (we ran the `prognostic-run` workflow for
two segments and then ran one more segment via the
`restart-prognostic-run` workflow).

### prognostic-run

```
STEP                                     TEMPLATE                                PODNAME                                        DURATION  MESSAGE
 ✔ 2023-12-21-baseline-shield-example    prognostic-run
 ├───✔ resolve-output-url                resolve-output-url/resolve-output-url   2023-12-21-baseline-shield-example-681218977   3s
 ├───✔ convert-input-config-to-artifact  convert-input-config-to-artifact        2023-12-21-baseline-shield-example-117187868   3s
 ├───✔ infer-wrapper                     infer-wrapper                           2023-12-21-baseline-shield-example-3326365873  3s
 ├───○ prepare-config-fv3gfs             prepare-config-fv3gfs                                                                            when ''shield.wrapper' == 'fv3gfs.wrapper'' evaluated false
 ├───✔ prepare-config-shield             prepare-config-shield                   2023-12-21-baseline-shield-example-1872594915  3m
 ├───○ run-model-fv3gfs                  run-simulation/run-fv3gfs                                                                        when ''shield.wrapper' == 'fv3gfs.wrapper'' evaluated false
 ├───✔ run-model-shield                  run-simulation/run-shield
 │   ├─┬─✔ choose-node-pool              choose-node-pool                        2023-12-21-baseline-shield-example-3672275837  4s
 │   │ └─✔ create-run                    create-run-shield                       2023-12-21-baseline-shield-example-1572199652  3m
 │   └───✔ run-first-segment             run-all-segments-shield
 │       ├───✔ append-segment            append-segment-shield                   2023-12-21-baseline-shield-example-2948602435  6m
 │       ├───✔ increment-segment         increment-count                         2023-12-21-baseline-shield-example-2887248493  5s
 │       └───✔ run-next-segment          run-all-segments-shield
 │           ├───✔ append-segment        append-segment-shield                   2023-12-21-baseline-shield-example-256387028   5m
 │           ├───✔ increment-segment     increment-count                         2023-12-21-baseline-shield-example-2030482824  3s
 │           └───○ run-next-segment      run-all-segments-shield                                                                          when '2 < 2' evaluated false
 ├───○ online-diags                      prognostic-run-diags/diagnostics                                                                 when 'false == true' evaluated false
 ├───○ online-diags-report               prognostic-run-diags/report-single-run                                                           when 'false == true' evaluated false
 └───○ exit                              exit                                                                                             when 'Skipped == Failed || Succeeded == Failed' evaluated false
```

### restart-prognostic-run

```
STEP                                           TEMPLATE                                PODNAME                                                DURATION  MESSAGE
 ✔ 2023-12-21-restart-baseline-shield-example  restart-prognostic-run
 ├───✔ choose-node-pool                        run-simulation/choose-node-pool         2023-12-21-restart-baseline-shield-example-1554740548  3s
 ├───✔ infer-wrapper                           infer-wrapper                           2023-12-21-restart-baseline-shield-example-3018678792  4s
 ├───○ restart-run-fv3gfs                      run-simulation/run-all-segments                                                                          when ''shield.wrapper' == 'fv3gfs.wrapper'' evaluated false
 └───✔ restart-run-shield                      run-simulation/run-all-segments-shield
     ├───✔ append-segment                      append-segment-shield                   2023-12-21-restart-baseline-shield-example-26558483    5m
     ├───✔ increment-segment                   increment-count                         2023-12-21-restart-baseline-shield-example-4132074205  4s
     └───○ run-next-segment                    run-all-segments-shield                                                                                  when '1 < 1' evaluated false
```