WIP: SHiELD-wrapper prognostic run #2350
Force-pushed from 2374ec2 to 57476cf.
The puzzling finding after today is that the mere act of loading an ML model into memory (here) leads to non-reproducible results. The notable thing is that a baseline simulation (i.e. one that does not use Python at all for loading in data) produces consistently reproducible results with the same base model configuration and prognostic run time loop. Tomorrow I will try to put together a minimal example that illustrates this with a simpler Python run file, which may make iteration a little easier and faster.
I was able to put together a minimal example, but on its own it was not much more illuminating than the full test cases. Given that answers were sensitive to ancillary runtime actions, my hypothesis was that there were possibly uninitialized variables in SHiELD. A way to test this is to compile with flags that force real, integer, and logical fields to be initialized to specific values. We did this in two ways. The first was setting uninitialized fields to values that we would expect to cause problems in the model, i.e. NaNs for reals, large values for integers, and true values for logicals:
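With GNU Fortran, this kind of forced initialization can be done with flags along these lines (a representative sketch; the exact flag values used in the actual builds are not shown in this thread):

```shell
# Poison uninitialized variables so that any use of them is likely to crash
# the model or visibly corrupt the answers: signaling NaNs for reals, a
# large magic number for integers, and .true. for logicals.
FFLAGS="${FFLAGS} -finit-real=snan -finit-integer=99999999 -finit-logical=true"
```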
This indeed caused the model to crash, suggesting that uninitialized values existed and were influencing the simulation with the configuration I am using. The second was setting uninitialized fields to the values we might implicitly assume they take, i.e. zeros for reals and integers, and false values for logicals:
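The benign variant corresponds to flags along these lines (again a sketch, not the exact build configuration):

```shell
# Benign initialization: zeros for reals and integers and .false. for
# logicals, i.e. the values one might implicitly assume uninitialized
# memory contains.
FFLAGS="${FFLAGS} -finit-real=zero -finit-integer=0 -finit-logical=false"
```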
With these compiler flags, all the prognostic run tests run reproducibly. The conclusion is therefore that uninitialized values in SHiELD are the culprit; the next step is tracking down where they are. This strategy was inspired by the debug-mode flags in the configure files of the fv3gfs-fortran repo; there, initializing with nonsense values does not crash the model, suggesting that uninitialized fields are not an issue in that code base. These flags are notably not used in debug mode in the SHiELD build system, so it is possible some uninitialized fields have crept in unnoticed. This issue may also have played a role in NOAA-GFDL/SHiELD_build#25.
It appears that the uninitialized values crop up only under a specific configuration. This provides a way forward for these tests, but it would still be good to get to the bottom of it. I'm not sure if this is just some flakiness introduced when running at C12 resolution with 63 vertical levels, or if something else is going on.
I've confirmed locally that switching the configuration as described above works.
I was bored while charging my car this morning and decided to work carefully through the code in question.
Great find, @spencerkclark. This could also tie into some other mysterious crashes we have seen. Will check the revised posting ASAP.
Thanks @lharris4! For anyone else curious, further discussion of the non-reproducibility issue will take place here: NOAA-GFDL/GFDL_atmos_cubed_sphere#301.
Force-pushed from 6876b61 to 3cf6ce7.
Otherwise mypy complains about duplicate modules named `tests`:

```
external/emulation/tests/__init__.py: error: Duplicate module named 'tests' (also at 'workflows/prognostic_c48_run/tests/__init__.py')
```

Also disable type checking on `mod.__version__` in `_importorskip`.
Force-pushed from 22e15b8 to e205826.
I initially implemented this by parametrizing the image, but I came to realize that this prevents using kustomize to select a specific image associated with a commit. To address this I have switched to using conditional workflows. The revised approach works, though it could do with some de-duplication of definitions.
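The conditional-workflow pattern might look roughly like this Argo steps fragment (step and template names here are illustrative, not the repo's exact definitions):

```yaml
# One of the two steps is skipped at runtime based on the inferred wrapper.
- - name: run-model-fv3gfs
    template: run-fv3gfs
    when: "'{{steps.infer-wrapper.outputs.result}}' == 'fv3gfs.wrapper'"
  - name: run-model-shield
    template: run-shield
    when: "'{{steps.infer-wrapper.outputs.result}}' == 'shield.wrapper'"
```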
Force-pushed from e205826 to e51fac0.
… config (#2362)

This PR is an initial piece broken out of #2350. It adds a `wrapper` attribute to the `UserConfig` class that is used to determine which Python wrapper to import within the `main.py` run file. As of this PR, the only supported option is the default `"fv3gfs.wrapper"`, but in the abstract this adds a hook for us to ultimately also support `"shield.wrapper"`.

Added public API:
- A `wrapper` can now be specified within a prognostic run `UserConfig`. The only currently supported option is the default of `"fv3gfs.wrapper"`. Because of this default, this PR is fully backwards compatible (no need to change already-defined configs).

Significant internal changes:
- The Python wrapper is now imported within the `main` function defined in `main.py` within the prognostic run instead of at the top of the module.
- The wrapper is initialized within the `main` function, and thus after all top-level imports. Initial tests (including with a 2x2 layout in the cloud) suggest the odd issue that motivated the previous approach may be resolved. Perhaps this is related to the fact that we no longer use OpenMPI in the prognostic run image (we use MPICH). This logic had originally been added in #706.
- The garbage collection step added in #1440 has also been removed, since it appears it is no longer needed.

- [x] Tests added
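The import hook described above can be sketched roughly as follows (a simplified illustration: the real `UserConfig` has many more fields, and `import_wrapper` is a hypothetical helper name used here only to show the lazy-import pattern):

```python
import importlib
from dataclasses import dataclass


@dataclass
class UserConfig:
    # Simplified stand-in for the prognostic run's UserConfig.
    # "fv3gfs.wrapper" is the backwards-compatible default, so
    # already-defined configs need no changes.
    wrapper: str = "fv3gfs.wrapper"


def import_wrapper(config: UserConfig):
    # Import the wrapper lazily (inside main(), in the actual PR) rather
    # than at module top level, so all top-level imports complete before
    # the wrapper is loaded and initialized.
    return importlib.import_module(config.wrapper)
```

Deferring the import this way is also what makes the wrapper swappable at runtime: the dotted module path comes from configuration rather than a hard-coded `import` statement.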
…#18)

In FV3GFS, I included the diagnostics-manager-controlled physics diagnostics within the same data structure as the non-diagnostics-manager-controlled physics diagnostics, while in SHiELD, I created a separate data structure for clarity. To enable getting diagnostics-manager-controlled diagnostics in the SHiELD-wrapper, some backend modifications are needed to work around this split. The frontend API remains identical, however.

In the fv3net prognostic run, [we happen to always get a diagnostics-manager-controlled diagnostic within the time loop](https://github.com/ai2cm/fv3net/blob/92c808d13748f165d9fdaa8c0c7714f82f024eea/workflows/prognostic_c48_run/runtime/loop.py#L413-L415). This is not strictly necessary (we *could* introduce model-specific logic in the time loop to work around this, as we did in ai2cm/fv3net#2350), but since this diagnostic does exist in SHiELD, we might as well enable getting it so that the fv3net prognostic run time loop (and testing logic) can remain as clean as possible. Note that this does require a minor upstream change in the SHiELD_physics repository to make some additional objects public: NOAA-GFDL/SHiELD_physics#33, which has now been merged.

### Backend changes
- Subroutines `get_metadata_diagnostics`, `get_diagnostic_2d`, and `get_diagnostic_3d` within `coupler_lib.F90` now take an additional boolean argument denoting whether one would like to get data from a `diag_manager_controlled` diagnostic or not.
- The dictionary returned by `_get_diagnostic_info` now uses tuple-valued keys `(module_name, name)` instead of integer keys, since the index no longer uniquely defines a diagnostic. `diag_manager_controlled` and `index` are now added as attributes to the `DiagnosticInfo` named tuple. Not only is this more general than what was implemented previously, it also makes for simpler querying of whether the requested diagnostic exists within `get_diagnostic_metadata_by_name`.
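The keying change can be sketched like so (a simplified illustration: the attributes follow the description above, but the class body and the diagnostic names are not the wrapper's exact code):

```python
from typing import Dict, NamedTuple, Tuple


class DiagnosticInfo(NamedTuple):
    # Simplified stand-in for the wrapper's DiagnosticInfo named tuple,
    # with the two newly added attributes included.
    module_name: str
    name: str
    diag_manager_controlled: bool
    index: int


def build_diagnostic_info() -> Dict[Tuple[str, str], DiagnosticInfo]:
    # Keys are (module_name, name) tuples rather than bare integer indices,
    # since an index alone no longer uniquely identifies a diagnostic once
    # two separate data structures are in play.  Diagnostic names here are
    # illustrative only.
    diagnostics = [
        DiagnosticInfo("gfs_phys", "totprcp_ave", True, 1),
        DiagnosticInfo("gfs_phys", "tsfc", False, 1),  # same index, other source
    ]
    return {(d.module_name, d.name): d for d in diagnostics}
```

With tuple keys, checking whether a requested diagnostic exists becomes a plain dictionary membership test, which is the simplification noted above for `get_diagnostic_metadata_by_name`.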
This PR adds a Dockerfile for building a `prognostic_run_shield` image, analogous to the `prognostic_run` image, but with SHiELD-wrapper instead of fv3gfs-wrapper installed. There are a few differences from the `prognostic_run` image:
- GNU 10 compilers are used instead of GNU 9, since the latest FMS / SHiELD require GNU 10+ compilers to build.
- SHiELD-wrapper (and its dependencies) are installed instead of fv3gfs-wrapper. In terms of compiled dependencies, this happens under the hood within the `COMPILE` script in the SHiELD_build repo.
- call_py_fort is omitted, since we have not made an effort to port the microphysics-emulation-specific call_py_fort hooks into SHiELD, which are somewhat orthogonal to the Python wrapper.

To make things easier to review, for now I have just skipped any tests that require running prognostic simulations. A preview of the changes required for doing that with SHiELD-wrapper can be found in #2350 (nothing major within the time loop; most changes are needed to parametrize / configure tests). In a followup PR I will add the changes needed to actually test `shield.wrapper` simulations in the same way we test `fv3gfs.wrapper`, which will depend on this PR as well as the changes in #2362.

Added public API:
- Added a `prognostic_run_shield` image, which will eventually be used for prognostic runs using SHiELD-wrapper.

Significant internal changes:
- Added logic for skipping specific regression tests if `fv3gfs.wrapper` is not installed in the image.
- Added an `authenticate` call to the `test_read_last_segment_gcs` test such that it can be run without relying on previous regression tests being run before it; thanks @frodre for help debugging this.

Requirement changes:
- Added a SHiELD-wrapper submodule, pointing to latest main.
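The skip-if-not-installed logic can be sketched as follows (modeled loosely on the `_importorskip` helper mentioned earlier in this thread; the exact signature in the repo is an assumption):

```python
import importlib
from typing import Optional, Tuple


def importorskip_helper(modname: str) -> Tuple[Optional[object], bool]:
    # Returns (module, installed). Tests that need the module can be
    # decorated to skip when the flag is False, so the same test suite
    # runs in both the prognostic_run and prognostic_run_shield images.
    try:
        module = importlib.import_module(modname)
        return module, True
    except ImportError:
        return None, False
```

A typical usage (hypothetical marker name) would be `requires_fv3gfs = pytest.mark.skipif(not has_fv3gfs, reason="fv3gfs.wrapper is not installed")` applied to the wrapper-specific regression tests.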
This PR builds on #2365 and makes the changes necessary to add `predictor` and `nudging` regression tests for SHiELD-wrapper-based prognostic runs. In so doing it refactors `test_regression.py` to parametrize over the wrapper type and moves the base fortran configs into YAML files in their own subdirectory. Note that unlike in #2350, no changes are required to the time loop of the prognostic run, thanks to ai2cm/SHiELD-wrapper#18, which has now been merged upstream. I updated the SHiELD-wrapper submodule accordingly in this PR.

Significant internal changes:
- Added a `wrapper` parameter to `file_configs_to_namelist_settings`, since one of the namelist parameters used to control the output frequency of physics diagnostics in SHiELD is different from that in FV3GFS.
- Moved the base fortran config for the FV3GFS prognostic run regression tests into a YAML file in a subdirectory alongside the base fortran config for the SHiELD prognostic run regression tests.

- [x] Tests added
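The wrapper-dependent namelist handling might look roughly like this (function name, namelist section, and key names here are all placeholders, not the repo's actual identifiers; only the branching-on-wrapper pattern is taken from the description above):

```python
def output_frequency_setting(wrapper: str, hours: int) -> dict:
    # Placeholder namelist key names: the real parameter controlling physics
    # diagnostics output frequency differs between FV3GFS and SHiELD, which
    # is why the wrapper type must be passed into the namelist helper.
    if wrapper == "fv3gfs.wrapper":
        return {"physics_output_nml": {"fv3gfs_frequency_key": hours}}
    elif wrapper == "shield.wrapper":
        return {"physics_output_nml": {"shield_frequency_key": hours}}
    raise ValueError(f"unknown wrapper: {wrapper}")
```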
This PR builds on #2376 and splits out from #2350 what is necessary to run SHiELD-wrapper-based prognostic simulations through our standard prognostic run argo workflow. No changes to the frontend API are needed; the prognostic run workflow is modified to infer which template (`run-fv3gfs` or `run-shield`) to run based on the input config. For convenience this also adds a starter base config for SHiELD, which is based on the configuration used in the PIRE simulations (but for simplicity with the mixed layer ocean turned off). I have tested the `prognostic-run` and `restart-prognostic-run` workflows using a SHiELD-based config offline. I'm not sure if we want to add an integration test yet or not.

Significant internal changes:
- Refactored the `prognostic-run` workflow to infer whether to use FV3GFS or SHiELD based on the config.
- Refactored the `restart-prognostic-run` workflow to infer whether to use FV3GFS or SHiELD based on the config at the provided URL.
- Refactored the directory structure of the base config YAMLs in `fv3kube` to better accommodate SHiELD configs. No user-facing changes to the FV3GFS configs are made.

Note this PR makes use of YAML anchors and aliases to reduce the amount of duplicate configuration code. Some illustration of how these work can be found [here](https://support.atlassian.com/bitbucket-cloud/docs/yaml-anchors/). Use of this concept was already introduced in #2103 within the `training.yaml` template, though this is the first time using it in the prognostic run.

-----

To illustrate the updated workflows I have included some example step outputs from `argo get` below (we ran the `prognostic-run` workflow for two segments and then ran one more segment via the `restart-prognostic-run` workflow).
### prognostic-run

```
STEP                                        TEMPLATE                                PODNAME                                        DURATION  MESSAGE
 ✔ 2023-12-21-baseline-shield-example       prognostic-run
 ├───✔ resolve-output-url                   resolve-output-url/resolve-output-url   2023-12-21-baseline-shield-example-681218977   3s
 ├───✔ convert-input-config-to-artifact     convert-input-config-to-artifact        2023-12-21-baseline-shield-example-117187868   3s
 ├───✔ infer-wrapper                        infer-wrapper                           2023-12-21-baseline-shield-example-3326365873  3s
 ├───○ prepare-config-fv3gfs                prepare-config-fv3gfs                                                                            when ''shield.wrapper' == 'fv3gfs.wrapper'' evaluated false
 ├───✔ prepare-config-shield                prepare-config-shield                   2023-12-21-baseline-shield-example-1872594915  3m
 ├───○ run-model-fv3gfs                     run-simulation/run-fv3gfs                                                                        when ''shield.wrapper' == 'fv3gfs.wrapper'' evaluated false
 ├───✔ run-model-shield                     run-simulation/run-shield
 │   ├─┬─✔ choose-node-pool                 choose-node-pool                        2023-12-21-baseline-shield-example-3672275837  4s
 │   │ └─✔ create-run                       create-run-shield                       2023-12-21-baseline-shield-example-1572199652  3m
 │   └───✔ run-first-segment                run-all-segments-shield
 │       ├───✔ append-segment               append-segment-shield                   2023-12-21-baseline-shield-example-2948602435  6m
 │       ├───✔ increment-segment            increment-count                         2023-12-21-baseline-shield-example-2887248493  5s
 │       └───✔ run-next-segment             run-all-segments-shield
 │           ├───✔ append-segment           append-segment-shield                   2023-12-21-baseline-shield-example-256387028   5m
 │           ├───✔ increment-segment        increment-count                         2023-12-21-baseline-shield-example-2030482824  3s
 │           └───○ run-next-segment         run-all-segments-shield                                                                          when '2 < 2' evaluated false
 ├───○ online-diags                         prognostic-run-diags/diagnostics                                                                 when 'false == true' evaluated false
 ├───○ online-diags-report                  prognostic-run-diags/report-single-run                                                           when 'false == true' evaluated false
 └───○ exit                                 exit                                                                                             when 'Skipped == Failed || Succeeded == Failed' evaluated false
```

### restart-prognostic-run

```
STEP                                            TEMPLATE                                 PODNAME                                               DURATION  MESSAGE
 ✔ 2023-12-21-restart-baseline-shield-example   restart-prognostic-run
 ├───✔ choose-node-pool                         run-simulation/choose-node-pool          2023-12-21-restart-baseline-shield-example-1554740548  3s
 ├───✔ infer-wrapper                            infer-wrapper                            2023-12-21-restart-baseline-shield-example-3018678792  4s
 ├───○ restart-run-fv3gfs                       run-simulation/run-all-segments                                                                           when ''shield.wrapper' == 'fv3gfs.wrapper'' evaluated false
 └───✔ restart-run-shield                       run-simulation/run-all-segments-shield
     ├───✔ append-segment                       append-segment-shield                    2023-12-21-restart-baseline-shield-example-26558483    5m
     ├───✔ increment-segment                    increment-count                          2023-12-21-restart-baseline-shield-example-4132074205  4s
     └───○ run-next-segment                     run-all-segments-shield                                                                                   when '1 < 1' evaluated false
```
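The YAML anchors and aliases mentioned above work roughly like this (a minimal made-up example, not one of the actual fv3kube base configs):

```yaml
fv3gfs_base: &shared_settings   # anchor: name this mapping for reuse
  layout: [1, 1]
  diagnostics_frequency: 3h

shield_base:
  <<: *shared_settings          # alias + merge key: pull in the shared block
  wrapper: shield.wrapper       # then add or override model-specific settings
```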
This is a work-in-progress PR adding a SHiELD-wrapper-based prognostic run. I will add more details as I clean it up and flesh it out. Prognostic run tests run locally in the `prognostic_run_shield` image, but I have not hooked them into CI just yet. Some debugging still needs to be done to get all the tests passing reproducibly.