Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[develop] Adding support for FV3_RAP CCPP suite #811

Merged

Conversation

mkavulich
Copy link
Collaborator

DESCRIPTION OF CHANGES:

The release committee has decided on including FV3_RAP as a supported CCPP suite in the SRW for the next release. This PR introduces the necessary changes to the workflow, documentation, and tests to add FV3_RAP as a supported suite in the app.

Type of change

  • New feature (non-breaking change which adds functionality)
  • This change requires a documentation update

TESTS CONDUCTED:

Three tests are updated to use the new FV3_RAP suite:

  • grid_RRFS_CONUS_13km_ics_FV3GFS_lbcs_FV3GFS_suite_GFS_v15p2 --> grid_RRFS_CONUS_13km_ics_FV3GFS_lbcs_FV3GFS_suite_RAP
    • In addition, stochastic physics switches are added for this test. I ensured small perturbations existed in the initial output compared to the non-SPP runs as expected.
  • grid_RRFS_CONUS_25km_ics_FV3GFS_lbcs_RAP_suite_HRRR --> grid_RRFS_CONUS_25km_ics_FV3GFS_lbcs_RAP_suite_RAP
  • grid_RRFS_NA_13km_ics_FV3GFS_lbcs_FV3GFS_suite_RRFS_v1beta --> grid_RRFS_NA_13km_ics_FV3GFS_lbcs_FV3GFS_suite_RAP

Ran fundamental test suite on Hera, and comprehensive suite on Jet. All tests (including the above changes) passed except for three:

  • MET_verification_only_vx
  • grid_RRFS_AK_3km_ics_FV3GFS_lbcs_FV3GFS_suite_HRRR
  • GST_release_public_v1

The first failure is due to missing data; this is documented in #808. The latter two are due to the updated weather model hash: this may require further changes to either namelist or config files. See this discussion on PR 799 for more info. For now I have added the "DO_NOT_MERGE" label until these failures can be resolved, but note that this is unrelated to the FV3_RAP suite.

  • hera.intel
    • fundamental test suite
  • jet.intel
    • comprehensive tests

DEPENDENCIES:

DOCUMENTATION:

Updated users guide to include references to new supported suite

ISSUE:

CHECKLIST

  • My code follows the style guidelines in the Contributor's Guide
  • I have performed a self-review of my own code using the Code Reviewer's Guide
  • I have commented my code, particularly in hard-to-understand areas
  • My changes need updates to the documentation. I have made corresponding changes to the documentation
  • My changes generate no new warnings
  • New and existing tests pass with my changes
  • Any dependent changes have been merged and published

Copy link
Collaborator

@MichaelLueken MichaelLueken left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@mkavulich The changes look good to me!

I ran the coverage tests on Orion and all tests successfully passed. Coverage tests have also been submitted on Cheyenne GNU (wanted to make sure that at least one of the new RAP tests was run as part of my testing). All tests, with the exception of nco_grid_RRFS_CONUS_3km_ics_FV3GFS_lbcs_FV3GFS_suite_GFS_v15_thompson_mynn_lam3km, passed on Cheyenne GNU as well. The grid_RRFS_CONUS_13km_ics_FV3GFS_lbcs_FV3GFS_suite_RAP test was also ran on Cheyenne GNU and it successfully passed.

Approving this PR.

@mkavulich
Copy link
Collaborator Author

mkavulich commented Jun 6, 2023

Thanks for the approval @MichaelLueken. Once #799 is merged I will rebase this branch on the latest changes from develop and resolve the resulting conflicts.

@MichaelLueken
Copy link
Collaborator

@mkavulich PR #799 hasn't been merged yet. I'm now investigating a new failure of the grid_RRFS_CONUS_25km_ics_FV3GFS_lbcs_FV3GFS_suite_GFS_v17_p8_plot test on Cheyenne GNU. It wasn't an issue previously, but now the plot_allvars step is failing with:

ERROR: TopologyException: side location conflict at -97.608377902731846 34.009777791998118
INFO: Self-intersection at or near point -97.878776042255282 34.00918460186481

None of the change in PR #799 would affect this test, but that PR is the only one where this test is currently failing now.

Copy link
Collaborator

@EdwardSnyder-NOAA EdwardSnyder-NOAA left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ran the new RAP suite WE2E tests on Jet and they all passed. Code checks out as well. Approving.

@MichaelLueken
Copy link
Collaborator

@mkavulich PR #799 has been successfully merged. Please rebase this branch on the latest changes from develop and resolve the resulting conflicts. Once complete, I will run the automated Jenkins tests and then move forward with merging this work. Thanks!

@mkavulich
Copy link
Collaborator Author

@MichaelLueken Thanks for the heads up, I rebased the commits and ran the Hera Intel coverage suite and all tests passed. This PR is ready to be re-tested.

@MichaelLueken MichaelLueken added the run_we2e_coverage_tests Run the coverage set of SRW end-to-end tests label Jun 14, 2023
@MichaelLueken
Copy link
Collaborator

@mkavulich The coverage WE2E tests were manually ran on Orion earlier this morning and all tests successfully passed.

The Jenkins tests are failing in the grid_RRFS_CONUS_25km_ics_FV3GFS_lbcs_FV3GFS_suite_GFS_v15p2 test on Jet. Specifically, I see:

FATAL from PE 1: compute_qs: saturation vapor pressure table overflow, nbad= 1

in the log files. This wasn't an issue from testing PR #799 in Jenkins (nor any of the other changes that have been tested since it was merged). Running your changes manually, this test is successfully passing for me.

The correct DT_ATMOS value is being used (150), so this shouldn't be an issue. Please see:

/mnt/lfs4/HFIP/hfv3gfs/role.epic/jenkins/workspace/fs-srweather-app_pipeline_PR-811/expt_dirs/grid_RRFS_CONUS_25km_ics_FV3GFS_lbcs_FV3GFS_suite_GFS_v15p2

for the results of the failing test.

Additionally, the grid_RRFS_AK_3km_ics_FV3GFS_lbcs_FV3GFS_suite_HRRR test is failing while manually running the tests (this test is passing in Jenkins), with the following error message:

slurmstepd: error: Detected 2 oom-kill event(s) in StepId=29406268.0. Some of your processes may have been killed by the cgroup out-of-memory handler.
srun: error: s40: task 526: Out Of Memory
srun: launch/slurm: _step_signal: Terminating StepId=29406268.0
slurmstepd: error: *** STEP 29406268.0 ON s3 CANCELLED AT 2023-06-14T20:07:33 ***

If you could take a look at these two tests, I would be greatly appreciative. Thanks!

@mkavulich
Copy link
Collaborator Author

@MichaelLueken I have no insight into those errors. I re-ran the coverage tests on Jet and all were successful. Perhaps this was a random system hiccup? Can we kick off the automated tests again to see if they succeed?

If this sort of thing continues to happen on Jet specifically we may want to investigate if the model is having different behavior on certain partitions. If this is the case we may need to modify the failing test(s) to request a certain amount of memory. This page details the different specs for the various Jet partitions; I wonder if the failing jobs landed on xJet, which has the least memory per node.

@mkavulich mkavulich removed the DO_NOT_MERGE Ensure that a PR isn't merged label Jun 14, 2023
@MichaelLueken
Copy link
Collaborator

@mkavulich The rerun this morning on Jet has successfully passed. Moving forward with the merging this PR now.

@MichaelLueken MichaelLueken merged commit 54bd939 into ufs-community:develop Jun 15, 2023
4 of 5 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
run_we2e_coverage_tests Run the coverage set of SRW end-to-end tests
Projects
Status: Done
Development

Successfully merging this pull request may close these issues.

Add SRW support for FV3_RAP suite
3 participants