[develop] First iteration of overhauling WE2E tests #686

mkavulich · 2023-03-21T03:47:38Z

DESCRIPTION OF CHANGES:

This is the first set of changes overhauling the WE2E test suites. This represents steps 1 and 2 of the overhaul process as described in #587 (comment). In addition, some quality-of-life improvements to the WE2E test scripts are included.

Major takeaways:

Comprehensive tests now take ~1 hour vs ~2 hours on Hera (depends on queue wait times), core hours reduced ~33% (~14k --> ~9k), all while increasing the test coverage of various capabilities.
Total number of unique tests reduced from to 93 to 71; 46 of which are included in the "comprehensive" suite.

To avoid cluttering this PR message, a more detailed summary of changes and the test savings can be found in this comment on issue 587: #587 (comment)

Test changes

Removed grid
- RRFS_SUBCONUS_3km
Removed tests
- grid_RRFS_CONUS_25km_ics_FV3GFS_lbcs_FV3GFS_suite_GFS_2017_gfdlmp_regional
- grid_RRFS_CONUScompact_25km_ics_HRRR_lbcs_RAP_suite_HRRR
- grid_RRFS_CONUScompact_25km_ics_HRRR_lbcs_HRRR_suite_RRFS_v1beta
- grid_RRFS_CONUScompact_3km_ics_HRRR_lbcs_RAP_suite_GFS_v15p2
- community_ensemble_008mems
- custom_GFDLgrid__GFDLgrid_USE_NUM_CELLS_IN_FILENAMES_eq_TRUE
- get_from_HPSS_ics_FV3GFS_lbcs_FV3GFS_fmt_grib2_2019101818
- get_from_HPSS_ics_FV3GFS_lbcs_FV3GFS_fmt_grib2_2021010100
- get_from_HPSS_ics_FV3GFS_lbcs_FV3GFS_fmt_nemsio_2019101818
- get_from_HPSS_ics_FV3GFS_lbcs_FV3GFS_fmt_nemsio_2021010100
- get_from_HPSS_ics_FV3GFS_lbcs_FV3GFS_fmt_netcdf_2021062000
Combined tests (In the cases where one test was absorbed into another, a symlink remains for the old test)
- grid_SUBCONUS_Ind_3km_ics_HRRR_lbcs_RAP_suite_RRFS_v1beta + community_ensemble_2mems_stoch
- grid_RRFS_CONUS_25km_ics_FV3GFS_lbcs_FV3GFS_suite_GFS_2017_gfdlmp + community_ensemble_2mems
- grid_RRFS_CONUS_25km_ics_FV3GFS_lbcs_FV3GFS_suite_GFS_v15p2 + inline_post
- grid_RRFS_CONUScompact_25km_ics_HRRR_lbcs_HRRR_suite_HRRR + specify_DT_ATMOS_LAYOUT_XY_BLOCKSIZE
- grid_RRFS_CONUScompact_25km_ics_HRRR_lbcs_RAP_suite_RRFS_v1beta + specify_DOT_OR_USCORE
- grid_RRFS_CONUS_25km_ics_FV3GFS_lbcs_FV3GFS_suite_GFS_v16 + specify_RESTART_INTERVAL
- grid_RRFS_CONUS_25km_ics_FV3GFS_lbcs_RAP_suite_HRRR + MET_verification
- grid_RRFS_CONUS_25km_ics_NAM_lbcs_NAM_suite_GFS_v16 + MET_ensemble_verification
Changed tests; Several tests were modified to run quicker and more efficiently:
- grid_CONUS_3km_GFDLgrid_ics_FV3GFS_lbcs_FV3GFS_suite_GFS_v16.yaml Changed physics suite to more resolution-appropriate and cheaper FV3_GFS_v16, reduced forecast hours from 6 to 3
- grid_RRFS_AK_3km_ics_FV3GFS_lbcs_FV3GFS_suite_HRRR Changed physics suite to more resolution-appropriate and cheaper FV3_HRRR, reduced forecast hours from 6 to 3.
- grid_RRFS_CONUS_3km_ics_FV3GFS_lbcs_FV3GFS_suite_GFS_v16 Reduced forecast hours from 6 to 3.
- grid_RRFS_CONUScompact_3km_ics_FV3GFS_lbcs_FV3GFS_suite_GFS_v16 Saved ~50% core hours (~1100 -> ~600)
- grid_RRFS_NA_3km_ics_FV3GFS_lbcs_FV3GFS_suite_RRFS_v1beta Reduced forecast hours from 6 to 3, increased DT_ATMOS to 40, increased OMP_NUM_THREADS_MAKE_OROG and NNODES_MAKE_ICS.
- Additionally, the SUBCONUS_Ind_3km-domain tests were re-arranged to ensure all input IC/LBC combinations are tested on a 3-km domain.

Testing script updates:

Track the start and end time in yaml file so the walltime is more accurate (ignoring user interruptions)
Print total walltime when experiment is ended
Refuse to allow ctrl-c interrupt while yaml file is being written (to avoid corrupted file)
When creating a summary file from an experiment directory, sort the list to make comparisons easier

Type of change

Bug fix (non-breaking change which fixes an issue)
Breaking change (fix or feature that would cause existing functionality to not work as expected)
This change requires a documentation update

TESTS CONDUCTED:

Ran comprehensive and fundamental tests on Hera and Jet, as described above. Ran fundamental tests on Orion (intel).

DEPENDENCIES:

None

DOCUMENTATION:

WE2E test documentation will need updating, but will handle this in next round of changes.

ISSUE:

Partially addresses #587

CHECKLIST

My code follows the style guidelines in the Contributor's Guide
I have performed a self-review of my own code using the Code Reviewer's Guide
My changes do not require updates to the documentation (explain).
- As described above, will update documentation in next PR
My changes generate no new warnings
New and existing tests pass with my changes

- No need to test GFS_v15p2 on more than one 3km domain - Two RRFS_CONUScompact_25km tests were practically identical to others - grid_RRFS_CONUS_25km_ics_FV3GFS_lbcs_FV3GFS_suite_GFS_2017_gfdlmp_regional is identical to grid_RRFS_CONUS_25km_ics_FV3GFS_lbcs_FV3GFS_suite_GFS_2017_gfdlmp_regional_plot except for the plotting tasks at the end

test an old obscure functionality where the number of leading zeroes in NUM_ENS_MEMBERS would be carried over to the task names, but this hasn't been the case for a long time (ensemble numbers are padded to 3 digits regardless)

community_ensemble_2mems_stoch into a single test.

…dlmp and community_ensemble_2mems_stoch into a single test.

…NAMES_eq_TRUE

inline_post into a single test.

specify_DT_ATMOS_LAYOUT_XY_BLOCKSIZE into a single test; combine grid_RRFS_CONUScompact_25km_ics_HRRR_lbcs_RAP_suite_RRFS_v1beta and specify_DOT_OR_USCORE into a single test.

specify_RESTART_INTERVAL into a single test.

MET_verification into a single test; combine grid_RRFS_CONUS_25km_ics_NAM_lbcs_NAM_suite_HRRR and MET_ensemble_verification into a single test.

…rocessor count

- Track start time and total walltime in yaml file - Print total walltime of a given experiment when it is complete - Refuse user interrupt with ctrl-c if YAML file is being currently written (to avoid printing malformed YAML)

- Change config.grid_RRFS_AK_3km_ics_FV3GFS_lbcs_FV3GFS_suite_GFS_v16.yaml to use HRRR suite (cuts run time in half) - Remove known failure grid_RRFS_CONUS_25km_ics_GSMGFS_lbcs_GSMGFS_suite_GFS_v16 - Fix grid_RRFS_CONUS_25km_ics_FV3GFS_lbcs_FV3GFS_suite_GFS_2017_gfdlmp_regional_plot test name

to use RRFS_v1beta; also cuts runtime and core hours in half

… tasks now do that

…risons are easier

- Only run two cycles for grid_RRFS_CONUS_25km_ics_FV3GFS_lbcs_FV3GFS_suite_GFS_2017_gfdlmp - Change physics suite for grid_RRFS_CONUS_25km_ics_NAM_lbcs_NAM_suite_HRRR to grid_RRFS_CONUS_25km_ics_NAM_lbcs_NAM_suite_GFS_v16

…NUM_THREADS_MAKE_OROG, increase NNODES_MAKE_ICS

MichaelLueken

@mkavulich Thanks for creating the first iteration of changes related to overhauling the WE2E tests!

Overall, these changes look good to me, however, I did note that some of the changes, especially to either the physics suite being used or ICs/LBCs being derived weren't updated in the descriptions. I have noted these issues when I found them in my review.

...n_mdls_suites_community/config.grid_SUBCONUS_Ind_3km_ics_RAP_lbcs_RAP_suite_RRFS_v1beta.yaml

...extrn_mdls_suites_community/config.grid_SUBCONUS_Ind_3km_ics_NAM_lbcs_NAM_suite_GFS_v16.yaml

..._extrn_mdls_suites_community/config.grid_RRFS_CONUS_25km_ics_NAM_lbcs_NAM_suite_GFS_v16.yaml

...s_extrn_mdls_suites_community/config.grid_RRFS_AK_3km_ics_FV3GFS_lbcs_FV3GFS_suite_HRRR.yaml

...uites_community/config.grid_CONUS_3km_GFDLgrid_ics_FV3GFS_lbcs_FV3GFS_suite_RRFS_v1beta.yaml

mkavulich · 2023-03-21T18:12:18Z

@MichaelLueken Thanks for your comments, I have made all those necessary changes.

An update on testing: I did not re-run the fundamental tests after incorporating changes from develop until now. When I did, I found that verification tests were failing on Hera (since fundamental tests on Hera are run in NCO mode); this is a known issue described in issue #688

MichaelLueken

@mkavulich Thank you very much for addressing my comments! I will go ahead and approve this work now.

Please note that neither Jenkins nor I are able to connect to Gaea right now, so the tests will likely take some time to complete.

MichaelLueken · 2023-03-22T13:38:25Z

@mkavulich The Jenkins tests have passed for all systems, except for Gaea (still down) and Cheyenne Intel, where the grid_RRFS_CONUS_25km_ics_NAM_lbcs_NAM_suite_GFS_v16 test failed with a status of DEAD.

mkavulich · 2023-03-22T17:29:23Z

@MichaelLueken Can you point me to the location of failing tests on Cheyenne? I suspect this is due to missing data in the old staged-data location since #674 has not been merged yet.

MichaelLueken · 2023-03-22T18:35:49Z

@mkavulich Sure, the directory on Cheyenne is:

/glade/scratch/epicufsrt/jenkins/workspace/fs-srweather-app_pipeline_PR-686__2/expt_dirs/grid_RRFS_CONUS_25km_ics_NAM_lbcs_NAM_suite_GFS_v16

mkavulich · 2023-03-22T19:48:34Z

Okay good, it is indeed a failure due to missing data as I suspected. I will talk to @natalie-perlin to see about getting that data moved to the new data locations.

MichaelLueken · 2023-03-24T16:54:23Z

@mkavulich The Gaea test logs were manually reviewed and all successfully passed. Do you know whether the necessary data has been moved to the new data locations? Once Jenkins comes back up, I can resubmit the Cheyenne Intel suite. Thanks!

mkavulich · 2023-03-24T21:06:58Z

@MichaelLueken I am still waiting on one set of observations to be staged so all verification tests can pass. I will let you know when they are in place.

…_WE2E_test_suites

mkavulich · 2023-03-27T19:32:53Z

@MichaelLueken I have merged the latest changes and it looks like all the data is in place that needs to be. I think I'm ready for you to kick off the Jenkins workflow now.

MichaelLueken · 2023-03-27T19:42:51Z

@mkavulich Thanks! I have resubmitted the Jenkins tests now.

mkavulich · 2023-03-29T16:28:10Z

@MichaelLueken I see from other PR discussions that the "aborted" message is a Jenkins pipeline problem communicating with Jet. I can see by checking manually that all tests passed on platforms I have access to (Cheyenne, Orion, Jet) and Hera definitely passes since that's where I ran all my manual tests. Can this be merged?

MichaelLueken · 2023-03-29T16:32:43Z

@mkavulich Yes, the Jenkins tests successfully passed and my own manual tests on Jet also passed. There are no XML file changes, so I will move forward with merging this work now.

Update the WE2E test used in srw_metric_example.sh to reflect the changes that the WE2E test "grid_SUBCONUS_Ind_3km_ics_FV3GFS_lbcs_FV3GFS_suite_GFS_v16" has been substituted with "grid_SUBCONUS_Ind_3km_ics_FV3GFS_lbcs_FV3GFS_suite_WoFS_v0" in #686 (The GFS v16 suite is not intended for use at such high resolutions, so using the WoFS_v0 suite was more appropriate in this case).

mkavulich added 22 commits March 20, 2023 16:42

Give SUBCONUS_Ind_3km grid more varied ICs/LBCs

542c473

Remove "RRFS_SUBCONUS_3km" domain

1ad2551

Combine grid_SUBCONUS_Ind_3km_ics_HRRR_lbcs_HRRR_suite_HRRR and

49f7f68

community_ensemble_2mems_stoch into a single test.

Combine grid_RRFS_CONUS_25km_ics_FV3GFS_lbcs_FV3GFS_suite_GFS_2017_gf…

e4acd15

…dlmp and community_ensemble_2mems_stoch into a single test.

Remove redundant test custom_GFDLgrid__GFDLgrid_USE_NUM_CELLS_IN_FILE…

38660e9

…NAMES_eq_TRUE

Remove a lot of redundant "get_from_HPSS_" tests

50c869d

Combine grid_RRFS_CONUS_25km_ics_FV3GFS_lbcs_FV3GFS_suite_GFS_v15p2 and

4fe42af

inline_post into a single test.

Combine grid_RRFS_CONUScompact_25km_ics_HRRR_lbcs_HRRR_suite_HRRR and

aca327f

specify_DT_ATMOS_LAYOUT_XY_BLOCKSIZE into a single test; combine grid_RRFS_CONUScompact_25km_ics_HRRR_lbcs_RAP_suite_RRFS_v1beta and specify_DOT_OR_USCORE into a single test.

Combine grid_RRFS_CONUS_25km_ics_FV3GFS_lbcs_FV3GFS_suite_GFS_v16 and

e5c814a

specify_RESTART_INTERVAL into a single test.

Combine grid_RRFS_CONUS_25km_ics_FV3GFS_lbcs_RAP_suite_HRRR and

23dfc7d

MET_verification into a single test; combine grid_RRFS_CONUS_25km_ics_NAM_lbcs_NAM_suite_HRRR and MET_ensemble_verification into a single test.

Pare down long-running tasks; decrease fcst runtime and/or increase p…

c879102

…rocessor count

Updates to fix a few failed test updates

1d1e491

Some QOL improvements to python scripts

4b0383b

- Track start time and total walltime in yaml file - Print total walltime of a given experiment when it is complete - Refuse user interrupt with ctrl-c if YAML file is being currently written (to avoid printing malformed YAML)

Also update grid_CONUS_3km_GFDLgrid_ics_FV3GFS_lbcs_FV3GFS_suite_GFS_v16

51b2ed5

to use RRFS_v1beta; also cuts runtime and core hours in half

Don't report total walltime at end of monitor script since individual…

c303c9a

… tasks now do that

When reading an existing experiment directory, sort the list so compa…

fbf1d1f

…risons are easier

Fix links broken in rebase

59dbffb

Cheapen some consolidated tests

723763c

- Only run two cycles for grid_RRFS_CONUS_25km_ics_FV3GFS_lbcs_FV3GFS_suite_GFS_2017_gfdlmp - Change physics suite for grid_RRFS_CONUS_25km_ics_NAM_lbcs_NAM_suite_HRRR to grid_RRFS_CONUS_25km_ics_NAM_lbcs_NAM_suite_GFS_v16

Speed up 3km NA test a little: increase DT_ATMOS to 40, increase OMP_…

4da3f1e

…NUM_THREADS_MAKE_OROG, increase NNODES_MAKE_ICS

mkavulich requested review from gsketefian, JeffBeck-NOAA, RatkoVasic-NOAA, BenjaminBlake-NOAA, ywangwof, chan-hoo, panll and christinaholtNOAA as code owners March 21, 2023 03:47

mkavulich requested review from dmwright526, gspetro-NOAA, natalie-perlin, EdwardSnyder-NOAA and MichaelLueken as code owners March 21, 2023 03:47

mkavulich changed the title ~~Feature/overhaul we2 e test suites~~ [develop] First iteration of overhauling WE2E tests Mar 21, 2023

mkavulich mentioned this pull request Mar 21, 2023

Overhaul and consolidate WE2E tests, identify needed additional tests #587

Closed

MichaelLueken reviewed Mar 21, 2023

View reviewed changes

Correct test descriptions from Mike Leuken

8cedb03

MichaelLueken approved these changes Mar 22, 2023

View reviewed changes

MichaelLueken added the run_we2e_coverage_tests Run the coverage set of SRW end-to-end tests label Mar 22, 2023

chan-hoo approved these changes Mar 22, 2023

View reviewed changes

Merge remote-tracking branch 'upstream/develop' into feature/overhaul…

3613028

…_WE2E_test_suites

MichaelLueken merged commit bd80d94 into ufs-community:develop Mar 29, 2023

This was referenced Mar 29, 2023

[develop] Bug fix for broken symlink in verification test directory #704

Closed

run_post fails for NA_3km domain #705

Closed

mkavulich mentioned this pull request Apr 17, 2023

Test specify_DOT_OR_USCORE fails with GNU compiler #344

Closed

clouden90 mentioned this pull request Aug 17, 2023

[develop] Feature/metric update #890

Merged

33 tasks

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[develop] First iteration of overhauling WE2E tests #686

[develop] First iteration of overhauling WE2E tests #686

mkavulich commented Mar 21, 2023 •

edited

Loading

MichaelLueken left a comment

mkavulich commented Mar 21, 2023 •

edited

Loading

MichaelLueken left a comment

MichaelLueken commented Mar 22, 2023

mkavulich commented Mar 22, 2023

MichaelLueken commented Mar 22, 2023

mkavulich commented Mar 22, 2023

MichaelLueken commented Mar 24, 2023

mkavulich commented Mar 24, 2023

mkavulich commented Mar 27, 2023

MichaelLueken commented Mar 27, 2023

mkavulich commented Mar 29, 2023

MichaelLueken commented Mar 29, 2023

[develop] First iteration of overhauling WE2E tests #686

[develop] First iteration of overhauling WE2E tests #686

Conversation

mkavulich commented Mar 21, 2023 • edited Loading

DESCRIPTION OF CHANGES:

Major takeaways:

Test changes

Testing script updates:

Type of change

TESTS CONDUCTED:

DEPENDENCIES:

DOCUMENTATION:

ISSUE:

CHECKLIST

MichaelLueken left a comment

Choose a reason for hiding this comment

mkavulich commented Mar 21, 2023 • edited Loading

MichaelLueken left a comment

Choose a reason for hiding this comment

MichaelLueken commented Mar 22, 2023

mkavulich commented Mar 22, 2023

MichaelLueken commented Mar 22, 2023

mkavulich commented Mar 22, 2023

MichaelLueken commented Mar 24, 2023

mkavulich commented Mar 24, 2023

mkavulich commented Mar 27, 2023

MichaelLueken commented Mar 27, 2023

mkavulich commented Mar 29, 2023

MichaelLueken commented Mar 29, 2023

mkavulich commented Mar 21, 2023 •

edited

Loading

mkavulich commented Mar 21, 2023 •

edited

Loading