Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[develop] First iteration of overhauling WE2E tests #686

Merged

Conversation

mkavulich
Copy link
Collaborator

@mkavulich mkavulich commented Mar 21, 2023

DESCRIPTION OF CHANGES:

This is the first set of changes overhauling the WE2E test suites. This represents steps 1 and 2 of the overhaul process as described in #587 (comment). In addition, some quality-of-life improvements to the WE2E test scripts are included.

Major takeaways:

  • Comprehensive tests now take ~1 hour vs ~2 hours on Hera (depends on queue wait times), core hours reduced ~33% (~14k --> ~9k), all while increasing the test coverage of various capabilities.
  • Total number of unique tests reduced from to 93 to 71; 46 of which are included in the "comprehensive" suite.

To avoid cluttering this PR message, a more detailed summary of changes and the test savings can be found in this comment on issue 587: #587 (comment)

Test changes

  • Removed grid
    • RRFS_SUBCONUS_3km
  • Removed tests
    • grid_RRFS_CONUS_25km_ics_FV3GFS_lbcs_FV3GFS_suite_GFS_2017_gfdlmp_regional
    • grid_RRFS_CONUScompact_25km_ics_HRRR_lbcs_RAP_suite_HRRR
    • grid_RRFS_CONUScompact_25km_ics_HRRR_lbcs_HRRR_suite_RRFS_v1beta
    • grid_RRFS_CONUScompact_3km_ics_HRRR_lbcs_RAP_suite_GFS_v15p2
    • community_ensemble_008mems
    • custom_GFDLgrid__GFDLgrid_USE_NUM_CELLS_IN_FILENAMES_eq_TRUE
    • get_from_HPSS_ics_FV3GFS_lbcs_FV3GFS_fmt_grib2_2019101818
    • get_from_HPSS_ics_FV3GFS_lbcs_FV3GFS_fmt_grib2_2021010100
    • get_from_HPSS_ics_FV3GFS_lbcs_FV3GFS_fmt_nemsio_2019101818
    • get_from_HPSS_ics_FV3GFS_lbcs_FV3GFS_fmt_nemsio_2021010100
    • get_from_HPSS_ics_FV3GFS_lbcs_FV3GFS_fmt_netcdf_2021062000
  • Combined tests (In the cases where one test was absorbed into another, a symlink remains for the old test)
    • grid_SUBCONUS_Ind_3km_ics_HRRR_lbcs_RAP_suite_RRFS_v1beta + community_ensemble_2mems_stoch
    • grid_RRFS_CONUS_25km_ics_FV3GFS_lbcs_FV3GFS_suite_GFS_2017_gfdlmp + community_ensemble_2mems
    • grid_RRFS_CONUS_25km_ics_FV3GFS_lbcs_FV3GFS_suite_GFS_v15p2 + inline_post
    • grid_RRFS_CONUScompact_25km_ics_HRRR_lbcs_HRRR_suite_HRRR + specify_DT_ATMOS_LAYOUT_XY_BLOCKSIZE
    • grid_RRFS_CONUScompact_25km_ics_HRRR_lbcs_RAP_suite_RRFS_v1beta + specify_DOT_OR_USCORE
    • grid_RRFS_CONUS_25km_ics_FV3GFS_lbcs_FV3GFS_suite_GFS_v16 + specify_RESTART_INTERVAL
    • grid_RRFS_CONUS_25km_ics_FV3GFS_lbcs_RAP_suite_HRRR + MET_verification
    • grid_RRFS_CONUS_25km_ics_NAM_lbcs_NAM_suite_GFS_v16 + MET_ensemble_verification
  • Changed tests; Several tests were modified to run quicker and more efficiently:
    • grid_CONUS_3km_GFDLgrid_ics_FV3GFS_lbcs_FV3GFS_suite_GFS_v16.yaml Changed physics suite to more resolution-appropriate and cheaper FV3_GFS_v16, reduced forecast hours from 6 to 3
    • grid_RRFS_AK_3km_ics_FV3GFS_lbcs_FV3GFS_suite_HRRR Changed physics suite to more resolution-appropriate and cheaper FV3_HRRR, reduced forecast hours from 6 to 3.
    • grid_RRFS_CONUS_3km_ics_FV3GFS_lbcs_FV3GFS_suite_GFS_v16 Reduced forecast hours from 6 to 3.
    • grid_RRFS_CONUScompact_3km_ics_FV3GFS_lbcs_FV3GFS_suite_GFS_v16 Saved ~50% core hours (~1100 -> ~600)
    • grid_RRFS_NA_3km_ics_FV3GFS_lbcs_FV3GFS_suite_RRFS_v1beta Reduced forecast hours from 6 to 3, increased DT_ATMOS to 40, increased OMP_NUM_THREADS_MAKE_OROG and NNODES_MAKE_ICS.
    • Additionally, the SUBCONUS_Ind_3km-domain tests were re-arranged to ensure all input IC/LBC combinations are tested on a 3-km domain.

Testing script updates:

  • Track the start and end time in yaml file so the walltime is more accurate (ignoring user interruptions)
  • Print total walltime when experiment is ended
  • Refuse to allow ctrl-c interrupt while yaml file is being written (to avoid corrupted file)
  • When creating a summary file from an experiment directory, sort the list to make comparisons easier

Type of change

  • Bug fix (non-breaking change which fixes an issue)
  • Breaking change (fix or feature that would cause existing functionality to not work as expected)
  • This change requires a documentation update

TESTS CONDUCTED:

Ran comprehensive and fundamental tests on Hera and Jet, as described above. Ran fundamental tests on Orion (intel).

  • hera.intel
  • orion.intel
  • cheyenne.intel
  • cheyenne.gnu
  • gaea.intel
  • jet.intel
  • wcoss2.intel
  • NOAA Cloud (indicate which platform)
  • Jenkins
  • fundamental test suite
  • comprehensive tests (specify which if a subset was used)

DEPENDENCIES:

None

DOCUMENTATION:

WE2E test documentation will need updating, but will handle this in next round of changes.

ISSUE:

Partially addresses #587

CHECKLIST

  • My code follows the style guidelines in the Contributor's Guide
  • I have performed a self-review of my own code using the Code Reviewer's Guide
  • My changes do not require updates to the documentation (explain).
    • As described above, will update documentation in next PR
  • My changes generate no new warnings
  • New and existing tests pass with my changes

 - No need to test GFS_v15p2 on more than one 3km domain
 - Two RRFS_CONUScompact_25km tests were practically identical to others
 - grid_RRFS_CONUS_25km_ics_FV3GFS_lbcs_FV3GFS_suite_GFS_2017_gfdlmp_regional
   is identical to grid_RRFS_CONUS_25km_ics_FV3GFS_lbcs_FV3GFS_suite_GFS_2017_gfdlmp_regional_plot
   except for the plotting tasks at the end
test an old obscure functionality where the number of leading zeroes in
NUM_ENS_MEMBERS would be carried over to the task names, but this hasn't
been the case for a long time (ensemble numbers are padded to 3 digits
regardless)
community_ensemble_2mems_stoch into a single test.
…dlmp

and community_ensemble_2mems_stoch into a single test.
specify_DT_ATMOS_LAYOUT_XY_BLOCKSIZE into a single test; combine
grid_RRFS_CONUScompact_25km_ics_HRRR_lbcs_RAP_suite_RRFS_v1beta and
specify_DOT_OR_USCORE into a single test.
MET_verification into a single test; combine
grid_RRFS_CONUS_25km_ics_NAM_lbcs_NAM_suite_HRRR and
MET_ensemble_verification into a single test.
 - Track start time and total walltime in yaml file
 - Print total walltime of a given experiment when it is complete
 - Refuse user interrupt with ctrl-c if YAML file is being currently
written (to avoid printing malformed YAML)
 - Change config.grid_RRFS_AK_3km_ics_FV3GFS_lbcs_FV3GFS_suite_GFS_v16.yaml
   to use HRRR suite (cuts run time in half)
 - Remove known failure grid_RRFS_CONUS_25km_ics_GSMGFS_lbcs_GSMGFS_suite_GFS_v16
 - Fix grid_RRFS_CONUS_25km_ics_FV3GFS_lbcs_FV3GFS_suite_GFS_2017_gfdlmp_regional_plot
   test name
to use RRFS_v1beta; also cuts runtime and core hours in half
 - Only run two cycles for
grid_RRFS_CONUS_25km_ics_FV3GFS_lbcs_FV3GFS_suite_GFS_2017_gfdlmp
 - Change physics suite for
grid_RRFS_CONUS_25km_ics_NAM_lbcs_NAM_suite_HRRR to
grid_RRFS_CONUS_25km_ics_NAM_lbcs_NAM_suite_GFS_v16
…NUM_THREADS_MAKE_OROG, increase NNODES_MAKE_ICS
@mkavulich mkavulich changed the title Feature/overhaul we2 e test suites [develop] First iteration of overhauling WE2E tests Mar 21, 2023
Copy link
Collaborator

@MichaelLueken MichaelLueken left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@mkavulich Thanks for creating the first iteration of changes related to overhauling the WE2E tests!

Overall, these changes look good to me, however, I did note that some of the changes, especially to either the physics suite being used or ICs/LBCs being derived weren't updated in the descriptions. I have noted these issues when I found them in my review.

@mkavulich
Copy link
Collaborator Author

mkavulich commented Mar 21, 2023

@MichaelLueken Thanks for your comments, I have made all those necessary changes.

An update on testing: I did not re-run the fundamental tests after incorporating changes from develop until now. When I did, I found that verification tests were failing on Hera (since fundamental tests on Hera are run in NCO mode); this is a known issue described in issue #688

Copy link
Collaborator

@MichaelLueken MichaelLueken left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@mkavulich Thank you very much for addressing my comments! I will go ahead and approve this work now.

Please note that neither Jenkins nor I are able to connect to Gaea right now, so the tests will likely take some time to complete.

@MichaelLueken MichaelLueken added the run_we2e_coverage_tests Run the coverage set of SRW end-to-end tests label Mar 22, 2023
@MichaelLueken
Copy link
Collaborator

@mkavulich The Jenkins tests have passed for all systems, except for Gaea (still down) and Cheyenne Intel, where the grid_RRFS_CONUS_25km_ics_NAM_lbcs_NAM_suite_GFS_v16 test failed with a status of DEAD.

@mkavulich
Copy link
Collaborator Author

@MichaelLueken Can you point me to the location of failing tests on Cheyenne? I suspect this is due to missing data in the old staged-data location since #674 has not been merged yet.

@MichaelLueken
Copy link
Collaborator

@mkavulich Sure, the directory on Cheyenne is:

/glade/scratch/epicufsrt/jenkins/workspace/fs-srweather-app_pipeline_PR-686__2/expt_dirs/grid_RRFS_CONUS_25km_ics_NAM_lbcs_NAM_suite_GFS_v16

@mkavulich
Copy link
Collaborator Author

Okay good, it is indeed a failure due to missing data as I suspected. I will talk to @natalie-perlin to see about getting that data moved to the new data locations.

@MichaelLueken
Copy link
Collaborator

@mkavulich The Gaea test logs were manually reviewed and all successfully passed. Do you know whether the necessary data has been moved to the new data locations? Once Jenkins comes back up, I can resubmit the Cheyenne Intel suite. Thanks!

@mkavulich
Copy link
Collaborator Author

@MichaelLueken I am still waiting on one set of observations to be staged so all verification tests can pass. I will let you know when they are in place.

@mkavulich
Copy link
Collaborator Author

@MichaelLueken I have merged the latest changes and it looks like all the data is in place that needs to be. I think I'm ready for you to kick off the Jenkins workflow now.

@MichaelLueken
Copy link
Collaborator

@mkavulich Thanks! I have resubmitted the Jenkins tests now.

@mkavulich
Copy link
Collaborator Author

@MichaelLueken I see from other PR discussions that the "aborted" message is a Jenkins pipeline problem communicating with Jet. I can see by checking manually that all tests passed on platforms I have access to (Cheyenne, Orion, Jet) and Hera definitely passes since that's where I ran all my manual tests. Can this be merged?

@MichaelLueken
Copy link
Collaborator

@mkavulich Yes, the Jenkins tests successfully passed and my own manual tests on Jet also passed. There are no XML file changes, so I will move forward with merging this work now.

@MichaelLueken MichaelLueken merged commit bd80d94 into ufs-community:develop Mar 29, 2023
@clouden90 clouden90 mentioned this pull request Aug 17, 2023
33 tasks
MichaelLueken pushed a commit that referenced this pull request Aug 18, 2023
Update the WE2E test used in srw_metric_example.sh to reflect the changes that the WE2E test "grid_SUBCONUS_Ind_3km_ics_FV3GFS_lbcs_FV3GFS_suite_GFS_v16" has been substituted with "grid_SUBCONUS_Ind_3km_ics_FV3GFS_lbcs_FV3GFS_suite_WoFS_v0" in #686 (The GFS v16 suite is not intended for use at such high resolutions, so using the WoFS_v0 suite was more appropriate in this case).
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
run_we2e_coverage_tests Run the coverage set of SRW end-to-end tests
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants