Add the remaining UFS Case Studies #1081

Merged

Conversation

EdwardSnyder-NOAA
Collaborator

DESCRIPTION OF CHANGES:

This PR adds the remaining UFS Case Studies to the SRW App as WE2E tests. The new tests were added to the comprehensive and coverage files as well. Please note that Hurricane Michael's initial and boundary conditions are too old and do not contain enough data for the current physics suites, resulting in a failure during the make_ics step.

The pre-existing UFS Case Study WE2E tests were modified to increase the compute resources for the get_extrn_lbcs task. The additional resources cut the get_extrn_lbcs run time in half, to under two hours.
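A minimal sketch of what that resource change looks like in a WE2E test config, assuming the usual per-task layout of these files (the surrounding keys and filename are illustrative, not copied from the PR):

```yaml
# tests/WE2E/test_configs/ufs_case_studies/<case>.yaml (fragment, hypothetical)
# nnodes: 2 gives the data-retrieval task a second node,
# roughly halving the fetch/un-tar time for the full 90-hour cases.
task_get_extrn_lbcs:
  nnodes: 2
```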

These tests ran on PW AWS and can be found here: /contrib/Edward.Snyder/ufs-case-studies/all/expt_dirs

Type of change

  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to not work as expected)
  • This change requires a documentation update

TESTS CONDUCTED:

  • hera.intel
  • orion.intel
  • hercules.intel
  • cheyenne.intel
  • cheyenne.gnu
  • derecho.intel
  • gaea.intel
  • gaeac5.intel
  • jet.intel
  • wcoss2.intel
  • NOAA Cloud (indicate which platform): PW AWS
  • Jenkins
  • fundamental test suite
  • comprehensive tests (specify which if a subset was used)

2020 Easter Sunday Storm
Wind forecast with the FV3_GFS_v16 physics suite matches well with the RAP analysis used in the case study.
10mwind_conus_f072

2019 Memorial Day Heat Wave
The temperature forecast with the FV3_GFS_v16 physics suite appears to have a reduced warm bias compared to the SRW_GFSv15p2 suite used in the case study.
2mt_conus_f090

2020 January Cold Blast
Using the FV3_GFS_v16 physics suite, it appears the cold bias was reduced compared to the physics suites used in the case study.
2mt_conus_f072

DEPENDENCIES:

DOCUMENTATION:

ISSUE:

CHECKLIST

  • My code follows the style guidelines in the Contributor's Guide
  • I have performed a self-review of my own code using the Code Reviewer's Guide
  • I have commented my code, particularly in hard-to-understand areas
  • My changes need updates to the documentation. I have made corresponding changes to the documentation
  • My changes do not require updates to the documentation (explain).
  • My changes generate no new warnings
  • New and existing tests pass with my changes
  • Any dependent changes have been merged and published

LABELS (optional):

A Code Manager needs to add the following labels to this PR:

  • Work In Progress
  • bug
  • enhancement
  • documentation
  • release
  • high priority
  • run_ci
  • run_we2e_fundamental_tests
  • run_we2e_comprehensive_tests
  • Needs Cheyenne test
  • Needs Jet test
  • Needs Hera test
  • Needs Orion test
  • help wanted

CONTRIBUTORS (optional):


@MichaelLueken (Collaborator) left a comment


@EdwardSnyder-NOAA -

These changes look good to me! I was also able to successfully run the new WE2E tests on Hera:

----------------------------------------------------------------------------------------------------
Experiment name                                                  | Status    | Core hours used 
----------------------------------------------------------------------------------------------------
2019_hurricane_lorenzo_20240509140026                              COMPLETE             109.80
2019_memorial_day_heat_wave_20240509140027                         COMPLETE              97.81
2020_denver_radiation_inversion_20240509140027                     COMPLETE             110.90
2020_easter_storm_20240509140028                                   COMPLETE             149.43
2020_jan_cold_blast_20240509140029                                 COMPLETE             109.92
----------------------------------------------------------------------------------------------------
Total                                                              COMPLETE             577.86

I will be holding off on submitting the Jenkins tests until the HPSS library maintenance has concluded. I will be running them first thing in the morning.

Approving now.

@RatkoVasic-NOAA
Collaborator

@EdwardSnyder-NOAA all ufs_case_studies tests are failing on Hercules and Orion in the get_extrn_lbcs task. When we use partition=service in the job card, Hercules and Orion are limited to a single node (and refuse to submit the task).
Since this job only retrieves data (correct me if I'm wrong), there is no advantage to using more than one node. Do you recall any reason for setting nnodes: 2 in the tests/WE2E/test_configs/ufs_case_studies/*.yaml files?
When I manually changed <nodes>2:ppn=24</nodes> to <nodes>1:ppn=24</nodes> in FV3LAM_wflow.xml, it worked OK.

@EdwardSnyder-NOAA
Collaborator Author

@RatkoVasic-NOAA - I added an extra node for the get_extrn_lbcs task so that the job finishes more quickly. These tests pull large tar files of nemsio data from the AWS S3 bucket. The whole process of fetching and un-tarring the files takes up to four hours on a single node when running the full case experiment (forecast hour set to 90); the additional node cuts that run time in half. I wasn't aware of these compute limitations on the other Tier-1 platforms, so I can remove the extra node.
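The bottleneck described here is I/O-bound fetch-and-untar work. As a rough illustration only (this is not the SRW App's actual retrieve_data logic, and `untar_all` is a hypothetical helper), extracting a batch of already-downloaded archives concurrently looks like:

```python
import tarfile
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path

def untar_all(archives, dest, workers=4):
    """Extract several tarballs concurrently into dest.

    Un-tarring is mostly disk I/O, so a thread pool (or, in the real
    workflow, extra nodes) overlaps the work across archives.
    """
    dest = Path(dest)
    dest.mkdir(parents=True, exist_ok=True)

    def extract(archive):
        with tarfile.open(archive) as tf:
            tf.extractall(dest)
        return archive

    with ThreadPoolExecutor(max_workers=workers) as pool:
        # map preserves input order; each archive is extracted once
        return list(pool.map(extract, archives))
```

In the actual tests the parallelism comes from the second node in the job card rather than threads; the sketch just shows why splitting the archive work reduces wall-clock time.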

@RatkoVasic-NOAA
Collaborator

@RatkoVasic-NOAA - I added an extra node for the get_extrn_lbcs task so that the job finishes more quickly. These tests pull large tar files of nemsio data from the AWS S3 bucket. The whole process of fetching and un-tarring the files takes up to four hours on a single node when running the full case experiment (forecast hour set to 90); the additional node cuts that run time in half. I wasn't aware of these compute limitations on the other Tier-1 platforms, so I can remove the extra node.

Great, I'm running just those tests on Orion and Hercules now.

@RatkoVasic-NOAA
Collaborator

Selected tests passed on Orion:

Experiment name                                                  | Status    | Core hours used
----------------------------------------------------------------------------------------------------
2019_halloween_storm_20240513095113                                COMPLETE              70.05
2019_hurricane_barry_20240513095115                                COMPLETE              69.63
2019_hurricane_lorenzo_20240513095116                              COMPLETE              70.95
2019_memorial_day_heat_wave_20240513095117                         COMPLETE              67.03
2020_CAD_20240513095117                                            COMPLETE              68.03
2020_CAPE_20240513095118                                           COMPLETE              69.18
2020_denver_radiation_inversion_20240513095119                     COMPLETE              69.26
2020_easter_storm_20240513095120                                   COMPLETE              70.33
2020_jan_cold_blast_20240513095120                                 COMPLETE              72.68
----------------------------------------------------------------------------------------------------
Total                                                              COMPLETE             627.14

and Hercules:

Experiment name                                                  | Status    | Core hours used
----------------------------------------------------------------------------------------------------
2019_halloween_storm_20240513095157                                COMPLETE             403.56
2019_hurricane_barry_20240513095158                                COMPLETE             397.70
2019_hurricane_lorenzo_20240513095159                              COMPLETE              43.51
2019_memorial_day_heat_wave_20240513095200                         COMPLETE              41.52
2020_CAD_20240513095200                                            COMPLETE              43.03
2020_CAPE_20240513095201                                           COMPLETE             415.92
2020_denver_radiation_inversion_20240513095202                     COMPLETE              45.34
2020_easter_storm_20240513095202                                   COMPLETE              43.06
2020_jan_cold_blast_20240513095203                                 COMPLETE              44.40
----------------------------------------------------------------------------------------------------
Total                                                              COMPLETE            1478.04

Approving.

@MichaelLueken added the run_we2e_coverage_tests label (Run the coverage set of SRW end-to-end tests) on May 13, 2024
@MichaelLueken
Collaborator

MichaelLueken commented May 14, 2024

The get_from_HPSS_ics_FV3GFS_lbcs_FV3GFS_fmt_netcdf_2022060112_48h test on Jet failed. The test failed in the make_lbcs task with the following error:

slurmstepd: error: Detected 1 oom_kill event in StepId=3675877.0. Some of the step tasks have been OOM Killed.
srun: error: s40: task 0: Out Of Memory
srun: Terminating StepId=3675877.0

Once Jet returns from maintenance, I will attempt to rerun the test. Once it passes, I will be able to move forward with merging this work.

@RatkoVasic-NOAA
Collaborator

Once Jet returns from maintenance, I will attempt to rerun the test. Once it passes, I will be able to move forward with merging this work.

HPSS is on maintenance as well, until 10PM EDT.

@MichaelLueken
Collaborator

The rerun of the get_from_HPSS_ics_FV3GFS_lbcs_FV3GFS_fmt_netcdf_2022060112_48h WE2E test on Jet this morning successfully passed:

----------------------------------------------------------------------------------------------------
Experiment name                                                  | Status    | Core hours used 
----------------------------------------------------------------------------------------------------
community_20240513175643                                           COMPLETE              18.43
custom_ESGgrid_20240513175644                                      COMPLETE              26.57
custom_ESGgrid_Great_Lakes_snow_8km_20240513175645                 COMPLETE              20.27
custom_GFDLgrid_20240513175647                                     COMPLETE              11.34
get_from_HPSS_ics_FV3GFS_lbcs_FV3GFS_fmt_nemsio_2021032018_202405  COMPLETE               8.85
get_from_HPSS_ics_FV3GFS_lbcs_FV3GFS_fmt_netcdf_2022060112_48h_20  COMPLETE              84.74
get_from_HPSS_ics_RAP_lbcs_RAP_20240513175650                      COMPLETE              16.45
grid_RRFS_AK_3km_ics_FV3GFS_lbcs_FV3GFS_suite_HRRR_20240513175651  COMPLETE             606.36
grid_RRFS_CONUS_13km_ics_FV3GFS_lbcs_FV3GFS_suite_GFS_v16_plot_20  COMPLETE              66.45
grid_RRFS_CONUS_25km_ics_FV3GFS_lbcs_FV3GFS_suite_GFS_v15p2_20240  COMPLETE               8.77
grid_RRFS_CONUS_3km_ics_FV3GFS_lbcs_FV3GFS_suite_RRFS_v1beta_2024  COMPLETE             926.06
----------------------------------------------------------------------------------------------------
Total                                                              COMPLETE            1794.29

With this, all of the coverage tests have successfully completed.

Moving forward with merging this PR now.

@MichaelLueken merged commit 59c78fb into ufs-community:develop on May 15, 2024
3 of 5 checks passed