[develop] Jet switch from CentOS to Rocky #1045

Merged
merged 19 commits into ufs-community:develop
Mar 13, 2024

Conversation

RatkoVasic-NOAA
Collaborator

DESCRIPTION OF CHANGES:

Jet is switching from CentOS to Rocky OS.

Type of change

  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to not work as expected)
  • This change requires a documentation update

TESTS CONDUCTED:

  • hera.intel
  • orion.intel
  • hercules.intel
  • cheyenne.intel
  • cheyenne.gnu
  • derecho.intel
  • gaea.intel
  • gaeac5.intel
  • jet.intel
  • wcoss2.intel
  • NOAA Cloud (indicate which platform)
  • Jenkins
  • fundamental test suite
  • comprehensive tests

ISSUE:

Solves issue #1044

CHECKLIST

  • My code follows the style guidelines in the Contributor's Guide
  • I have performed a self-review of my own code using the Code Reviewer's Guide
  • I have commented my code, particularly in hard-to-understand areas
  • My changes need updates to the documentation. I have made corresponding changes to the documentation
  • My changes do not require updates to the documentation (explain).
  • My changes generate no new warnings
  • New and existing tests pass with my changes
  • Any dependent changes have been merged and published

LABELS (optional):

A Code Manager needs to add the following labels to this PR:

  • Work In Progress
  • bug
  • enhancement
  • documentation
  • release
  • high priority
  • run_ci
  • run_we2e_fundamental_tests
  • run_we2e_comprehensive_tests
  • Needs Cheyenne test
  • Needs Jet test
  • Needs Hera test
  • Needs Orion test
  • help wanted

@MichaelLueken MichaelLueken changed the title from "Jet switch from CentOS to Rocky" to "[develop] Jet switch from CentOS to Rocky" Feb 29, 2024
@MichaelLueken MichaelLueken linked an issue Feb 29, 2024 that may be closed by this pull request
@MichaelLueken MichaelLueken added the enhancement New feature or request label Feb 29, 2024
Collaborator

@MichaelLueken MichaelLueken left a comment

@RatkoVasic-NOAA -

The fundamental tests were run on Jet Rocky8 and all successfully passed:

----------------------------------------------------------------------------------------------------
Experiment name                                                  | Status    | Core hours used 
----------------------------------------------------------------------------------------------------
grid_RRFS_CONUScompact_25km_ics_HRRR_lbcs_RAP_suite_RRFS_v1beta_2  COMPLETE              10.56
nco_grid_RRFS_CONUS_25km_ics_FV3GFS_lbcs_FV3GFS_timeoffset_suite_  COMPLETE              14.39
grid_RRFS_CONUS_25km_ics_FV3GFS_lbcs_FV3GFS_suite_GFS_v15p2_20240  COMPLETE               7.49
grid_RRFS_CONUS_25km_ics_FV3GFS_lbcs_FV3GFS_suite_GFS_v17_p8_plot  COMPLETE              16.27
grid_RRFS_CONUScompact_25km_ics_HRRR_lbcs_HRRR_suite_HRRR_2024022  COMPLETE              34.57
grid_SUBCONUS_Ind_3km_ics_HRRR_lbcs_RAP_suite_WoFS_v0_20240229185  COMPLETE              25.68
grid_RRFS_CONUS_25km_ics_NAM_lbcs_NAM_suite_GFS_v16_2024022918551  COMPLETE              23.57
----------------------------------------------------------------------------------------------------
Total                                                              COMPLETE             132.53

Approving now.

@MichaelLueken
Collaborator

The fundamental tests were also successfully run on Jet using CentOS:

----------------------------------------------------------------------------------------------------
Experiment name                                                  | Status    | Core hours used
----------------------------------------------------------------------------------------------------
grid_RRFS_CONUScompact_25km_ics_HRRR_lbcs_RAP_suite_RRFS_v1beta_2  COMPLETE               9.10
nco_grid_RRFS_CONUS_25km_ics_FV3GFS_lbcs_FV3GFS_timeoffset_suite_  COMPLETE              15.51
grid_RRFS_CONUS_25km_ics_FV3GFS_lbcs_FV3GFS_suite_GFS_v15p2_20240  COMPLETE               8.28
grid_RRFS_CONUS_25km_ics_FV3GFS_lbcs_FV3GFS_suite_GFS_v17_p8_plot  COMPLETE              16.07
grid_RRFS_CONUScompact_25km_ics_HRRR_lbcs_HRRR_suite_HRRR_2024022  COMPLETE              27.90
grid_SUBCONUS_Ind_3km_ics_HRRR_lbcs_RAP_suite_WoFS_v0_20240229203  COMPLETE              21.75
grid_RRFS_CONUS_25km_ics_NAM_lbcs_NAM_suite_GFS_v16_2024022920365  COMPLETE              21.27
----------------------------------------------------------------------------------------------------
Total                                                              COMPLETE             119.88
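The Rocky8 and CentOS summaries above can be cross-checked mechanically. The following is a minimal sketch, not part of the WE2E tooling: `parse_summary` is a hypothetical helper that assumes each data row ends with a status and a core-hour value, as in the tables posted here. It confirms that each total matches the sum of its rows and compares the two totals.

```python
ROCKY8 = """\
grid_RRFS_CONUScompact_25km_ics_HRRR_lbcs_RAP_suite_RRFS_v1beta_2  COMPLETE              10.56
nco_grid_RRFS_CONUS_25km_ics_FV3GFS_lbcs_FV3GFS_timeoffset_suite_  COMPLETE              14.39
grid_RRFS_CONUS_25km_ics_FV3GFS_lbcs_FV3GFS_suite_GFS_v15p2_20240  COMPLETE               7.49
grid_RRFS_CONUS_25km_ics_FV3GFS_lbcs_FV3GFS_suite_GFS_v17_p8_plot  COMPLETE              16.27
grid_RRFS_CONUScompact_25km_ics_HRRR_lbcs_HRRR_suite_HRRR_2024022  COMPLETE              34.57
grid_SUBCONUS_Ind_3km_ics_HRRR_lbcs_RAP_suite_WoFS_v0_20240229185  COMPLETE              25.68
grid_RRFS_CONUS_25km_ics_NAM_lbcs_NAM_suite_GFS_v16_2024022918551  COMPLETE              23.57
Total                                                              COMPLETE             132.53
"""

CENTOS = """\
grid_RRFS_CONUScompact_25km_ics_HRRR_lbcs_RAP_suite_RRFS_v1beta_2  COMPLETE               9.10
nco_grid_RRFS_CONUS_25km_ics_FV3GFS_lbcs_FV3GFS_timeoffset_suite_  COMPLETE              15.51
grid_RRFS_CONUS_25km_ics_FV3GFS_lbcs_FV3GFS_suite_GFS_v15p2_20240  COMPLETE               8.28
grid_RRFS_CONUS_25km_ics_FV3GFS_lbcs_FV3GFS_suite_GFS_v17_p8_plot  COMPLETE              16.07
grid_RRFS_CONUScompact_25km_ics_HRRR_lbcs_HRRR_suite_HRRR_2024022  COMPLETE              27.90
grid_SUBCONUS_Ind_3km_ics_HRRR_lbcs_RAP_suite_WoFS_v0_20240229203  COMPLETE              21.75
grid_RRFS_CONUS_25km_ics_NAM_lbcs_NAM_suite_GFS_v16_2024022920365  COMPLETE              21.27
Total                                                              COMPLETE             119.88
"""


def parse_summary(text):
    """Return {experiment name: core hours}, including the 'Total' row.

    Hypothetical helper: keeps only lines of the form
    '<name> <status> <number>' and ignores rules and the header.
    """
    hours = {}
    for line in text.splitlines():
        parts = line.split()
        if len(parts) != 3:
            continue
        name, _status, value = parts
        try:
            hours[name] = float(value)
        except ValueError:
            continue
    return hours


def check_total(hours, tolerance=0.01):
    """True if the reported total matches the sum of the experiment rows."""
    rows = [v for k, v in hours.items() if k != "Total"]
    return abs(sum(rows) - hours["Total"]) < tolerance


rocky8 = parse_summary(ROCKY8)
centos = parse_summary(CENTOS)
ratio = rocky8["Total"] / centos["Total"]
print(f"Rocky8 total / CentOS total = {ratio:.3f}")
```

The experiment names are truncated and carry different timestamps in each run, so the sketch compares totals rather than matching rows by name.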

@EdwardSnyder-NOAA
Collaborator

Built the SRW App on Rocky 8 using the changes from this PR and ensured the changes worked by running this case: /lfs4/HFIP/hfv3gfs/Edward.Snyder/PR_1045/expt_dirs/grid_RRFS_CONUS_25km_ics_FV3GFS_lbcs_FV3GFS_suite_GFS_v15p2

@MichaelLueken MichaelLueken added the run_we2e_coverage_tests Run the coverage set of SRW end-to-end tests label Mar 1, 2024
@natalie-perlin
Collaborator

Fundamental tests ran successfully on Jet (xjet):

All 7 experiments finished
Calculating core-hour usage and printing final summary
----------------------------------------------------------------------------------------------------
Experiment name                                                  | Status    | Core hours used 
----------------------------------------------------------------------------------------------------
grid_RRFS_CONUScompact_25km_ics_HRRR_lbcs_RAP_suite_RRFS_v1beta_2  COMPLETE               9.90
nco_grid_RRFS_CONUS_25km_ics_FV3GFS_lbcs_FV3GFS_timeoffset_suite_  COMPLETE              13.67
grid_RRFS_CONUS_25km_ics_FV3GFS_lbcs_FV3GFS_suite_GFS_v15p2_20240  COMPLETE               7.12
grid_RRFS_CONUS_25km_ics_FV3GFS_lbcs_FV3GFS_suite_GFS_v17_p8_plot  COMPLETE              16.18
grid_RRFS_CONUScompact_25km_ics_HRRR_lbcs_HRRR_suite_HRRR_2024030  COMPLETE              30.38
grid_SUBCONUS_Ind_3km_ics_HRRR_lbcs_RAP_suite_WoFS_v0_20240301215  COMPLETE              22.14
grid_RRFS_CONUS_25km_ics_NAM_lbcs_NAM_suite_GFS_v16_2024030121531  COMPLETE              22.77
----------------------------------------------------------------------------------------------------
Total                                                              COMPLETE             122.16

Detailed summary written to /mnt/lfs4/HFIP/hfv3gfs/Natalie.Perlin/SRW/expt_dirs/WE2E_summary_20240301223112.txt

Collaborator

@natalie-perlin natalie-perlin left a comment

Ran the tests, with the following changes made to allow testing for Rocky 8 OS:

modulefiles/build_jet_intel.lua:
uncommented
prepend_path("MODULEPATH","/mnt/lfs4/HFIP/hfv3gfs/role.epic/spack-stack/spack-stack-1.5.0/envs/unified-env-rocky8/install/modulefiles/Core")

commented out
prepend_path("MODULEPATH","/mnt/lfs4/HFIP/hfv3gfs/role.epic/spack-stack/spack-stack-1.5.0/envs/unified-env/install/modulefiles/Core")

In ./ush/machine/jet.yaml, set all the partitions to xjet:

PARTITION_DEFAULT: xjet
 ...
PARTITION_FCST: xjet
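The two manual edits described above could be scripted for repeat testing. This is a minimal sketch under stated assumptions, not how the changes were made in this PR: `switch_modulepath` and `set_partitions` are hypothetical helpers, the Lua lines are assumed to be commented with a leading `--`, and the YAML keys are assumed to sit one per line as `KEY: value`.

```python
import re

ROCKY8_MARKER = "unified-env-rocky8/install"
CENTOS_MARKER = "unified-env/install"


def switch_modulepath(lua_text):
    """Uncomment the Rocky 8 MODULEPATH line and comment out the CentOS one."""
    out = []
    for line in lua_text.splitlines():
        stripped = line.lstrip()
        indent = line[: len(line) - len(stripped)]
        if ROCKY8_MARKER in line and stripped.startswith("--"):
            # Drop the Lua comment marker so the line becomes active.
            line = indent + stripped[2:].lstrip()
        elif CENTOS_MARKER in line and not stripped.startswith("--"):
            # Deactivate the CentOS environment path.
            line = indent + "-- " + stripped
        out.append(line)
    return "\n".join(out)


def set_partitions(yaml_text, keys, partition):
    """Set each named key to `partition`, leaving every other line untouched."""
    names = "|".join(re.escape(k) for k in keys)
    pattern = re.compile(r"^([ \t]*(?:%s):[ \t]*)\S+" % names, re.MULTILINE)
    return pattern.sub(lambda m: m.group(1) + partition, yaml_text)


# Usage with the two keys named in the comment above; the starting
# values here are placeholders, not the real contents of jet.yaml.
example_yaml = "PARTITION_DEFAULT: old\nPARTITION_FCST: old\n"
print(set_partitions(example_yaml, ["PARTITION_DEFAULT", "PARTITION_FCST"], "xjet"))
```

Passing the keys explicitly keeps the helper from rewriting partitions that were not meant to change.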

@MichaelLueken
Collaborator

The Hera Jenkins tests failed due to the system coming down yesterday for maintenance. These tests have been requeued.

There was also a failure on Jet. The get_from_HPSS_ics_FV3GFS_lbcs_FV3GFS_fmt_netcdf_2022060112_48h test failed in make_lbcs with an OOM error. Using rocotorewind/rocotoboot allowed this test to pass:

----------------------------------------------------------------------------------------------------
Experiment name                                                  | Status    | Core hours used
----------------------------------------------------------------------------------------------------
community_20240304152101                                           COMPLETE              21.59
custom_ESGgrid_20240304152102                                      COMPLETE              18.35
custom_ESGgrid_Great_Lakes_snow_8km_20240304152104                 COMPLETE              13.40
custom_GFDLgrid_20240304152106                                     COMPLETE               9.45
get_from_HPSS_ics_FV3GFS_lbcs_FV3GFS_fmt_nemsio_2021032018_202403  COMPLETE              10.26
get_from_HPSS_ics_FV3GFS_lbcs_FV3GFS_fmt_netcdf_2022060112_48h_20  COMPLETE              49.66
get_from_HPSS_ics_RAP_lbcs_RAP_20240304152110                      COMPLETE              15.30
grid_RRFS_AK_3km_ics_FV3GFS_lbcs_FV3GFS_suite_HRRR_20240304152111  COMPLETE             222.35
grid_RRFS_CONUS_13km_ics_FV3GFS_lbcs_FV3GFS_suite_GFS_v16_plot_20  COMPLETE              43.97
grid_RRFS_CONUS_25km_ics_FV3GFS_lbcs_FV3GFS_suite_GFS_v15p2_20240  COMPLETE               9.64
grid_RRFS_CONUS_3km_ics_FV3GFS_lbcs_FV3GFS_suite_RRFS_v1beta_2024  COMPLETE             533.34
nco_grid_RRFS_CONUScompact_25km_ics_HRRR_lbcs_RAP_suite_HRRR_2024  COMPLETE              10.62
----------------------------------------------------------------------------------------------------
Total                                                              COMPLETE             957.93
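For readers unfamiliar with the recovery step mentioned above, the sketch below builds the two Rocoto commands for a failed task. The `-w`, `-d`, `-c`, and `-t` options are the standard Rocoto workflow, database, cycle, and task arguments; the file names, cycle string, and the `rocoto_retry_commands` helper are assumptions for illustration, not values taken from this PR.

```python
def rocoto_retry_commands(workflow_xml, database, cycle, task):
    """Return the rewind and boot command lines for one task in one cycle.

    rocotorewind clears the recorded state of the task, and rocotoboot
    then forces it to be submitted again.
    """
    common = ["-w", workflow_xml, "-d", database, "-c", cycle, "-t", task]
    return [["rocotorewind", *common], ["rocotoboot", *common]]


# Hypothetical values: adjust the paths and cycle to the experiment directory.
commands = rocoto_retry_commands(
    workflow_xml="FV3LAM_wflow.xml",  # assumed workflow file name
    database="FV3LAM_wflow.db",       # assumed database file name
    cycle="202206011200",             # YYYYMMDDHHMM, assumed from the test name
    task="make_lbcs",
)
for cmd in commands:
    print(" ".join(cmd))
```

The commands are only printed here; they could be passed to `subprocess.run` from inside the experiment directory once the values are confirmed.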

Once the Hera tests complete, this PR can be merged.

@MichaelLueken
Collaborator

The Hera Intel tests were run on Rocky8 and all tests passed:

----------------------------------------------------------------------------------------------------
Experiment name                                                  | Status    | Core hours used 
----------------------------------------------------------------------------------------------------
custom_ESGgrid_Peru_12km_20240308143348                            COMPLETE              18.07
get_from_HPSS_ics_FV3GFS_lbcs_FV3GFS_fmt_grib2_2019061200_2024030  COMPLETE               6.05
get_from_HPSS_ics_GDAS_lbcs_GDAS_fmt_netcdf_2022040400_ensemble_2  COMPLETE             766.89
get_from_HPSS_ics_HRRR_lbcs_RAP_20240308143351                     COMPLETE              14.39
grid_RRFS_CONUS_25km_ics_FV3GFS_lbcs_FV3GFS_suite_GFS_v15p2_20240  COMPLETE               5.96
grid_RRFS_CONUS_25km_ics_FV3GFS_lbcs_FV3GFS_suite_GFS_v16_plot_20  COMPLETE              12.73
grid_RRFS_CONUS_25km_ics_FV3GFS_lbcs_RAP_suite_RAP_20240308143354  COMPLETE              10.19
grid_RRFS_CONUS_25km_ics_GSMGFS_lbcs_GSMGFS_suite_GFS_v15p2_20240  COMPLETE               6.22
grid_RRFS_CONUS_3km_ics_FV3GFS_lbcs_FV3GFS_suite_GFS_v15p2_202403  COMPLETE             235.54
grid_RRFS_CONUS_3km_ics_FV3GFS_lbcs_FV3GFS_suite_GFS_v16_20240308  COMPLETE             313.52
grid_RRFS_CONUScompact_3km_ics_HRRR_lbcs_RAP_suite_HRRR_202403081  COMPLETE             328.98
pregen_grid_orog_sfc_climo_20240308143359                          COMPLETE               7.09
----------------------------------------------------------------------------------------------------
Total                                                              COMPLETE            1725.63

@MichaelLueken
Collaborator

@RatkoVasic-NOAA -

Unfortunately, while running the WE2E tests with Rocky8 on Hera GNU, the issue that you noted during the UFS apps and components coordination meeting showed up - all tests are failing due to using srun and not being able to find libpmi.so.0 and libpmi2.so.0.
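A quick way to narrow down a failure like this is to check whether the named libraries exist in the directories the job would search. The sketch below is an illustration only, not the diagnostic used in this PR: `ld_library_dirs` and `find_missing_libs` are hypothetical helpers, and they inspect only `LD_LIBRARY_PATH`, whereas the dynamic loader also consults system default locations.

```python
import os
from pathlib import Path


def ld_library_dirs(env=None):
    """Return the non-empty directories listed in LD_LIBRARY_PATH."""
    env = os.environ if env is None else env
    return [d for d in env.get("LD_LIBRARY_PATH", "").split(":") if d]


def find_missing_libs(names, search_dirs):
    """Return the library file names not present in any of `search_dirs`."""
    missing = []
    for name in names:
        if not any((Path(d) / name).exists() for d in search_dirs):
            missing.append(name)
    return missing


if __name__ == "__main__":
    wanted = ["libpmi.so.0", "libpmi2.so.0"]
    print("Not found on LD_LIBRARY_PATH:", find_missing_libs(wanted, ld_library_dirs()))
```

Running this in the same environment as the failing job (after the same modules are loaded) shows whether the libraries are absent or simply not on the search path.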

We will need to hope that the tests are able to run over the weekend on CentOS and no longer sit in the queue.

@MichaelLueken
Collaborator

@RatkoVasic-NOAA -

Given that the Hera GNU tests have been sitting in the queue for days and cannot be run on Rocky8, the successful runs on Hera Intel and the rest of the platforms will be enough to get this work merged.

Since Rocky8 will be the default package of the nodes following today's update, I will go ahead and set the spack-stack path to point at the rocky8 location and change the ush/machine/jet.yaml file to use xjet for the forecast tasks. Once Jet is returned to service, Kris Booker and I will check to ensure that the Jet runner is using one of the Rocky8 front ends, then I will run the Jet tests one last time. Once complete, this PR will get merged.

….lua and set PARTITION_FCST=xjet in ush/machine/jet.yaml
@MichaelLueken
Collaborator

The rerun of the Jenkins tests on Jet had one failure, grid_RRFS_CONUS_25km_ics_FV3GFS_lbcs_FV3GFS_suite_GFS_v15p2. The run_fcst task was failing with:

FATAL from PE 1: compute_qs: saturation vapor pressure table overflow, nbad= 1

None of the changes made in this PR would cause this issue. Using rocotorewind/rocotoboot allowed the failed task to pass:

----------------------------------------------------------------------------------------------------
Experiment name                                                  | Status    | Core hours used 
----------------------------------------------------------------------------------------------------
community_20240312211355                                           COMPLETE              19.64
custom_ESGgrid_20240312211357                                      COMPLETE              18.79
custom_ESGgrid_Great_Lakes_snow_8km_20240312211358                 COMPLETE              14.27
custom_GFDLgrid_20240312211400                                     COMPLETE              10.02
get_from_HPSS_ics_FV3GFS_lbcs_FV3GFS_fmt_nemsio_2021032018_202403  COMPLETE              11.20
get_from_HPSS_ics_FV3GFS_lbcs_FV3GFS_fmt_netcdf_2022060112_48h_20  COMPLETE              57.17
get_from_HPSS_ics_RAP_lbcs_RAP_20240312211404                      COMPLETE              17.22
grid_RRFS_AK_3km_ics_FV3GFS_lbcs_FV3GFS_suite_HRRR_20240312211405  COMPLETE             223.35
grid_RRFS_CONUS_13km_ics_FV3GFS_lbcs_FV3GFS_suite_GFS_v16_plot_20  COMPLETE              40.85
grid_RRFS_CONUS_25km_ics_FV3GFS_lbcs_FV3GFS_suite_GFS_v15p2_20240  COMPLETE               7.38
grid_RRFS_CONUS_3km_ics_FV3GFS_lbcs_FV3GFS_suite_RRFS_v1beta_2024  COMPLETE             496.47
nco_grid_RRFS_CONUScompact_25km_ics_HRRR_lbcs_RAP_suite_HRRR_2024  COMPLETE              10.68
----------------------------------------------------------------------------------------------------
Total                                                              COMPLETE             927.04

Moving forward with merging this PR now.

@MichaelLueken MichaelLueken merged commit 6e6a27f into ufs-community:develop Mar 13, 2024
3 of 5 checks passed
Labels
enhancement New feature or request run_we2e_coverage_tests Run the coverage set of SRW end-to-end tests
Development

Successfully merging this pull request may close these issues.

Test SRW app with new OS on Jet (Rocky 8)
4 participants