[develop] Add Gaea C5 to supported platforms #898

Merged: 20 commits merged into ufs-community:develop on Sep 27, 2023

Conversation

natalie-perlin
Collaborator

@natalie-perlin natalie-perlin commented Sep 5, 2023

Adds modulefiles and other configuration files that complete the port of the SRW App to the Gaea C5 system.

The software stacks used for testing are based on hdf5/1.14.0 and netcdf/4.9.2, similar to those used in #889.

DESCRIPTION OF CHANGES:

Add Gaea C5 at GFDL as NOAA RDHPCS supported system

Type of change

  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to not work as expected)
  • This change requires a documentation update

TESTS CONDUCTED:

All fundamental tests pass successfully on Gaea_c5
All comprehensive tests pass except one (nco_grid_RRFS_CONUScompact_25km_ics_HRRR_lbcs_RAP_suite_HRRR) that fails on some other platforms as well.

  • hera.intel
  • orion.intel
  • gaea_c5.intel
  • gaea.intel
  • jet.intel
  • wcoss2.intel
  • NOAA Cloud (indicate which platform)
  • Jenkins

DEPENDENCIES:

Depends on #889 - MERGED

DOCUMENTATION:

ISSUE:

Fixes issue #886

CHECKLIST

  • My code follows the style guidelines in the Contributor's Guide
  • I have performed a self-review of my own code using the Code Reviewer's Guide
  • I have commented my code, particularly in hard-to-understand areas
  • My changes need updates to the documentation. I have made corresponding changes to the documentation
  • My changes do not require updates to the documentation (explain).
  • My changes generate no new warnings
  • New and existing tests pass with my changes
  • Any dependent changes have been merged and published

LABELS (optional):

  • Work In Progress
  • bug
  • enhancement
  • documentation
  • release
  • high priority
  • run_ci
  • run_we2e_fundamental_tests
  • run_we2e_comprehensive_tests
  • Needs Jet test
  • Needs Hera test
  • Needs Orion test
  • help wanted

CONTRIBUTORS (optional):

@RatkoVasic-NOAA - thank you for your contribution!

WE2E_gaea_c5_fundamental_summary.txt

Collaborator

@MichaelLueken MichaelLueken left a comment

@natalie-perlin - I was able to successfully clone your fork, check out the develop_gaea_c5 branch, build the SRW App, and submit the fundamental tests. However, I should note that source etc/lmod-setup.csh gaea_c5 failed because there is no file named Lmod_init_C5.csh in /lustre/f2/dev/role.epic/contrib. I also found it interesting that I had to load python before I could use ./manage_externals/checkout_externals.

Similar to what you are encountering, the tests are failing in the make_sfc_climo task. I haven't seen this particular NetCDF error before, but looking up the error message:

FATAL ERROR: ERROR IN NF90_CREATE: Permission denied
 STOP.

it looks like we don't have permission to create the necessary NetCDF file. It isn't clear to me why this would be the case, unless make_sfc_climo is attempting to create a file in the EPIC role account space, rather than in my own local directory.

Since sfc_climo_gen.fd is in UFS_UTILS, is it possible that we need to do something similar there as we had to do with sorc/ufs-weather-model/cmake?
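
One quick way to test the permission hypothesis is to check whether the directory the task writes to is writable by the current user. The sketch below is only illustrative: check_writable is a hypothetical helper, and the scratch directory stands in for the real make_sfc_climo working directory, whose path is not given in this thread.

```shell
#!/usr/bin/env bash
# Hypothetical helper: report whether the current user can create files in a directory.
check_writable() {
  local dir="$1"
  if [ -d "$dir" ] && [ -w "$dir" ]; then
    echo "writable: $dir"
  else
    echo "NOT writable: $dir"
    return 1
  fi
}

# Scratch directory used as a stand-in; substitute the task's actual output directory.
demo_dir="$(mktemp -d)"
check_writable "$demo_dir"

# A missing (or read-only) directory is reported as not writable, which is the
# situation in which a file-creation call such as NF90_CREATE can be denied.
check_writable "$demo_dir/missing" || echo "file creation would fail here"
```

If the directory reported as not writable sits under another account's space, that would point to a path configuration problem rather than a NetCDF library problem.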

@natalie-perlin
Collaborator Author

@MichaelLueken - fixed the file Lmod_init_C5.csh

@MichaelLueken MichaelLueken added the enhancement New feature or request label Sep 20, 2023
Collaborator

@MichaelLueken MichaelLueken left a comment

@natalie-perlin - Thanks for your work on porting the SRW App to Gaea C5! Your branch was cloned and built using the Jenkins build script (.cicd/scripts/srw_build.sh) and the coverage.gaea_c5 test suite was run using the Jenkins test script (.cicd/scripts/srw_test.sh). All coverage tests successfully passed:

----------------------------------------------------------------------------------------------------
Experiment name                                                  | Status    | Core hours used 
----------------------------------------------------------------------------------------------------
community                                                          COMPLETE              43.50
custom_ESGgrid_NewZealand_3km                                      COMPLETE              48.17
grid_RRFS_CONUScompact_13km_ics_HRRR_lbcs_RAP_suite_RRFS_v1beta    COMPLETE              26.46
grid_RRFS_CONUS_13km_ics_FV3GFS_lbcs_FV3GFS_suite_RAP              COMPLETE              30.58
grid_RRFS_CONUS_13km_ics_FV3GFS_lbcs_FV3GFS_suite_HRRR             COMPLETE              30.32
grid_RRFS_CONUS_3km_ics_FV3GFS_lbcs_FV3GFS_suite_GFS_v15_thompson  COMPLETE             312.60
grid_RRFS_CONUScompact_25km_ics_HRRR_lbcs_HRRR_suite_HRRR          COMPLETE              30.80
grid_RRFS_CONUScompact_3km_ics_HRRR_lbcs_RAP_suite_RRFS_v1beta     COMPLETE             274.60
grid_SUBCONUS_Ind_3km_ics_RAP_lbcs_RAP_suite_RRFS_v1beta_plot      COMPLETE              17.23
nco_ensemble                                                       COMPLETE              99.47
nco_grid_RRFS_CONUS_3km_ics_FV3GFS_lbcs_FV3GFS_suite_GFS_v15_thom  COMPLETE             307.34
----------------------------------------------------------------------------------------------------
Total                                                              COMPLETE            1221.07

Approving this PR now.

Collaborator

@RatkoVasic-NOAA RatkoVasic-NOAA left a comment

Tested on C5.

@MichaelLueken MichaelLueken added the run_we2e_coverage_tests Run the coverage set of SRW end-to-end tests label Sep 21, 2023
@MichaelLueken
Collaborator

@natalie-perlin - While submitting the Jenkins tests, I noted that the gaea_c5 runner wasn't being submitted. Looking at the pipeline.log file, I saw that there is no gaea_c5 label. I reached out to Kris and the platform team on Slack and Kris noted that the Gaea C5 label should be gaea-c5. If you could go through and replace the underscore with a hyphen, I would greatly appreciate it! Please let me know if you need assistance, and I will attempt to make these changes to your branch. Thank you very much!

@MichaelLueken
Collaborator

The WE2E coverage tests were successfully run on Derecho:

----------------------------------------------------------------------------------------------------
Experiment name                                                  | Status    | Core hours used 
----------------------------------------------------------------------------------------------------
custom_ESGgrid_IndianOcean_6km                                     COMPLETE              21.93
grid_RRFS_CONUS_13km_ics_FV3GFS_lbcs_FV3GFS_suite_GFS_v16_plot     COMPLETE              35.76
grid_RRFS_CONUS_25km_ics_NAM_lbcs_NAM_suite_GFS_v16                COMPLETE              42.79
grid_RRFS_CONUScompact_13km_ics_HRRR_lbcs_RAP_suite_HRRR           COMPLETE              26.37
grid_RRFS_CONUScompact_25km_ics_HRRR_lbcs_RAP_suite_RRFS_v1beta    COMPLETE              16.38
grid_SUBCONUS_Ind_3km_ics_HRRR_lbcs_HRRR_suite_HRRR                COMPLETE              38.75
nco_grid_RRFS_CONUS_25km_ics_FV3GFS_lbcs_FV3GFS_timeoffset_suite_  COMPLETE              22.16
pregen_grid_orog_sfc_climo                                         COMPLETE              13.49
specify_template_filenames                                         COMPLETE              13.97
----------------------------------------------------------------------------------------------------
Total                                                              COMPLETE             231.60

@natalie-perlin
Collaborator Author

@MichaelLueken @RatkoVasic-NOAA - all instances of "gaea_c5" have been renamed to "gaea-c5" (in file names, directory names, and strings inside files).
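
For reference, a rename of this kind can be sketched as two passes: first rewrite the string inside files, then rename the files and directories themselves. This is not necessarily how the change was made in this PR. The scratch tree and file names below are made up for the demonstration, and sed -i without a suffix assumes GNU sed.

```shell
#!/usr/bin/env bash
set -euo pipefail

# Scratch tree standing in for a repository checkout (names are illustrative only).
work="$(mktemp -d)"
mkdir -p "$work/modulefiles"
printf 'platform: gaea_c5\n' > "$work/modulefiles/build_gaea_c5_demo.txt"
printf 'gaea\ngaea_c5\n' > "$work/machines.txt"

# Pass 1: rewrite the string inside every file that contains it.
grep -rlF 'gaea_c5' "$work" | while IFS= read -r f; do
  sed -i 's|gaea_c5|gaea-c5|g' "$f"
done

# Pass 2: rename files and directories whose names contain it.
# -depth visits children before parents, so parent renames do not break child paths.
find "$work" -depth -name '*gaea_c5*' | while IFS= read -r p; do
  mv "$p" "$(dirname "$p")/$(basename "$p" | sed 's|gaea_c5|gaea-c5|g')"
done

# Confirm that no occurrences remain in file contents.
remaining="$(grep -rF 'gaea_c5' "$work" || true)"
if [ -z "$remaining" ]; then
  echo "no gaea_c5 references remain"
fi
```

In a real checkout, git mv would be preferable to mv so the renames are staged, and the search should exclude the .git directory.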

@MichaelLueken
Collaborator

@natalie-perlin - Unfortunately, following the merge of PR #911, there is now a conflict with this PR in .cicd/Jenkinsfile, tests/build.sh, and ush/valid_param_vals.yaml. Please merge this morning's hash update (87dbf19) to your develop_gaea_c5 branch, address the three conflicts, then I should be able to requeue the Jenkins tests for this PR.

@MichaelLueken
Collaborator

@natalie-perlin - Thank you for merging the latest HEAD into your branch and correcting the conflicts! I have requeued the Jenkins tests for this PR and will let you know if there are any issues.

@MichaelLueken
Collaborator

The Jenkins Hera Intel tests have completed:

----------------------------------------------------------------------------------------------------
Experiment name                                                  | Status    | Core hours used
----------------------------------------------------------------------------------------------------
custom_ESGgrid_Central_Asia_3km                                    DEAD                   6.63
get_from_HPSS_ics_FV3GFS_lbcs_FV3GFS_fmt_grib2_2019061200          COMPLETE               6.30
get_from_HPSS_ics_GDAS_lbcs_GDAS_fmt_netcdf_2022040400_ensemble_2  COMPLETE             757.90
get_from_HPSS_ics_HRRR_lbcs_RAP                                    COMPLETE              14.01
grid_RRFS_CONUS_25km_ics_FV3GFS_lbcs_FV3GFS_suite_GFS_v15p2        COMPLETE               6.30
grid_RRFS_CONUS_25km_ics_FV3GFS_lbcs_FV3GFS_suite_GFS_v16_plot     COMPLETE              13.01
grid_RRFS_CONUS_25km_ics_FV3GFS_lbcs_RAP_suite_RAP                 COMPLETE              10.13
grid_RRFS_CONUS_25km_ics_GSMGFS_lbcs_GSMGFS_suite_GFS_v15p2        COMPLETE               6.65
grid_RRFS_CONUS_3km_ics_FV3GFS_lbcs_FV3GFS_suite_GFS_v15p2         COMPLETE             231.40
grid_RRFS_CONUS_3km_ics_FV3GFS_lbcs_FV3GFS_suite_GFS_v16           COMPLETE             303.34
grid_RRFS_CONUScompact_3km_ics_HRRR_lbcs_RAP_suite_HRRR            COMPLETE             324.52
pregen_grid_orog_sfc_climo                                         COMPLETE               8.07
----------------------------------------------------------------------------------------------------
Total                                                              DEAD                1688.26

A rerun on the custom_ESGgrid_Central_Asia_3km test shows successful completion:

----------------------------------------------------------------------------------------------------
Experiment name                                                  | Status    | Core hours used 
----------------------------------------------------------------------------------------------------
custom_ESGgrid_Central_Asia_3km                                    COMPLETE              24.99
----------------------------------------------------------------------------------------------------
Total                                                              COMPLETE              24.99

The Gaea and Hercules tests have successfully completed. The Jet tests are still running. The Gaea C5 and Orion tests have been requeued. Once the tests complete, I will move forward with merging this PR.
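
When a summary like the Hera Intel one above is long, the experiments that need a rerun can be pulled out with a short filter. The following is a sketch only: it embeds a few rows copied from that table, and it assumes each data row has exactly three whitespace-separated fields (name, status, core hours).

```shell
#!/usr/bin/env bash
# Sample rows copied from the Hera Intel summary; in practice the input would
# be the summary text produced by the test run.
summary="$(cat <<'EOF'
----------------------------------------------------------------------------------------------------
Experiment name                                                  | Status    | Core hours used
----------------------------------------------------------------------------------------------------
custom_ESGgrid_Central_Asia_3km                                    DEAD                   6.63
get_from_HPSS_ics_HRRR_lbcs_RAP                                    COMPLETE              14.01
pregen_grid_orog_sfc_climo                                         COMPLETE               8.07
----------------------------------------------------------------------------------------------------
Total                                                              DEAD                1688.26
EOF
)"

# Keep three-field rows, skip the Total line, and print the names of
# experiments whose status is anything other than COMPLETE.
not_complete="$(printf '%s\n' "$summary" |
  awk 'NF == 3 && $1 != "Total" && $2 != "COMPLETE" { print $1 }')"

echo "Not complete: ${not_complete:-none}"
```

The header and separator lines are skipped automatically because they do not have three fields.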

@MichaelLueken
Collaborator

@natalie-perlin - The Jenkins tests successfully passed on Hera GNU, Jet, and Orion. The Gaea C5 Functional Workflow Task Tests failed with the following message:

Running: sbatch -A epic --parsable /lustre/f2/dev/role.epic/jenkins/workspace/fs-srweather-app_pipeline_PR-898/.cicd/scripts/sbatch_srw_ftest.sh

sbatch: error: Batch job submission failed: Invalid qos specification

It looks like you will also need to add a line like the following:

sed -i 's|qos=batch|qos=windfall|g' ${WORKSPACE}/.cicd/scripts/${workflow_cmd}_srw_ftest.sh

to the Gaea C5 section of .cicd/scripts/wrapper_srw_ftest.sh, with qos=normal in place of qos=windfall. Once this change is in, I'll kick off the Jenkins tests for Gaea C5 again.
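
To illustrate the effect of that substitution with qos=normal, here is a minimal sketch run against a throwaway file. The #SBATCH lines are assumed for the demonstration and are not copied from the real sbatch_srw_ftest.sh; sed -i without a suffix assumes GNU sed.

```shell
#!/usr/bin/env bash
# Stand-in for the batch script; its contents are assumed for illustration.
script="$(mktemp)"
cat > "$script" <<'EOF'
#!/usr/bin/env bash
#SBATCH --account=epic
#SBATCH --qos=batch
EOF

# Same substitution pattern as in the thread, pointed at the stand-in file.
sed -i 's|qos=batch|qos=normal|g' "$script"

# Show the resulting QOS line.
grep -- 'qos=' "$script"
```

After the substitution, the only QOS line left in the stand-in file requests qos=normal, which is the change needed to avoid the "Invalid qos specification" rejection from sbatch.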

@natalie-perlin
Collaborator Author

natalie-perlin commented Sep 27, 2023

@MichaelLueken - done. I updated the Gaea C5 section of .cicd/scripts/wrapper_srw_ftest.sh by adding the line sed -i 's|qos=batch|qos=normal|g' ${WORKSPACE}/.cicd/scripts/${workflow_cmd}_srw_ftest.sh

@MichaelLueken
Collaborator

@natalie-perlin - Thanks! Requeuing the Jenkins tests for Gaea C5 now. I'll let you know if any other issues arise.

@MichaelLueken
Collaborator

The latest Jenkins tests successfully passed on Gaea C5:

----------------------------------------------------------------------------------------------------
Experiment name                                                  | Status    | Core hours used 
----------------------------------------------------------------------------------------------------
community                                                          COMPLETE              43.58
custom_ESGgrid_NewZealand_3km                                      COMPLETE              49.38
grid_RRFS_CONUScompact_13km_ics_HRRR_lbcs_RAP_suite_RRFS_v1beta    COMPLETE              27.39
grid_RRFS_CONUS_13km_ics_FV3GFS_lbcs_FV3GFS_suite_RAP              COMPLETE              30.87
grid_RRFS_CONUS_13km_ics_FV3GFS_lbcs_FV3GFS_suite_HRRR             COMPLETE              31.52
grid_RRFS_CONUS_3km_ics_FV3GFS_lbcs_FV3GFS_suite_GFS_v15_thompson  COMPLETE             318.02
grid_RRFS_CONUScompact_25km_ics_HRRR_lbcs_HRRR_suite_HRRR          COMPLETE              31.69
grid_RRFS_CONUScompact_3km_ics_HRRR_lbcs_RAP_suite_RRFS_v1beta     COMPLETE             277.13
grid_SUBCONUS_Ind_3km_ics_RAP_lbcs_RAP_suite_RRFS_v1beta_plot      COMPLETE              17.73
nco_ensemble                                                       COMPLETE             102.79
nco_grid_RRFS_CONUS_3km_ics_FV3GFS_lbcs_FV3GFS_suite_GFS_v15_thom  COMPLETE             312.02
----------------------------------------------------------------------------------------------------
Total                                                              COMPLETE            1242.12

Because Orion and Hercules are down, the jobs that the manual submission of the Jenkins pipeline kicked off on those machines were aborted.

Moving forward with merging this PR now.

@MichaelLueken MichaelLueken merged commit 0b1b070 into ufs-community:develop Sep 27, 2023
3 of 5 checks passed
@BruceKropp-Raytheon
Contributor

I can confirm that the code in this PR can build SRW on gaea-c5 and produce an E2E plot.

Labels
enhancement New feature or request run_we2e_coverage_tests Run the coverage set of SRW end-to-end tests
Development

Successfully merging this pull request may close these issues.

Add Gaea C5 to supported platforms, as Tier-1 system
4 participants