
[develop] MET_verification: use modules met, metplus #826

Merged

Conversation

@natalie-perlin (Collaborator) commented Jun 8, 2023

DESCRIPTION OF CHANGES:

  • Use the met and metplus modules installed in the software stack for MET verification tasks, rather than explicitly set paths in machine files or config.yaml files. Use the environment variables METPLUS_PATH and MET_INSTALL_DIR set by the modulefiles. Retire the MET_BIN_EXEC variable in favor of the standard "bin" directory. Modules met/10.1.2 and metplus/4.1.3 are used; the newer versions, met/11.0.2 and metplus/5.0.2, require additional code changes.

  • Updated the Orion stack location due to a mandatory transition to a new role account and space under /work/noaa/epic/role-epic/. The new stack is under /work/noaa/epic/role-epic/contrib/orion/hpc-stack/intel-2022.1.2/. SRW fundamental tests have been run successfully with the updated stack, as have GSI regression tests and UFS-WM regression tests. PR #1846 to the weather model repo has been submitted: Update orion stack path to use a new role-epic account location ufs-weather-model#1846.
    Miniconda3 with all the environments has been installed in a new Orion location, under /work/noaa/epic/role-epic/contrib/orion/miniconda3/, and the modulefiles have been updated accordingly.
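The path-resolution change described above can be sketched as follows. This is a hypothetical illustration (the /path/to/stack paths are placeholders, not the real stack locations), assuming only that the met/metplus modulefiles export MET_INSTALL_DIR and METPLUS_PATH as the PR describes:

```python
import os

# In a real run, "module load met/10.1.2 metplus/4.1.3" exports these;
# the values below are placeholder paths for illustration only.
os.environ.setdefault("MET_INSTALL_DIR", "/path/to/stack/met/10.1.2")
os.environ.setdefault("METPLUS_PATH", "/path/to/stack/metplus/4.1.3")

# The retired MET_BIN_EXEC variable is replaced by the standard bin/
# directory under the module-provided install prefix:
met_bin = os.path.join(os.environ["MET_INSTALL_DIR"], "bin")
grid_stat = os.path.join(met_bin, "grid_stat")
print(grid_stat)  # /path/to/stack/met/10.1.2/bin/grid_stat
```

With this pattern, no machine file or config.yaml needs to hard-code the install prefix; the modulefile is the single source of truth for tool locations.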

Type of change

  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to not work as expected)
  • This change requires a documentation update

TESTS CONDUCTED:

  • hera.intel
  • hera.gnu
  • orion.intel
  • cheyenne.intel
  • cheyenne.gnu (partially)
  • gaea.intel
  • jet.intel
  • wcoss2.intel
  • NOAA Cloud (indicate which platform)
  • Jenkins
  • fundamental test suite
  • comprehensive tests (specify which if a subset was used)

DEPENDENCIES:

This PR resolves issue #863.

DOCUMENTATION:

Preliminary changes to documentation in files:
./docs/UsersGuide/source/ConfigWorkflow.rst
./docs/UsersGuide/source/RunSRW.rst
./docs/UsersGuide/source/VXCases.rst

ISSUE:

CHECKLIST

  • My code follows the style guidelines in the Contributor's Guide
  • I have performed a self-review of my own code using the Code Reviewer's Guide
  • I have commented my code, particularly in hard-to-understand areas
  • My changes need updates to the documentation. I have made corresponding changes to the documentation
  • My changes do not require updates to the documentation (explain).
  • My changes generate no new warnings
  • New and existing tests pass with my changes
  • Any dependent changes have been merged and published

LABELS (optional):

A Code Manager needs to add the following labels to this PR:

  • Work In Progress
  • bug
  • enhancement
  • documentation
  • release
  • high priority
  • run_ci
  • run_we2e_fundamental_tests
  • run_we2e_comprehensive_tests
  • Needs Cheyenne test
  • Needs Jet test
  • Needs Hera test
  • [x] Needs Orion test
  • help wanted

CONTRIBUTORS (optional):

@MichaelLueken added the documentation, enhancement, Work in Progress, Needs Cheyenne test, Needs Jet test, and Needs Orion test labels on Jun 8, 2023
@gsketefian (Collaborator)

@natalie-perlin Thanks for switching MET/METplus to the official software stacks. Couple of questions:

  1. Have the 4 verification tests (in tests/WE2E/test_configs/verification) passed with the slightly newer versions you're using in this PR?
  2. Does the software stack on Hera have the METplus 5.0 coordinated release installed (this one)? That includes MET 11.0.2 and METplus 5.0.2. Just curious since there are certain features we (the DTC) would like to implement that are only available in newer versions.

Thanks.

@MichaelLueken (Collaborator)

@natalie-perlin Since this PR will correct the failure of the grid_RRFS_CONUS_25km_ics_NAM_lbcs_NAM_suite_GFS_v16 fundamental WE2E test on Cheyenne GNU, I will go ahead and close PR #820 at this time.

@MichaelLueken added the run_we2e_coverage_tests (Run the coverage set of SRW end-to-end tests) label on Aug 4, 2023
@MichaelLueken (Collaborator)

The verification WE2E tests were run on Orion and all tests successfully passed:

----------------------------------------------------------------------------------------------------
Experiment name                                                  | Status    | Core hours used 
----------------------------------------------------------------------------------------------------
MET_ensemble_verification                                          COMPLETE              18.86
MET_ensemble_verification_only_vx                                  COMPLETE               0.91
MET_ensemble_verification_only_vx_time_lag                         COMPLETE               3.73
MET_verification                                                   COMPLETE               9.49
MET_verification_only_vx                                           COMPLETE               0.12
----------------------------------------------------------------------------------------------------
Total                                                              COMPLETE              33.11

The Cheyenne Intel WE2E coverage tests were manually run on Hera and all tests successfully passed:

----------------------------------------------------------------------------------------------------
Experiment name                                                  | Status    | Core hours used 
----------------------------------------------------------------------------------------------------
custom_GFDLgrid__GFDLgrid_USE_NUM_CELLS_IN_FILENAMES_eq_FALSE      COMPLETE              11.21
grid_RRFS_CONUS_13km_ics_FV3GFS_lbcs_FV3GFS_suite_GFS_v16_plot     COMPLETE              29.91
grid_RRFS_CONUS_25km_ics_NAM_lbcs_NAM_suite_GFS_v16                COMPLETE              19.55
grid_RRFS_CONUScompact_13km_ics_HRRR_lbcs_RAP_suite_HRRR           COMPLETE              25.42
grid_RRFS_CONUScompact_25km_ics_HRRR_lbcs_RAP_suite_RRFS_v1beta    COMPLETE               8.58
grid_SUBCONUS_Ind_3km_ics_HRRR_lbcs_HRRR_suite_HRRR                COMPLETE              16.47
pregen_grid_orog_sfc_climo                                         COMPLETE               7.58
specify_template_filenames                                         COMPLETE               7.31
----------------------------------------------------------------------------------------------------
Total                                                              COMPLETE             126.03

And the Cheyenne GNU WE2E coverage tests were manually run on Hera and all tests successfully passed:

----------------------------------------------------------------------------------------------------
Experiment name                                                  | Status    | Core hours used 
----------------------------------------------------------------------------------------------------
grid_CONUS_25km_GFDLgrid_ics_FV3GFS_lbcs_FV3GFS_suite_GFS_v16      COMPLETE              20.16
grid_RRFS_CONUS_13km_ics_FV3GFS_lbcs_FV3GFS_suite_RRFS_v1beta      COMPLETE             225.99
grid_RRFS_CONUS_25km_ics_FV3GFS_lbcs_FV3GFS_suite_GFS_2017_gfdlmp  COMPLETE             109.29
grid_RRFS_CONUS_25km_ics_FV3GFS_lbcs_FV3GFS_suite_GFS_v17_p8_plot  COMPLETE              27.27
grid_RRFS_CONUS_25km_ics_FV3GFS_lbcs_FV3GFS_suite_HRRR             COMPLETE              36.95
grid_RRFS_CONUScompact_25km_ics_FV3GFS_lbcs_FV3GFS_suite_GFS_v16   COMPLETE              22.26
grid_RRFS_NA_13km_ics_FV3GFS_lbcs_FV3GFS_suite_RAP                 COMPLETE             318.31
grid_SUBCONUS_Ind_3km_ics_NAM_lbcs_NAM_suite_GFS_v16               COMPLETE              49.60
specify_EXTRN_MDL_SYSBASEDIR_ICS_LBCS                              COMPLETE              10.10
----------------------------------------------------------------------------------------------------
Total                                                              COMPLETE             819.93

Awaiting completion of Jenkins tests now.

@MichaelLueken (Collaborator)

@natalie-perlin -

The Hera GNU WE2E coverage test, MET_ensemble_verification_only_vx_time_lag, is failing in both run_MET_GenEnsProd_vx_SFC and run_MET_GenEnsProd_vx_UPA tasks due to hitting the time limit:

slurmstepd: error: *** JOB 47745246 ON h5c52 CANCELLED AT 2023-08-04T19:30:59 DUE TO TIME LIMIT ***

While changing from static libraries to HPC-stack modules, are you aware of any reason why these two tasks seem to now run longer for Hera GNU? I'm seeing this for both manual and automated Jenkins testing. Please see /scratch1/NCEPDEV/stmp2/role.epic/jenkins/workspace/fs-srweather-app_pipeline_PR-826/expt_dirs/MET_ensemble_verification_only_vx_time_lag for the Jenkins working directory associated with this PR.

@natalie-perlin (Collaborator, Author) commented Aug 4, 2023

@MichaelLueken -
I'll take a look at the Hera GNU runs; I haven't run them with the GNU compiler yet.
The Hera GNU compilers are known to be slower, but there might be another reason for the failure.

@MichaelLueken (Collaborator)

I was able to successfully run all verification tests on Hera GNU, with the exception of MET_ensemble_verification_only_vx_time_lag:

Experiment MET_verification_only_vx is COMPLETE
Took 0:03:35.738463; will no longer monitor.
Experiment MET_ensemble_verification_only_vx is COMPLETE
Took 0:17:49.131284; will no longer monitor.
Experiment MET_verification is COMPLETE
Took 0:26:45.456390; will no longer monitor.
Experiment MET_ensemble_verification is COMPLETE
Took 0:47:46.819661; will no longer monitor.

Since the rest of the tests run without issue, I'm inclined to agree that there might be an issue with the current MET_ensemble_verification_only_vx_time_lag test itself. I'll try to dig into this as well.

@natalie-perlin (Collaborator, Author)

@MichaelLueken - the issue with Hera GNU was with the installation of these older modules. It appears to be resolved now that I have reinstalled met/10.1.2 and metplus/4.1.3.
I am manually running the MET verification tests on Hera; 4 out of 5 have finished, and one is still running:

Writing information for all experiments to WE2E_tests_20230807171118.yaml
Checking tests available for monitoring...
Starting experiment MET_verification running
Updating database for experiment MET_verification
Starting experiment MET_verification_only_vx running
Updating database for experiment MET_verification_only_vx
Starting experiment MET_ensemble_verification running
Updating database for experiment MET_ensemble_verification
Starting experiment MET_ensemble_verification_only_vx running
Updating database for experiment MET_ensemble_verification_only_vx
Starting experiment MET_ensemble_verification_only_vx_time_lag running
Updating database for experiment MET_ensemble_verification_only_vx_time_lag
Setup complete; monitoring 5 experiments
Use ctrl-c to pause job submission/monitoring
Experiment MET_verification_only_vx is COMPLETE
Took 0:03:45.488077; will no longer monitor.
Experiment MET_ensemble_verification_only_vx is COMPLETE
Took 0:17:49.627754; will no longer monitor.
Experiment MET_verification is COMPLETE
Took 0:29:00.411291; will no longer monitor.
Experiment MET_ensemble_verification is COMPLETE
Took 0:47:51.449840; will no longer monitor.

The last test, MET_ensemble_verification_only_vx_time_lag, is still running, but everything has gone successfully so far:

 (workflow_tools) [Natalie.Perlin@Hera:/scratch1/NCEPDEV/stmp2/Natalie.Perlin/SRW/expt_dirs/MET_ensemble_verification_only_vx_time_lag]$  rocotostat -w FV3LAM_wflow.xml -d FV3LAM_wflow.db -v 10
       CYCLE                    TASK                       JOBID               STATE         EXIT STATUS     TRIES      DURATION
================================================================================================================================
202105050000       run_MET_Pb2nc_obs                    47836209           SUCCEEDED                   0         1          48.0
202105050000    run_MET_PcpCombine_obs_APCP01h                    47836210           SUCCEEDED                   0         1          16.0
202105050000    run_MET_PcpCombine_obs_APCP03h                    47836211           SUCCEEDED                   0         1          15.0
202105050000    run_MET_PcpCombine_obs_APCP06h                    47836212           SUCCEEDED                   0         1          14.0
202105050000    check_post_output_mem001                    47836213           SUCCEEDED                   0         1          12.0
202105050000    check_post_output_mem002                    47836214           SUCCEEDED                   0         1          11.0
202105050000    run_MET_PcpCombine_fcst_APCP01h_mem001                    47836256           SUCCEEDED                   0         1          32.0
202105050000    run_MET_PcpCombine_fcst_APCP01h_mem002                    47836257           SUCCEEDED                   0         1          31.0
202105050000    run_MET_PcpCombine_fcst_APCP03h_mem001                    47836258           SUCCEEDED                   0         1          27.0
202105050000    run_MET_PcpCombine_fcst_APCP03h_mem002                    47836259           SUCCEEDED                   0         1          27.0
202105050000    run_MET_PcpCombine_fcst_APCP06h_mem001                    47836260           SUCCEEDED                   0         1          26.0
202105050000    run_MET_PcpCombine_fcst_APCP06h_mem002                    47836261           SUCCEEDED                   0         1          26.0
202105050000    run_MET_GridStat_vx_APCP01h_mem001                    47836305           SUCCEEDED                   0         1         508.0
202105050000    run_MET_GridStat_vx_APCP01h_mem002                    47836306           SUCCEEDED                   0         1         504.0
202105050000    run_MET_GridStat_vx_APCP03h_mem001                    47836307           SUCCEEDED                   0         1         205.0
202105050000    run_MET_GridStat_vx_APCP03h_mem002                    47836308           SUCCEEDED                   0         1         204.0
202105050000    run_MET_GridStat_vx_APCP06h_mem001                    47836309           SUCCEEDED                   0         1         124.0
202105050000    run_MET_GridStat_vx_APCP06h_mem002                    47836310           SUCCEEDED                   0         1         126.0
202105050000    run_MET_GridStat_vx_REFC_mem001                    47836262           SUCCEEDED                   0         1         527.0
202105050000    run_MET_GridStat_vx_RETOP_mem001                    47836263           SUCCEEDED                   0         1         559.0
202105050000    run_MET_GridStat_vx_REFC_mem002                    47836264           SUCCEEDED                   0         1         531.0
202105050000    run_MET_GridStat_vx_RETOP_mem002                    47836265           SUCCEEDED                   0         1         557.0
202105050000    run_MET_PointStat_vx_SFC_mem001                    47836288           SUCCEEDED                   0         1         221.0
202105050000    run_MET_PointStat_vx_UPA_mem001                    47836289           SUCCEEDED                   0         1         113.0
202105050000    run_MET_PointStat_vx_SFC_mem002                    47836290           SUCCEEDED                   0         1         225.0
202105050000    run_MET_PointStat_vx_UPA_mem002                    47836291           SUCCEEDED                   0         1         113.0
202105050000    run_MET_GenEnsProd_vx_APCP01h                    47836311           SUCCEEDED                   0         1        1210.0
202105050000    run_MET_EnsembleStat_vx_APCP01h                    47836654           SUCCEEDED                   0         1        1321.0
202105050000    run_MET_GenEnsProd_vx_APCP03h                    47836312           SUCCEEDED                   0         1         399.0
202105050000    run_MET_EnsembleStat_vx_APCP03h                    47836392           SUCCEEDED                   0         1         449.0
202105050000    run_MET_GenEnsProd_vx_APCP06h                    47836313           SUCCEEDED                   0         1         205.0
202105050000    run_MET_EnsembleStat_vx_APCP06h                    47836363           SUCCEEDED                   0         1         225.0
202105050000    run_MET_GenEnsProd_vx_REFC                    47836224           SUCCEEDED                   0         1        1404.0
202105050000    run_MET_EnsembleStat_vx_REFC                    47836703           SUCCEEDED                   0         1         444.0
202105050000    run_MET_GenEnsProd_vx_RETOP                    47836225           SUCCEEDED                   0         1        1407.0
202105050000    run_MET_EnsembleStat_vx_RETOP                    47836705           SUCCEEDED                   0         1         513.0
202105050000    run_MET_GenEnsProd_vx_SFC                    47838115             RUNNING                   -         1           0.0
202105050000    run_MET_EnsembleStat_vx_SFC                           -                   -                   -         -             -
202105050000    run_MET_GenEnsProd_vx_UPA                    47838116             RUNNING                   -         1           0.0
202105050000    run_MET_EnsembleStat_vx_UPA                           -                   -                   -         -             -
202105050000    run_MET_GridStat_vx_ensmean_APCP01h                    47836655           SUCCEEDED                   0         1         447.0
202105050000    run_MET_GridStat_vx_ensmean_APCP03h                    47836393           SUCCEEDED                   0         1         151.0
202105050000    run_MET_GridStat_vx_ensmean_APCP06h                    47836364           SUCCEEDED                   0         1          79.0
202105050000    run_MET_GridStat_vx_ensprob_APCP01h                    47836656           SUCCEEDED                   0         1         902.0
202105050000    run_MET_GridStat_vx_ensprob_APCP03h                    47836394           SUCCEEDED                   0         1         300.0
202105050000    run_MET_GridStat_vx_ensprob_APCP06h                    47836365           SUCCEEDED                   0         1         155.0

On Cheyenne, the previous installations of met and metplus for the GNU compilers were fully complete only for met/11.0.2 and metplus/5.0.2; thus I reinstalled met/10.1.2 and metplus/4.1.3 for Cheyenne GNU:

/glade/work/epicufsrt/contrib/hpc-stack/gnu10.1.0
/glade/work/epicufsrt/contrib/hpc-stack/gnu11.2.0

Tests cannot be run there at the moment, though, because the account allocation has been overspent. Please feel free to test if needed.

@MichaelLueken (Collaborator)

@natalie-perlin -

I'm resubmitting the MET_ensemble_verification_only_vx_time_lag test now.

Please let me know whether the two tasks that were still running in your test - run_MET_GenEnsProd_vx_SFC and run_MET_GenEnsProd_vx_UPA - successfully passed. These are the two tasks that seem to time out after an hour.

@natalie-perlin (Collaborator, Author)

@MichaelLueken - yes, these two tasks did indeed time out, even though they appeared to be running fine according to the log files.

I'm resubmitting these tasks with the time limit increased from the original 1:00:00 to 2:00:00, modifying them directly in FV3LAM_wflow.xml.

Comparisons of similar tasks between the Intel-compiler and GNU-compiler runs indeed show a significant time increase, ranging from 2-fold to more than 4-fold. Note, for example, the tasks run_MET_GenEnsProd_vx_REFC, run_MET_EnsembleStat_vx_REFC, run_MET_GenEnsProd_vx_RETOP, and run_MET_EnsembleStat_vx_RETOP:

Hera with Intel compiler:

202105050000    run_MET_GenEnsProd_vx_REFC                    47607444           SUCCEEDED                   0         1         309.0
202105050000    run_MET_EnsembleStat_vx_REFC                    47607576           SUCCEEDED                   0         1         258.0
202105050000    run_MET_GenEnsProd_vx_RETOP                    47607445           SUCCEEDED                   0         1         305.0
202105050000    run_MET_EnsembleStat_vx_RETOP                    47607570           SUCCEEDED                   0         1         276.0

Hera with gnu/9.2 compiler:

202105050000    run_MET_GenEnsProd_vx_REFC                    47836224           SUCCEEDED                   0         1        1404.0
202105050000    run_MET_EnsembleStat_vx_REFC                    47836703           SUCCEEDED                   0         1         444.0
202105050000    run_MET_GenEnsProd_vx_RETOP                    47836225           SUCCEEDED                   0         1        1407.0
202105050000    run_MET_EnsembleStat_vx_RETOP                    47836705           SUCCEEDED                   0         1         513.0

@natalie-perlin (Collaborator, Author) commented Aug 7, 2023

The tasks run_MET_GenEnsProd_vx_SFC and run_MET_GenEnsProd_vx_UPA had the following runtimes for the Hera Intel compilers:

202105050000    run_MET_GenEnsProd_vx_SFC                    47607490           SUCCEEDED                   0         1        1571.0
202105050000    run_MET_GenEnsProd_vx_UPA                    47607491           SUCCEEDED                   0         1         778.0

Anticipating a possible 4-fold increase, we would need ~6300 s, i.e. 1.75 hours; for a 4.6-fold increase, slightly over 7200 s (2 hours).
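The arithmetic above can be checked with a quick sketch. The 1571 s Intel runtime is taken from the rocotostat output above; the scaling factors are rough assumptions based on the REFC/RETOP comparison, not measurements of these specific tasks:

```python
# Back-of-the-envelope walltime estimate: scale the Hera Intel runtime
# of run_MET_GenEnsProd_vx_SFC by assumed GNU slowdown factors.
intel_sfc_runtime = 1571.0  # seconds, from the rocotostat output above

for factor in (4.0, 4.6):
    est = intel_sfc_runtime * factor
    print(f"{factor:.1f}-fold -> {est:.0f} s (~{est / 3600:.2f} h)")
    # 4.0-fold -> 6284 s (~1.75 h)
    # 4.6-fold -> 7227 s (~2.01 h)
```

This is consistent with the ~6300 s and slightly-over-7200 s figures quoted above, and suggests a 2:30:00 request leaves some headroom.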

@natalie-perlin (Collaborator, Author)

@MichaelLueken -
All the verification tasks eventually completed on Hera GNU, but the longest task, run_MET_GenEnsProd_vx_SFC, took over 2 hours (7945 s). The time requests for these tasks need to be adjusted.

 202105050000    run_MET_GenEnsProd_vx_SFC                    47855057           SUCCEEDED                   0         1        7945.0
202105050000    run_MET_EnsembleStat_vx_SFC                    47869213           SUCCEEDED                   0         1         369.0
202105050000    run_MET_GenEnsProd_vx_UPA                    47839775           SUCCEEDED                   0         1        3907.0
202105050000    run_MET_EnsembleStat_vx_UPA                    47846555           SUCCEEDED                   0         1         148.0

How should we proceed with this?

@natalie-perlin (Collaborator, Author)

@mkavulich -
what would be a good way to specify a longer walltime as a task requirement for the MET verification tasks?

The tasks run_MET_GenEnsProd_vx_SFC and run_MET_GenEnsProd_vx_UPA take well over the default one-hour walltime when running on Hera gnu/9.2.
I looked through the documentation on the workflow (https://ufs-srweather-app.readthedocs.io/en/develop/ConfigWorkflow.html) and the WE2E tests (https://ufs-srweather-app.readthedocs.io/en/develop/WE2Etests.html) for clues on the walltime setup. (Please feel free to point me to the right section in the docs if there is one!)

@MichaelLueken (Collaborator)

@natalie-perlin -

It looks like opening parm/wflow/verify_ens.yaml, going to line 140, and adding:

walltime: 02:30:00

will increase the walltime for the run_MET_GenEnsProd_vx_UPA/SFC tasks to two and a half hours. Please note that this value can be changed as desired.

The final section would look like:

metatask_GenEnsProd_EnsembleStat_NDAS:
  var:
    VAR: '{% for var in verification.VX_FIELDS %}{% if var in ["SFC", "UPA"] %}{{ "%s " % var }}{% endif %}{% endfor %}'
  task_run_MET_GenEnsProd_vx_#VAR#: &task_GenEnsProd_NDAS
    <<: *default_task_verify_ens
    command: '&LOAD_MODULES_RUN_TASK_FP; "run_vx" "&JOBSdir;/JREGIONAL_RUN_MET_GENENSPROD_OR_ENSEMBLESTAT"'
    envars: &envars_GenEnsProd_NDAS
      <<: *default_vars
      OBS_DIR: '&NDAS_OBS_DIR;'
      VAR: '#VAR#'
      MET_TOOL: 'GENENSPROD'
      OBTYPE: 'NDAS'
      ACCUM_HH: '01'
    walltime: 02:30:00
    dependency:

Additionally, you will need to add the following to line 156:

walltime: 01:00:00

Otherwise, the updated walltime will be used in the run_MET_EnsembleStat_vx_UPA/SFC section as well. The final appearance here should look like:

  task_run_MET_EnsembleStat_vx_#VAR#:
    <<: *task_GenEnsProd_NDAS
    envars:
      <<: *envars_GenEnsProd_NDAS
      MET_TOOL: 'ENSEMBLESTAT'
    walltime: 01:00:00
    dependency:

Please see /scratch2/NAGAPE/epic/Michael.Lueken/ufs-srweather-app/parm/wflow/verify_ens.yaml for what the modified file should look like.

Please note that I'm still in the process of testing these changes, so I would recommend running the test with the modified parm/wflow/verify_ens.yaml file before committing it and pushing back. I have been able to verify that the two tests that need a longer walltime will have it, while other tasks will not have the extended walltimes.
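The inheritance behavior that makes the second walltime line necessary can be demonstrated with a short sketch. This is a simplified stand-in for verify_ens.yaml (not the real file), and it assumes PyYAML is available; the walltimes are quoted because unquoted YAML 1.1 would read 02:30:00 as a base-60 integer:

```python
# YAML merge keys ("<<: *anchor") copy every entry of the anchored
# mapping, so a task that merges the GenEnsProd anchor silently
# inherits its walltime unless it sets its own.
import yaml  # PyYAML

doc = """
task_genensprod: &genensprod
  met_tool: GENENSPROD
  walltime: "02:30:00"

# Without its own walltime, this task inherits 02:30:00 via the merge key:
task_ensemblestat_inherits:
  <<: *genensprod
  met_tool: ENSEMBLESTAT

# An explicit key in the merging mapping overrides the anchored value:
task_ensemblestat_overrides:
  <<: *genensprod
  met_tool: ENSEMBLESTAT
  walltime: "01:00:00"
"""

cfg = yaml.safe_load(doc)
print(cfg["task_ensemblestat_inherits"]["walltime"])   # 02:30:00
print(cfg["task_ensemblestat_overrides"]["walltime"])  # 01:00:00
```

This is why, without the explicit `walltime: 01:00:00`, the run_MET_EnsembleStat_vx_UPA/SFC tasks would silently pick up the extended 2.5-hour walltime from the GenEnsProd anchor.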

@MichaelLueken (Collaborator)

@natalie-perlin -

My test has successfully completed:

----------------------------------------------------------------------------------------------------
Experiment name                                                  | Status    | Core hours used 
----------------------------------------------------------------------------------------------------
MET_ensemble_verification_only_vx_time_lag                         COMPLETE               8.55
----------------------------------------------------------------------------------------------------
Total                                                              COMPLETE               8.55

The two tasks that were failing are now passing with the increased walltime from parm/wflow/verify_ens.yaml:

202105050000    run_MET_GenEnsProd_vx_SFC                    47872102           SUCCEEDED                   0         1        7922.0
202105050000    run_MET_EnsembleStat_vx_SFC                    47878288           SUCCEEDED                   0         1         363.0
202105050000    run_MET_GenEnsProd_vx_UPA                    47872103           SUCCEEDED                   0         1        3838.0
202105050000    run_MET_EnsembleStat_vx_UPA                    47875308           SUCCEEDED                   0         1         147.0

Once you have verified that the modification works, please commit the change, push it to your branch, then I will relaunch the Hera Jenkins tests.

@natalie-perlin (Collaborator, Author)

Merged the recent changes from develop into my branch and added the changes to ./parm/wflow/verify_ens.yaml. The workflow generated FV3LAM_wflow.xml with the "wallclock 2:30" set correctly, and I'm waiting for the test to finish.

@MichaelLueken (Collaborator)

The Hera GNU WE2E coverage tests have been manually run and all tests have successfully passed:

----------------------------------------------------------------------------------------------------
Experiment name                                                  | Status    | Core hours used 
----------------------------------------------------------------------------------------------------
get_from_HPSS_ics_FV3GFS_lbcs_FV3GFS_fmt_nemsio_2019061200         COMPLETE              18.69
get_from_NOMADS_ics_FV3GFS_lbcs_FV3GFS                             COMPLETE              45.32
grid_RRFS_CONUS_13km_ics_FV3GFS_lbcs_FV3GFS_suite_HRRR             COMPLETE             230.44
grid_RRFS_CONUS_25km_ics_FV3GFS_lbcs_FV3GFS_suite_GFS_2017_gfdlmp  COMPLETE              17.64
grid_RRFS_CONUS_25km_ics_FV3GFS_lbcs_FV3GFS_suite_RRFS_v1beta      COMPLETE              35.98
quilting_false                                                     COMPLETE              13.69
grid_SUBCONUS_Ind_3km_ics_HRRR_lbcs_RAP_suite_WoFS_v0              COMPLETE              22.77
GST_release_public_v1                                              COMPLETE              54.02
MET_verification_only_vx                                           COMPLETE               0.13
MET_ensemble_verification_only_vx_time_lag                         COMPLETE               8.55
nco_grid_RRFS_CONUS_13km_ics_FV3GFS_lbcs_FV3GFS_suite_GFS_v16      COMPLETE             334.74
----------------------------------------------------------------------------------------------------
Total                                                              COMPLETE             781.97

Will requeue the automated tests on Hera to make sure they pass now as well.

@natalie-perlin (Collaborator, Author)

All the tasks succeeded using the updated branch:

202105050000    run_MET_GenEnsProd_vx_SFC                    47885737           SUCCEEDED                   0         1        7839.0
202105050000    run_MET_EnsembleStat_vx_SFC                    47894358           SUCCEEDED                   0         1         370.0
202105050000    run_MET_GenEnsProd_vx_UPA                    47885738           SUCCEEDED                   0         1        3855.0
202105050000    run_MET_EnsembleStat_vx_UPA                    47890632           SUCCEEDED                   0         1         147.0

So if there are no more questions, I believe this PR is ready to be merged.

@MichaelLueken (Collaborator)

The automated Jenkins tests on Hera successfully passed for GNU:

----------------------------------------------------------------------------------------------------
Experiment name                                                  | Status    | Core hours used
----------------------------------------------------------------------------------------------------
get_from_HPSS_ics_FV3GFS_lbcs_FV3GFS_fmt_nemsio_2019061200         COMPLETE              19.32
get_from_NOMADS_ics_FV3GFS_lbcs_FV3GFS                             COMPLETE              44.11
grid_RRFS_CONUS_13km_ics_FV3GFS_lbcs_FV3GFS_suite_HRRR             COMPLETE             231.61
grid_RRFS_CONUS_25km_ics_FV3GFS_lbcs_FV3GFS_suite_GFS_2017_gfdlmp  COMPLETE              17.81
grid_RRFS_CONUS_25km_ics_FV3GFS_lbcs_FV3GFS_suite_RRFS_v1beta      COMPLETE              35.96
quilting_false                                                     COMPLETE              13.79
grid_SUBCONUS_Ind_3km_ics_HRRR_lbcs_RAP_suite_WoFS_v0              COMPLETE              23.23
GST_release_public_v1                                              COMPLETE              54.28
MET_verification_only_vx                                           COMPLETE               0.14
MET_ensemble_verification_only_vx_time_lag                         COMPLETE               8.56
nco_grid_RRFS_CONUS_13km_ics_FV3GFS_lbcs_FV3GFS_suite_GFS_v16      COMPLETE             337.93
----------------------------------------------------------------------------------------------------
Total                                                              COMPLETE             786.74

and Intel:

----------------------------------------------------------------------------------------------------
Experiment name                                                  | Status    | Core hours used
----------------------------------------------------------------------------------------------------
get_from_HPSS_ics_FV3GFS_lbcs_FV3GFS_fmt_grib2_2019061200          COMPLETE               6.10
get_from_HPSS_ics_GDAS_lbcs_GDAS_fmt_netcdf_2022040400_ensemble_2  COMPLETE             774.63
get_from_HPSS_ics_HRRR_lbcs_RAP                                    COMPLETE              14.25
grid_RRFS_CONUS_25km_ics_FV3GFS_lbcs_FV3GFS_suite_GFS_2017_gfdlmp  COMPLETE               8.59
grid_RRFS_CONUS_25km_ics_FV3GFS_lbcs_FV3GFS_suite_GFS_v15p2        COMPLETE               6.08
grid_RRFS_CONUS_25km_ics_FV3GFS_lbcs_FV3GFS_suite_GFS_v16          COMPLETE              11.43
grid_RRFS_CONUS_25km_ics_FV3GFS_lbcs_RAP_suite_RAP                 COMPLETE              10.62
grid_RRFS_CONUS_25km_ics_GSMGFS_lbcs_GSMGFS_suite_GFS_v15p2        COMPLETE               6.84
grid_RRFS_CONUS_3km_ics_FV3GFS_lbcs_FV3GFS_suite_GFS_v15p2         COMPLETE             250.73
grid_RRFS_CONUS_3km_ics_FV3GFS_lbcs_FV3GFS_suite_GFS_v16           COMPLETE             312.12
grid_RRFS_CONUScompact_3km_ics_HRRR_lbcs_RAP_suite_HRRR            COMPLETE             335.11
pregen_grid_orog_sfc_climo                                         COMPLETE               7.37
----------------------------------------------------------------------------------------------------
Total                                                              COMPLETE            1743.87

Moving forward with merging this work now.

@MichaelLueken merged commit f2abb86 into ufs-community:develop on Aug 9, 2023 (4 of 5 checks passed)
@natalie-perlin (Collaborator, Author)

@gsketefian -
I'm trying to better understand which file formats the met and metplus modules use, and how the binaries need to be built to read those files.
There are some problems using MET/METplus modules built on the Derecho Cray platform, which is set up differently from other Crays such as Gaea C5. Errors appear at runtime and resemble errors we've seen before on Derecho that were resolved by more careful differentiation between serial and MPI compilers.

The binaries that return errors are in ./met/10.1.2/bin/ and include grid_stat, pb2nc, and pcp_combine (possibly more), so this must have something to do with the NetCDF files.
One of the tasks from the MET_verification experiment, get_obs_ccpa, looks for grib2 files and uses the ones staged on disk (/glade/work/epicufsrt/contrib/UFS_SRW_data/develop/obs_data/ccpa/proc/20190615/ccpa.t06z.01h.hrap.conus.gb2 on Cheyenne/Derecho), which means grib2 files are also used. This implies I may need to look into the way the wgrib2 module/binaries are built on the system.
Does it seem reasonable that NetCDF and grib2 are the only file formats used (and the ones that may require more scrutiny)?

@gsketefian (Collaborator) commented Aug 24, 2023

@natalie-perlin As far as I know, the vx that MET/METplus performs in the SRW App uses grib2, netcdf, and text formats. But I'm not an expert on METplus's inner workings.

What installation of MET/METplus on Derecho are you using? If it wasn't installed by someone on the METplus team, it might be worthwhile to contact them. They'll have more insight than me. I think Julie Prestopnik is usually the one who does installations.

BruceKropp-Raytheon pushed a commit to NOAA-EPIC/ufs-srweather-app that referenced this pull request Sep 19, 2023
* Mods to METplus conf files: TCDC specifications, correction to level specifications in point-stat mean and prob files, and added functionality to make METplus output dirs in ex scripts.

* Updated comments in MET ex-scripts for creating output directories.

* Fixed minor formatting issue in exregional_run_gridstatvx.sh
@natalie-perlin deleted the feature/modules_met_metplus branch on October 13, 2023