
HR4 GWD update for FV3 #2290

Open
Qingfu-Liu wants to merge 20 commits into develop

Conversation

@Qingfu-Liu (Collaborator) commented May 19, 2024

Commit Queue Requirements:

  • Fill out all sections of this template.
  • All sub component pull requests have been reviewed by their code managers.
  • Run the full Intel+GNU RT suite (compared to current baselines) on either Hera/Derecho/Hercules
  • Commit 'test_changes.list' from previous step

Description:

This PR (#2290) depends on fv3atm PR #836 and ccpp-physics PR #207. These PRs update the gravity wave drag (GWD) code.

Commit Message:

  * FV3 - 
    * ccpp-physics - 
  * NOAHMP - 

Priority:

  • High: This PR is created for HR4 and updates the GWD code.

Git Tracking

UFSWM:

  • None

Sub component Pull Requests:

  • NOAA-EMC/fv3atm#836
  • ufs-community/ccpp-physics#207

UFSWM Blocking Dependencies:

  • None

Changes

Regression Test Changes (Please commit test_changes.list):

  • PR Updates/Changes Baselines.

Input data Changes:

  • There is a new set of orographic data related to this PR; the new data is described in global-workflow PR#2670:
    HR4 GWD update NOAA-EMC/global-workflow#2670
    PR#2670 includes changes to four scripts plus the new orographic data.
    The new orographic data is temporarily stored at
    /scratch1/NCEPDEV/global/Qingfu.Liu/git/GWD_SHong/APR2023_SSO
    and should replace the old data at /scratch1/NCEPDEV/global/glopara/fix/ugwd/20231027.
  • New orographic input files on Hera:
    • /scratch2/NAGAPE/epic/UFS-WM_RT/NEMSfv3gfs/input-data-20240501/FV3_input_data/INPUT_L127_gfsv17
    • /scratch2/NAGAPE/epic/UFS-WM_RT/NEMSfv3gfs/input-data-20240501/FV3_input_data48/INPUT_L127_gfsv17
    • /scratch2/NAGAPE/epic/UFS-WM_RT/NEMSfv3gfs/input-data-20240501/FV3_input_data192/INPUT_L127_gfsv17
    • /scratch2/NAGAPE/epic/UFS-WM_RT/NEMSfv3gfs/input-data-20240501/FV3_input_data384/INPUT_L127_gfsv17

Library Changes/Upgrades:

  • No Updates

Testing Log:

  • RDHPCS
    • Hera
    • Orion
    • Hercules
    • Jet
    • Gaea
    • Derecho
  • WCOSS2
    • Dogwood/Cactus
    • Acorn
  • CI
  • opnReqTest (complete task if unnecessary)

@grantfirl (Collaborator)

All (@Qingfu-Liu @barlage @JongilHan66 @mdtoyNOAA @cenlinhe @BoYang-NOAA), please look over this list (copied from the test_changes.list produced by the regression tests) and confirm that all of these tests are expected to change results given the changes in NOAA-EMC/fv3atm#836 and ufs-community/ccpp-physics#207.

cpld_control_p8_mixedmode intel
cpld_control_gfsv17 intel
cpld_control_gfsv17_iau intel
cpld_restart_gfsv17 intel
cpld_mpi_gfsv17 intel
cpld_control_sfs intel
cpld_debug_gfsv17 intel
cpld_control_p8 intel
cpld_control_p8.v2.sfc intel
cpld_restart_p8 intel
cpld_control_qr_p8 intel
cpld_restart_qr_p8 intel
cpld_2threads_p8 intel
cpld_decomp_p8 intel
cpld_mpi_p8 intel
cpld_control_ciceC_p8 intel
cpld_control_c192_p8 intel
cpld_restart_c192_p8 intel
cpld_bmark_p8 intel
cpld_restart_bmark_p8 intel
cpld_s2sa_p8 intel
cpld_control_noaero_p8 intel
cpld_control_nowave_noaero_p8 intel
cpld_debug_p8 intel
cpld_debug_noaero_p8 intel
cpld_control_noaero_p8_agrid intel
cpld_control_c48 intel
cpld_warmstart_c48 intel
cpld_restart_c48 intel
cpld_control_p8_faster intel
cpld_control_pdlib_p8 intel
cpld_restart_pdlib_p8 intel
cpld_mpi_pdlib_p8 intel
cpld_debug_pdlib_p8 intel
control_CubedSphereGrid intel
control_CubedSphereGrid_parallel intel
control_latlon intel
control_wrtGauss_netcdf_parallel intel
control_c48 intel
control_c192 intel
control_c384 intel
control_p8 intel
control_p8.v2.sfc intel
control_p8_ugwpv1 intel
control_restart_p8 intel
control_noqr_p8 intel
control_restart_noqr_p8 intel
control_decomp_p8 intel
control_2threads_p8 intel
control_p8_lndp intel
control_p8_rrtmgp intel
control_p8_mynn intel
merra2_thompson intel
control_p8_faster intel
control_CubedSphereGrid_debug intel
control_wrtGauss_netcdf_parallel_debug intel
control_diag_debug intel
control_debug_p8 intel
rrfs_v1beta_debug intel
control_p8_atmlnd_sbs intel
control_p8_atmlnd intel
control_restart_p8_atmlnd intel
control_p8_atmlnd_debug intel
atmwav_control_noaero_p8 intel
atmaero_control_p8 intel
atmaero_control_p8_rad intel
atmaero_control_p8_rad_micro intel
control_c48 gnu
control_p8 gnu
control_p8_ugwpv1 gnu
control_diag_debug gnu
rrfs_v1beta_debug gnu
control_debug_p8 gnu
cpld_control_nowave_noaero_p8 gnu
cpld_control_pdlib_p8 gnu
cpld_debug_pdlib_p8 gnu

@mdtoyNOAA (Contributor)

Looks good

@grantfirl (Collaborator)

@Qingfu-Liu Here is the PR into your branch to add RT logs: Qingfu-Liu#1 (also updates FV3 submodule pointer)

update RT logs and FV3 submodule pointer
@grantfirl (Collaborator)

@Qingfu-Liu Can we move this out of draft and fill out the description?

Qingfu-Liu marked this pull request as ready for review on June 6, 2024, 12:42
@Qingfu-Liu (Collaborator, Author)

@grantfirl Thank you very much. I just updated the description and moved the PR out of draft.

@jkbk2004 (Collaborator)

@Qingfu-Liu Can you sync up the branch? Then we can start working on this PR.

@grantfirl (Collaborator)

@jkbk2004 Thanks very much for syncing this. I didn't have permission to do this on Qingfu's branch. I'm still out of the country until the middle of next week, although I'll be available off and on throughout the day to merge if necessary and no one else is available. Dustin is at the same conference as me.

@jkbk2004 (Collaborator) commented Jun 20, 2024

@Qingfu-Liu @grantfirl @dustinswales @mdtoyNOAA Can you check the following error message about mntvar (the subgrid orographic statistics data)?

111: forrtl: severe (408): fort: (2): Subscript #2 of the array MNTVAR has value 15 which is greater than the upper bound of 14

It crashes in rrfs_v1beta_debug (intel), control_p8_atmlnd_sbs (intel), control_p8_atmlnd (intel), control_p8_atmlnd_debug (intel), and rrfs_v1beta_debug (gnu). Experiment output is available at /scratch2/NAGAPE/epic/Jong.Kim/stmp2/Jong.Kim/FV3_RT/rt_3529436/rrfs_v1beta_debug_intel/err

@Qingfu-Liu (Collaborator, Author) commented Jun 20, 2024

@jkbk2004 There is a new set of orographic data related to this PR; the new data is described in global-workflow PR#2670. Can you run the regression tests using the new data? Thank you very much.
NOAA-EMC/global-workflow#2670
PR#2670 includes changes to four scripts plus the new orographic data.
The new orographic data is temporarily stored at
/scratch1/NCEPDEV/global/Qingfu.Liu/git/GWD_SHong/APR2023_SSO
and should replace the old data at /scratch1/NCEPDEV/global/glopara/fix/ugwd/20231027.

@BrianCurtis-NOAA (Collaborator)

If there are any anticipated changes to data in the input-data-YYYYMMDD directories, they should be very specifically addressed in the template. Please make changes there.

@Qingfu-Liu (Collaborator, Author)

OK, thanks. After looking at the error, I am not sure the failed tests are related to the data change. I just added Jongil, who works on this code, to see if he knows about this.

@Qingfu-Liu (Collaborator, Author)

@JongilHan66 Can you take a look at the errors @jkbk2004 mentioned in this PR? Thanks.

@JongilHan66 (Collaborator)

Currently we use gwd_opt=2 (unified UGWP GWD), which is defined in config.fcst. With this option the dimension of MNTVAR is increased from 10 to 24, as defined in GFS_typedefs.F90.
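
To illustrate the dependency described above, here is a minimal, hypothetical Fortran sketch of how the second dimension of MNTVAR could be tied to gwd_opt (the sizes 10, 14, and 24 come from this thread; the actual logic in GFS_typedefs.F90 may differ):

```fortran
! Hypothetical sketch, not the actual GFS_typedefs.F90 code: the number of
! subgrid orographic statistics fields (the second dimension of mntvar)
! depends on which GWD scheme gwd_opt selects.
program mntvar_dims
  implicit none
  integer :: gwd_opt, nmtvr
  integer, parameter :: im = 8       ! number of horizontal points (illustrative)
  real, allocatable :: mntvar(:,:)

  gwd_opt = 2                        ! unified UGWP GWD, as set via config.fcst
  select case (gwd_opt)
  case (1)
    nmtvr = 14                       ! the upper bound the failing runs complained about
  case (2)
    nmtvr = 24                       ! dimension expected by the HR4 GWD update
  case default
    nmtvr = 10
  end select

  allocate(mntvar(im, nmtvr))
  ! A subscript such as mntvar(i,15) is only legal when nmtvr >= 15, i.e. with
  ! gwd_opt=2; with gwd_opt=1 it produces exactly the bounds error shown above.
  print *, 'gwd_opt =', gwd_opt, ' nmtvr =', nmtvr
end program mntvar_dims
```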

@JongilHan66 (Collaborator)

@Qingfu-Liu If you haven't yet, please update the config.fcst files with those in the gfs and gefs directories under /lfs/h2/emc/physics/noscrub/jongil.han/git_HR4_gwd/global-workflow/parm/config

@Qingfu-Liu (Collaborator, Author)

@JongilHan66 Those files are updated in workflow PR#2670:
https://github.com/NOAA-EMC/global-workflow/pull/2670/files

@JongilHan66 (Collaborator)

@Qingfu-Liu Did you also update the "parsing_namelists_FV3.sh" and "parsing_namelists_FV3_nest.sh" in the ush directory?

@Qingfu-Liu (Collaborator, Author)

@JongilHan66 Yes, both files are updated in PR#2670. Four files in total are updated in that PR.

@JongilHan66 (Collaborator)

@Qingfu-Liu If gwd_opt=2 in config.fcst, the dimension of MNTVAR is 24 and the model should not complain about an upper bound of 14, which corresponds to gwd_opt=1.

@Qingfu-Liu (Collaborator, Author)

@JongilHan66 Thanks. I think I understand the problem.

@bingfu-NOAA

Do all those tests pull from the right input-data directory? Or, if not, are those new settings compatible w/ the old input data?

Those new settings are compatible w/ the old input data, although it is better to point to the new dataset in case the RTs are used by developers for science development.

Nick @NickSzapiro-NOAA is working on GEFSv13 RTs; I am providing a set of new input data.

@DeniseWorthen (Collaborator)

@jkbk2004 I don't understand why you need a new cpld_control_run_gfsv17. You already have the change in cpld_control_run, and all the cpld tests will use that (and the new input data).

The question is more about the tests which don't use cpld_control_run and which will not use the new input data. Given @yangfanglin's comment that those tests will still run w/ the older data, it may be a later PR that updates those tests to the new settings when they move away from the p8 settings.

@jkbk2004 (Collaborator)

Got it! I will finalize default_var.sh per the comment above and update the cpld cases under tests/tests accordingly. We will expect another PR to update all other cases.

@junwang-noaa (Collaborator)

So far, one of the changes made to the global coupled regression tests is the fix file update. @yangfanglin, should the fix files be updated for the global tests at all resolutions? Also, what version is the fix file?

@JessicaMeixner-NOAA (Collaborator)

@junwang-noaa The fix file version is 20240624 (see: /scratch1/NCEPDEV/global/glopara/fix/ugwd/20240624 or the original directory /scratch1/NCEPDEV/global/Qingfu.Liu/git/GWD_SHong/APR2023_SSO and global-workflow issue: NOAA-EMC/global-workflow#2716)

@junwang-noaa (Collaborator)

So this requires the sfc input files to be regenerated from the new oro data?

@yangfanglin (Collaborator)

@junwang-noaa It is not required to regenerate the sfc data.

@junwang-noaa (Collaborator) commented Jun 26, 2024

Should the fix files be updated for the global tests with all the resolutions? @JessicaMeixner-NOAA

@yangfanglin (Collaborator)

@junwang-noaa See Denise's comments above in this thread.

@jkbk2004 (Collaborator)

@JessicaMeixner-NOAA @yangfanglin @JongilHan66 Can you take a look at /scratch1/NCEPDEV/stmp2/Jong.Kim/FV3_RT/rt_102912/cpld_control_gfsv17_intel/input.nml? I used the requested parameters. It crashes; the err file is in the experiment directory.

@JongilHan66 (Collaborator)

@jkbk2004 At least, the settings for HR4 GWD update are correct.

@JessicaMeixner-NOAA (Collaborator)

The run from the workflow has been submitted, but the queue is slow. I'll report back on that when I know more.

@JongilHan66 (Collaborator)

@jkbk2004 Suggestions for tests: 1) a test at C768 resolution; 2) a C96 test but with the old oro_data.

@jkbk2004 (Collaborator) commented Jun 26, 2024

We don't have a regression test case for C768. The cpld_bmark_p8 case is C384; it crashes the same way. cpld_control_gfsv17_intel is C96; it still blows up even if I use the old oro files. BTW, the crash happens around https://github.com/Qingfu-Liu/ccpp-physics/blob/00dc921754a699d361ee0f98b4f321ca021563a1/physics/GWD/drag_suite.F90#L2616. It complains:

80: [h16c04:1656117:0:1656117] Caught signal 11 (Segmentation fault: address not mapped to object at address 0x38)

@JongilHan66 @grantfirl @dustinswales @mdtoyNOAA This sounds like a memory-mapping issue. Could there be an array mismatch for some of those diagnostic variables?
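
As a purely hypothetical illustration of the "diagnostic variables" suspicion above, a segfault at an unmapped address is what an unguarded write to a conditionally allocated diagnostic array would produce. The array name, switch, and sizes below are illustrative, not taken from drag_suite.F90:

```fortran
program diag_guard
  implicit none
  integer, parameter :: im = 4, km = 126
  real, allocatable :: dtauy2d_diag(:,:)   ! hypothetical diagnostic array
  logical :: ldiag_ugwp

  ldiag_ugwp = .false.                     ! diagnostics switched off in this configuration
  if (ldiag_ugwp) allocate(dtauy2d_diag(im, km))

  ! Writing to the array unconditionally would touch unallocated memory
  ! ("address not mapped to object" in the traceback). Guard the access:
  if (allocated(dtauy2d_diag)) dtauy2d_diag = 0.0
  print *, 'allocated(dtauy2d_diag) =', allocated(dtauy2d_diag)
end program diag_guard
```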

@jkbk2004 (Collaborator)

@grantfirl @dustinswales @JongilHan66 If I run with GNU and debug on (cpld_control_gfsv17_intel), it shows:

138: At line 2557 of file /scratch2/NCEPDEV/marine/Jong.Kim/UFS-RT/test-2290/FV3/ccpp/physics/physics/GWD/drag_suite.F90
138: Fortran runtime error: Index '127' of dimension 2 of array 'velco' above upper bound of 126
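
An index one past the top of the vertical dimension is the classic symptom of an unguarded k+1 access at the model top. A hypothetical sketch of that failure pattern and its guard follows (velco, km, and the loop are illustrative, not the actual drag_suite.F90 code):

```fortran
program velco_bounds
  implicit none
  integer, parameter :: im = 4, km = 126   ! 126 vertical layers, matching the error above
  real :: velco(im, km)
  integer :: i, k

  velco = 1.0
  do i = 1, im
    ! A loop written as "do k = 1, km" that reads velco(i, k+1) touches
    ! velco(i, km+1) at the top level, i.e. index 127 above the upper bound
    ! of 126 reported in the debug run. Guard the k+1 access instead:
    do k = 1, km - 1
      velco(i, k) = 0.5 * (velco(i, k) + velco(i, k + 1))
    end do
  end do
  print *, 'max(velco) =', maxval(velco)
end program velco_bounds
```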

@JongilHan66 (Collaborator)

@jkbk2004 To narrow down the cause, could you run two more tests: 1) with do_gsl_drag_tofd=.false. and 2) with do_gwd_opt_psl=.false.?

@yangfanglin (Collaborator) commented Jun 27, 2024

  1. I first ran the RT test cpld_control_gfsv17_intel using the latest head of ufs-weather-model, then replaced ./INPUT/oro_data_ls.tile*.nc and INPUT/oro_data_ss.tile*.nc with the new dataset archived in /lfs/h2/emc/global/noscrub/emc.global/FIX/fix/ugwd/20240624/C96 and reran the forecast. It worked too.

This indicates the new dataset itself is fine. The model code https://github.com/Qingfu-Liu/fv3atm/tree/HR4-GWD-update might have issues (@dustinswales).

  2. I recompiled the model with the above FV3 branch and its submodules to get a new model executable and reran the same RT without changing any namelist options. Again the RT worked fine. So Qingfu's branch still works for the "old" GWD settings.

  3. Finally, I dropped the executable from step 2 into the step 1 RT directory, updated input.nml to set do_gsl_drag_ss=.false., added do_gwd_opt_psl=.true. and psl_gwd_dx_factor=6.0, and reran the case. The model failed when calling drag_suite.F90, like what John found.

@JongilHan66 Could you compare https://github.com/Qingfu-Liu/ccpp-physics/commits/00dc921754a699d361ee0f98b4f321ca021563a1/physics/GWD/drag_suite.F90 to what you had used?
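
For reference, a minimal, hypothetical input.nml fragment collecting the GWD-related switches mentioned in this thread. Only gwd_opt, do_gsl_drag_ss, do_gsl_drag_tofd, do_gwd_opt_psl, and psl_gwd_dx_factor come from the comments above; the remaining entry and the exact combination of values are illustrative, and the real &gfs_physics_nml contains many more settings:

```fortran
&gfs_physics_nml
  gwd_opt           = 2        ! unified UGWP GWD, as set via config.fcst
  do_gsl_drag_ls_bl = .true.   ! illustrative companion switch, not discussed above
  do_gsl_drag_ss    = .false.  ! turned off in step 3 of the test above
  do_gsl_drag_tofd  = .true.   ! one of the toggles suggested for isolating the crash
  do_gwd_opt_psl    = .true.   ! new HR4 PSL GWD option
  psl_gwd_dx_factor = 6.0      ! value used in the HR4 tests above
/
```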

@JessicaMeixner-NOAA (Collaborator)

My run from the workflow finally went in, but only after the last time I checked last night. Unfortunately, it had an error about a variable not being found, not an actual forecast failure (yet). I'll keep you posted on those runs.

@jkbk2004 (Collaborator)

We are going to start working on #2183 today while debugging continues in this PR. @BrianCurtis-NOAA @FernandoAndrade-NOAA @zach1221 FYI.

@JessicaMeixner-NOAA (Collaborator)

My run from global-workflow also segfaulted, but hopefully we can use this to confirm settings between workflow and rt.sh:

Log file: https://github.com/JessicaMeixner-NOAA/global-workflow/commits/HR4-GWD-update/
global-workflow clone: /scratch1/NCEPDEV/climate/Jessica.Meixner/GWDupdate/global-workflow
forecast directory: /scratch1/NCEPDEV/stmp2/Jessica.Meixner/RUNDIRS/test01/gfsfcst.2021032312/fcst.2045110

@grantfirl (Collaborator)

@jkbk2004 @dustinswales @JongilHan66 I'm back from the UK, and I can start looking at this failure.

@grantfirl (Collaborator)

@jkbk2004 @dustinswales @JongilHan66 FYI, I'm running a quick test for cpld_control_gfsv17_intel with a fix for the comment I made in ufs-community/ccpp-physics#207 (comment). I'll follow up with results in a bit.

@jkbk2004 (Collaborator)

@grantfirl One thing to check: would it be OK to let #2183 go first? This PR is mostly the GWD update. I think it's OK.

@grantfirl (Collaborator)

If it's already being worked on, sure. Once #2183 is merged, I'll update submodule pointers for this one so that it should be good to go next.

@DeniseWorthen (Collaborator)

Given the comment history and issues with this PR, I think it is problematic to update only a subset of tests to the new input data. If we maintain different tests with different oro_data_ss* and oro_data_ls* versions, then we won't actually know if the new data works across all applications. From Fanglin's comment, it is expected that the new input data should work even for tests which retain the older/different namelist settings.

I would suggest a new input-data directory be created with the new ss and ls files, to be used by all configurations. @yangfanglin @JessicaMeixner-NOAA @jkbk2004

@grantfirl (Collaborator) commented Jun 27, 2024

@jkbk2004 @dustinswales @JongilHan66 FYI, I've found a couple of bugs in the new code in drag_suite.F90 that were causing seg faults in certain configurations. I think I have them fixed, but I'm running a final test of cpld_control_gfsv17_intel on Hera to make sure before I push the changes.

@grantfirl (Collaborator)

@jkbk2004 @dustinswales @JongilHan66 The cpld_control_gfsv17 test (and dependent tests) should run to completion now, but with different results. See ufs-community/ccpp-physics@c6dec90 for the memory problem fix.

Labels

  • Baseline Updates: Current baselines will be updated.
  • New Input Data Req'd: This PR requires new data to be synced across platforms.