Update AQM RT's to v16 and update input-data (Merged with Bring AQM changes from production/AQM.v7 into develop branch #2279) #2287

Open
wants to merge 22 commits into
base: develop
from

Conversation

BrianCurtis-NOAA
Collaborator

@BrianCurtis-NOAA BrianCurtis-NOAA commented May 17, 2024

Commit Queue Requirements:

  • Fill out all sections of this template.
  • All sub component pull requests have been reviewed by their code managers.
  • Run the full Intel+GNU RT suite (compared to current baselines) on either Hera/Derecho/Hercules
  • Commit 'test_changes.list' from previous step

Description:

  • New input data to align with AQM v7
  • Update AQM RT's to use gfs v16 instead of gfs v15p2.
  • Bring in production/AQM.v7 changes

Commit Message:

* UFSWM:
  * New input data to align with AQM v7
  * Update AQM RT's to use gfs v16 instead of gfs v15p2.
* AQM:
  * Bring in production/AQM.v7 changes to develop branch.

Priority:

  • Normal

Git Tracking

UFSWM:

Sub component Pull Requests:

UFSWM Blocking Dependencies:

  • None

Changes

Regression Test Changes (Please commit test_changes.list):

  • PR Updates/Changes Baselines.

Input data Changes:

  • New input data:
    /scratch2/NAGAPE/epic/UFS-WM_RT/NEMSfv3gfs/input-data-20240501/AQMv7

Library Changes/Upgrades:

  • No Updates

Testing Log:

  • RDHPCS
    • Hera
    • Orion
    • Hercules
    • Jet
    • Gaea
    • Derecho
  • WCOSS2
    • Dogwood/Cactus
    • Acorn
  • CI
  • opnReqTest (complete task if unnecessary)

@BrianCurtis-NOAA
Collaborator Author

I need to reduce the resources of this setup before this PR is moved to ready.

@BrianCurtis-NOAA BrianCurtis-NOAA marked this pull request as ready for review May 21, 2024 21:54
@BrianCurtis-NOAA
Collaborator Author

I've just tested the two atmaq tests on Hera to make sure the merge didn't impact them, and both completed successfully. Now I'm running the full suite on Hera to create an up-to-date test_changes.list for this PR.

@BrianCurtis-NOAA BrianCurtis-NOAA changed the title Update AQM RT's to v16 and update input-data Update AQM RT's to v16 and update input-data (Merged with Bring AQM changes from production/AQM.v7 into develop branch #2279) Jun 12, 2024
@BrianCurtis-NOAA
Collaborator Author

I am going to get a review completed for the AQM subcomponent. Please feel free to start testing. New input data is already staged on Hera:

/scratch2/NAGAPE/epic/UFS-WM_RT/NEMSfv3gfs/input-data-20240501/AQMv7

@BrianCurtis-NOAA BrianCurtis-NOAA added Baseline Updates Current baselines will be updated. New Input Data Req'd This PR requires new data to be sync across platforms New Baselines New baselines will be added to project. labels Jun 12, 2024
@BrianCurtis-NOAA
Collaborator Author

Also, please remove any old AQM baselines from this new bl_date: the names of the tests have changed and we no longer need the old ones.

@BrianCurtis-NOAA
Collaborator Author

BrianCurtis-NOAA commented Jun 12, 2024

@FernandoAndrade-NOAA @jkbk2004 @zach1221 please move AQMv7 onto the other RDHPCS platforms if it isn't there already, and feel free to start testing.

@jkbk2004
Collaborator

@zach1221 @FernandoAndrade-NOAA The new AQMv7 input files are ready on Orion/Hercules/Gaea/Derecho. I will check whether we can use lfs5 on Jet.

@zach1221 zach1221 added the Ready for Commit Queue The PR is ready for the Commit Queue. All checkboxes in PR template have been checked. label Jun 12, 2024
@zach1221
Collaborator

Is anyone else having trouble getting the new regional_atmaq_v16_debug case to pass baseline creation? It failed for me on Hercules and Derecho, and it doesn't appear to be a wallclock issue. @BrianCurtis-NOAA, could you take a look if you're able? Here's my error log:
/glade/derecho/scratch/zshrader/FV3_RT/rt_27722/regional_atmaq_v16_debug_intel/err
forrtl: severe (408): fort: (33): Shape mismatch: The extent of dimension 1 of array PRESF is 25 and the corresponding extent of array MET_DATA is 65

@FernandoAndrade-NOAA
Collaborator

FernandoAndrade-NOAA commented Jun 12, 2024

Gaea is running into issues with regional_atmaq_v16_debug_intel. It looks like the same error Zach is seeing in his run:

372: forrtl: severe (408): fort: (33): Shape mismatch: The extent of dimension 1 of array PRESF is 25 and the corresponding extent of array MET_DATA is 65
372:
372: Image              PC                Routine            Line        Source
372: fv3.exe            00000000025F9180  pt3d_stks_defn_mp         232  PT3D_STKS_DEFN.F
372: fv3.exe            00000000025ECA9E  pt3d_defn_mp_get_         118  PT3D_DEFN.F
372: fv3.exe            000000000232FBF4  emis_defn_mp_get_         540  EMIS_DEFN.F
372: fv3.exe            0000000001F55B3F  vdiff_                    348  vdiffproc.F
372: fv3.exe            0000000001B5FCBE  cmaq_mod_mp_cmaq_         160  cmaq_mod.F90
372: fv3.exe            0000000001B5E1CC  cmaq_model_mod_mp         130  cmaq_model_mod.F90
372: fv3.exe            0000000001ADCD6E  aqm_comp_mod_mp_a         202  aqm_comp_mod.F90
372: fv3.exe            0000000001ADAEFB  aqm_mp_modeladvan         584  aqm_cap.F90
372: fv3.exe            0000000000CFF1B8  Unknown               Unknown  Unknown
...
372: fv3.exe            0000000000430C2E  MAIN__                    406  UFS.F90
372: fv3.exe            000000000042D0FD  Unknown               Unknown  Unknown
372: libc-2.31.so       00007F3375A3E24D  __libc_start_main     Unknown  Unknown
372: fv3.exe            000000000042D02A  Unknown               Unknown  Unknown

/gpfs/f5/epic/scratch/Fernando.Andrade-maldonado/RT_RUNDIRS/Fernando.Andrade-maldonado/FV3_RT/rt_89377/regional_atmaq_v16_debug_intel/err
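
When the same failure shows up on several machines, it can help to pull the array names and extents out of each err file mechanically rather than by eye. The following is a minimal sketch, assuming the message text matches the forrtl line quoted above; the function name `find_shape_mismatches` is hypothetical and not part of the RT scripts.

```python
import re

# Matches the Intel Fortran runtime shape-check message quoted in this thread.
# Any per-rank prefix such as "372: " is ignored because we search, not match.
SHAPE_MISMATCH = re.compile(
    r"Shape mismatch: The extent of dimension (?P<dim>\d+) of array (?P<lhs>\w+) "
    r"is (?P<lhs_extent>\d+) and the corresponding extent of array (?P<rhs>\w+) "
    r"is (?P<rhs_extent>\d+)"
)


def find_shape_mismatches(log_text):
    """Return one dict per shape-mismatch message found in the log text."""
    results = []
    for m in SHAPE_MISMATCH.finditer(log_text):
        results.append(
            {
                "dim": int(m["dim"]),
                "lhs": m["lhs"],
                "lhs_extent": int(m["lhs_extent"]),
                "rhs": m["rhs"],
                "rhs_extent": int(m["rhs_extent"]),
            }
        )
    return results
```

Reading an err file and passing its contents to `find_shape_mismatches` would give a quick way to confirm that Gaea, Hercules, and Derecho all report the same arrays and the same extents. Note that the sketch expects each message on a single line; a message wrapped by the terminal would need to be rejoined first.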

@BrianCurtis-NOAA
Collaborator Author

I was able to run to completion on Hera, Acorn, and WCOSS2. Hmm. I'll chat with the AQM devs to see if that function call can be avoided or the issue fixed. Could you try the debug test once more if you haven't run it twice yet? If it still fails, what are the library differences between the failing machines and Hera?

@zach1221
Collaborator

> I was able to run to completion on Hera, Acorn, and WCOSS2. Hmm. I'll chat with the AQM devs to see if that function call can be avoided or the issue fixed. Could you try the debug test once more if you haven't run it twice yet? If it still fails, what are the library differences between the failing machines and Hera?

Same result for me on follow-up attempts. I'm looking through the libraries on Hercules vs. Hera to see if anything stands out. Will keep you posted.

@FernandoAndrade-NOAA
Collaborator

FernandoAndrade-NOAA commented Jun 12, 2024

Same error on Gaea from my rerun, unfortunately.

@BrianCurtis-NOAA
Collaborator Author

How is the Hera run coming along? Did baselines generate OK?

@FernandoAndrade-NOAA
Collaborator

> How is the Hera run coming along? Did baselines generate OK?

Yes, sorry, I missed leaving a note on that. Generation was fine; the queue is just slow today. There are still about 94 tasks left.

@BrianCurtis-NOAA
Collaborator Author

So we have Hera/Acorn/WCOSS2 as known working for the debug test, and Derecho, Hercules, and Gaea as failing.

Since the error is on the shape of an array, I am curious if maybe there was a corruption that occurred while rsync-ing to the RDHPCS platforms. Could you try to re-rsync on a fast-ish platform (that fails) and see if the debug test runs to completion?
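
One way to test the corruption hypothesis without rerunning the model is to compare per-file checksums of the staged input data on a passing and a failing platform. Below is a minimal sketch, assuming a manifest is built on each machine and the two manifests are then compared in one place; the helper names `checksum_tree` and `diff_manifests` are hypothetical.

```python
import hashlib
import os


def checksum_tree(root):
    """Map each file path (relative to root) to its SHA-256 hex digest."""
    digests = {}
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            h = hashlib.sha256()
            with open(path, "rb") as f:
                # Read in 1 MiB chunks so large input files are not loaded whole.
                for chunk in iter(lambda: f.read(1 << 20), b""):
                    h.update(chunk)
            digests[os.path.relpath(path, root)] = h.hexdigest()
    return digests


def diff_manifests(reference, candidate):
    """Return (missing, extra, changed) lists of relative paths, each sorted."""
    ref_paths, cand_paths = set(reference), set(candidate)
    missing = sorted(ref_paths - cand_paths)
    extra = sorted(cand_paths - ref_paths)
    changed = sorted(
        p for p in ref_paths & cand_paths if reference[p] != candidate[p]
    )
    return missing, extra, changed
```

If all three lists come back empty for the AQMv7 directory, the copies are byte-identical and the sync can be ruled out as the cause. rsync's own `--checksum` option performs a similar comparison during the transfer itself.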

@zach1221
Collaborator

> So we have Hera/Acorn/WCOSS2 as known working for the debug test, and Derecho, Hercules, and Gaea as failing.
>
> Since the error is on the shape of an array, I am curious if maybe there was a corruption that occurred while rsync-ing to the RDHPCS platforms. Could you try to re-rsync on a fast-ish platform (that fails) and see if the debug test runs to completion?

I'll try resyncing with Derecho.

@jkbk2004
Collaborator

> So we have Hera/Acorn/WCOSS2 as known working for the debug test, and Derecho, Hercules, and Gaea as failing.
>
> Since the error is on the shape of an array, I am curious if maybe there was a corruption that occurred while rsync-ing to the RDHPCS platforms. Could you try to re-rsync on a fast-ish platform (that fails) and see if the debug test runs to completion?

@BrianCurtis-NOAA Rsyncing again on Gaea now. @zach1221 @FernandoAndrade-NOAA FYI

@FernandoAndrade-NOAA
Collaborator

Just leaving a note that the error is still occurring on the Gaea rerun.

@jkbk2004
Collaborator

@BrianCurtis-NOAA Should we hold this PR for a bit and move on to #2279?

@BrianCurtis-NOAA
Collaborator Author

Were we able to see if Jet was affected by the issue?

@FernandoAndrade-NOAA
Collaborator

> Were we able to see if Jet was affected by the issue?

I have not tried Jet. From what I understood, we were going to wait for results from Gaea to see if it was just a sync issue. Would you like me to start a run on Jet, or should we move on?

@FernandoAndrade-NOAA
Collaborator

Given that Derecho and Hercules also failed, I'd suggest we move on to the next PR.

@jkbk2004
Collaborator

@BrianCurtis-NOAA I expect #2283 to go smoothly, since it only changes the WAM cases. We can revisit the AQM PRs either tomorrow or Monday.

@BrianCurtis-NOAA
Collaborator Author

OK, move on. Please look into why Hera/Acorn/WCOSS2 pass while the others do not. Hopefully you can identify potential problem spots by comparing the compiler/library versions between, say, Hera and Gaea.
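
The comparison suggested here could be scripted roughly as follows. This is a minimal sketch, assuming the loaded modules on each machine have been saved as one `name/version` entry per line (for example from `module list` output); `parse_modules` and `diff_modules` are hypothetical helpers, and the entries in the usage lines are placeholders, not the real stacks on either platform.

```python
def parse_modules(lines):
    """Turn 'name/version' strings into a {name: version} dict."""
    modules = {}
    for line in lines:
        entry = line.strip()
        if not entry:
            continue
        # Split on the last slash so names that contain slashes stay intact.
        name, _, version = entry.rpartition("/")
        if not name:
            # No slash at all: treat the whole entry as a name without a version.
            name, version = entry, ""
        modules[name] = version
    return modules


def diff_modules(passing, failing):
    """Return {name: (passing_version, failing_version)} for modules that differ."""
    names = sorted(set(passing) | set(failing))
    return {
        n: (passing.get(n), failing.get(n))
        for n in names
        if passing.get(n) != failing.get(n)
    }


# Placeholder entries for illustration only.
hera = parse_modules(["libA/1.0", "libB/2.3"])
gaea = parse_modules(["libA/1.0", "libB/2.4", "libC/0.9"])
print(diff_modules(hera, gaea))
# {'libB': ('2.3', '2.4'), 'libC': (None, '0.9')}
```

A module present on only one side shows up with `None` for the other, which makes it easy to spot when a library is missing rather than just at a different version.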

@jkbk2004 jkbk2004 removed the Ready for Commit Queue The PR is ready for the Commit Queue. All checkboxes in PR template have been checked. label Jun 14, 2024
@jkbk2004
Collaborator

@BrianCurtis-NOAA I think we need a fix at https://github.com/BrianCurtis-NOAA/AQM/blob/b79f95f7de95b431feb74400aebb3a57e992c759/src/model/src/PT3D_STKS_DEFN.F#L232.

PRESF( 0:EMLYRS ) = CNVPA2MB * Met_Data % PRESF( C,R,1:EMLYRS+1 )

It runs OK on Gaea once the line is updated. @zach1221 @FernandoAndrade-NOAA It might be worth checking on Derecho. I don't expect any impact from the line update, but we need to confirm.
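
The extent arithmetic behind the updated line can be checked with a short sketch. It assumes that `EMLYRS` is 24 (inferred from the reported extent of 25 for `PRESF( 0:EMLYRS )`), that the previous form of the line used the whole vertical dimension of `Met_Data % PRESF` (65 levels per the error message), and that this dimension is 1-based. The value of `CNVPA2MB` and the pressure values are made up for illustration.

```python
def extent(lower, upper):
    """Number of elements in a Fortran section lower:upper (inclusive, stride 1)."""
    return max(upper - lower + 1, 0)


EMLYRS = 24       # inferred: PRESF( 0:EMLYRS ) has extent 25 in the error message
MET_LEVELS = 65   # extent reported for MET_DATA in the error message

lhs = extent(0, EMLYRS)             # PRESF( 0:EMLYRS )
rhs_whole = MET_LEVELS              # whole vertical dimension (assumed old form)
rhs_sliced = extent(1, EMLYRS + 1)  # Met_Data % PRESF( C,R,1:EMLYRS+1 )

print(lhs, rhs_whole, rhs_sliced)   # 25 65 25

# The same assignment on plain lists, with made-up pressures in Pa.
CNVPA2MB = 1.0e-2  # Pa -> mb; value assumed from the constant's name
met_column = [100000.0 - 1000.0 * k for k in range(MET_LEVELS)]
# Fortran 1:EMLYRS+1 (1-based, inclusive) corresponds to Python [0:EMLYRS + 1].
presf = [CNVPA2MB * p for p in met_column[0:EMLYRS + 1]]
assert len(presf) == lhs
```

With the explicit section, both sides have 25 elements, which is what the runtime shape check requires in debug builds.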

Labels
Baseline Updates Current baselines will be updated. New Baselines New baselines will be added to project. New Input Data Req'd This PR requires new data to be sync across platforms
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Update ATMAQ tests to v16
4 participants