Regression Testing Updates: GNU/Intel into one RT. New logs/ dir in tests/ #1718

Merged: 80 commits merged into ufs-community:develop on May 31, 2023

Conversation

BrianCurtis-NOAA
Collaborator

@BrianCurtis-NOAA BrianCurtis-NOAA commented Apr 20, 2023

Description

rt.conf updates:

  • Combined the Intel and GNU tests into one conf file.
  • Added a section to the COMPILE line to hard-code compile numbers.
  • Added a section to the COMPILE line to specify the compiler.
  • Ordered RUN lines globally for easier viewing with grep.
  • Changed fv3 in RUN lines to "baseline" to make its purpose easier to understand.
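
With the compiler named on each COMPILE line, a single conf file can be filtered per compiler with standard tools. The sketch below assumes a pipe-delimited layout in which the third COMPILE field is the compiler and the second RUN field is the test name. That field order, the sample lines, and the helper names sample_conf and tests_for_compiler are illustrative assumptions, not copied from the real rt.conf.

```shell
#!/bin/bash
# Hypothetical rt.conf excerpt; the field order is assumed for illustration.
sample_conf() {
cat <<'EOF'
COMPILE | 1 | intel | <options> |
RUN | control_c48 | baseline |
RUN | control_c96 | baseline |
COMPILE | 2 | gnu | <options> |
RUN | control_c48 | baseline |
EOF
}

# Print <test>_<compiler> for every RUN line that follows a COMPILE line
# for the requested compiler.
tests_for_compiler() {
  local want=$1
  awk -F'|' -v want="$want" '
    { for (i = 1; i <= NF; i++) gsub(/^[ \t]+|[ \t]+$/, "", $i) }  # trim fields
    $1 == "COMPILE" { compiler = $3 }
    $1 == "RUN" && compiler == want { print $2 "_" compiler }
  '
}

sample_conf | tests_for_compiler gnu
```

Called with intel instead of gnu, the same helper would list the two Intel tests from the sample.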

rt.sh updates:

  • Added the -a option; -a is required unless ACCNR is set in the environment.
  • Fixed some potential issues identified in the scripts. Output text should be less repetitive.
  • Removed the INTEL/GNU baseline develop directories.
  • Test run directory names now include the compiler being used.
  • Logs now contain the hashes of the UFSWM and its submodules.
  • Moved BL_DATE out of rt.sh into bl_date.conf.
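
The -a rule can be sketched as a small getopts loop: the account comes from -a if given, otherwise from ACCNR in the environment, and the script stops if neither is available. The function name resolve_account, the error text, and the choice to let -a override ACCNR are assumptions; the real rt.sh handles many more options.

```shell
#!/bin/bash
# Sketch only: resolve the HPC account from -a or from ACCNR.
resolve_account() {
  local account=${ACCNR:-}      # start from the environment, if set
  local opt OPTIND=1
  while getopts ":a:" opt; do
    case $opt in
      a) account=$OPTARG ;;     # assumption: -a overrides ACCNR
      :) echo "option -$OPTARG requires an argument" >&2; return 2 ;;
      *) ;;                     # other options are ignored in this sketch
    esac
  done
  if [[ -z $account ]]; then
    echo "no account given: pass -a <account> or set ACCNR" >&2
    return 1
  fi
  echo "$account"
}

resolve_account -a my_account   # prints my_account
```

The function returns a non-zero status rather than exiting, so a caller can decide how to report the missing account.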

bl_date.conf:

  • BL_DATE now lives here for easier code management: in each PR, actual rt.sh changes can be told apart from a BL_DATE change.
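
Keeping the date in its own file means a baseline bump touches only bl_date.conf. A minimal sketch follows, assuming the file holds a single shell assignment that the test script sources. The file contents and the temporary location are illustrative, and the date is simply the one that appears in the baseline path used later in this conversation.

```shell
#!/bin/bash
# Sketch: a stand-in bl_date.conf written to a temp dir (contents are assumed).
conf_dir=$(mktemp -d)
echo 'export BL_DATE=20230413' > "${conf_dir}/bl_date.conf"

# The test script sources the file instead of hard-coding the date.
source "${conf_dir}/bl_date.conf"
baseline_dir="develop-${BL_DATE}"
echo "${baseline_dir}"          # prints develop-20230413

rm -rf "${conf_dir}"
```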

Log updates:

  • RegressionTests_.log and opnReqTest_.log now go into the new logs/ directory in tests/.
  • Log files are now named RegressionTests_<machine>.log instead of RegressionTests_<machine>.<compiler>.log.

Baseline storage updates:

  • develop-2023XXXX/INTEL|GNU/<test_dir> is now just develop-2023XXXX/<test_dir>.
  • Run/test directories now have RT_COMPILER appended to their names (e.g. control_c48_intel and control_c48_gnu).
  • STMP/PTMP baseline storage is now REGRESSION_TEST instead of REGRESSION_TEST_INTEL or REGRESSION_TEST_GNU.
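
The renaming can be sketched as two small helpers: one builds the new directory name, the other maps an old-layout baseline path to the new layout. The helper names are hypothetical, and lower-casing the old compiler directory is inferred from the control_c48_intel and control_c48_gnu examples above.

```shell
#!/bin/bash
# Sketch: new-style directory name, e.g. control_c48 + intel -> control_c48_intel.
run_dir_name() {
  local test_name=$1 rt_compiler=$2
  echo "${test_name}_${rt_compiler}"
}

# Sketch: map an old baseline path (<root>/<COMPILER>/<test>) to the new
# layout (<root>/<test>_<compiler>). Needs bash 4+ for ${var,,}.
new_baseline_path() {
  local old=$1
  local test_name=${old##*/}    # last path component
  local parent=${old%/*}        # path without the test name
  local compiler=${parent##*/}  # old compiler directory (INTEL or GNU)
  local root=${parent%/*}       # the develop-<date> directory
  echo "${root}/$(run_dir_name "${test_name}" "${compiler,,}")"
}

new_baseline_path "develop-2023XXXX/INTEL/control_c48"
# prints develop-2023XXXX/control_c48_intel
```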

opnReqTest updates:

  • Added the -a option to force users to specify the account to use on the HPC.
  • Tests that are not yet compatible with ORT tests such as dcmp, thrd, and fhzero now skip themselves automatically instead of failing the whole suite, so running ORT should be as easy as ./opnReqTest -e -n regional_control
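
The skip-instead-of-fail behaviour can be sketched as a loop that records an unsupported case and carries on with the rest. Everything in this sketch is hypothetical: the rule inside is_compatible, the function names, and the output format are placeholders, not the logic in opnReqTest.

```shell
#!/bin/bash
# Placeholder rule for illustration: pretend the thrd case is unsupported.
is_compatible() {
  local case_name=$1
  [[ $case_name != "thrd" ]]
}

# Run every requested case, skipping incompatible ones rather than aborting.
run_cases() {
  local case_name skipped=0
  for case_name in "$@"; do
    if ! is_compatible "$case_name"; then
      echo "SKIP ${case_name}"
      skipped=$((skipped + 1))
      continue                  # keep going with the remaining cases
    fi
    echo "RUN  ${case_name}"
  done
  echo "skipped=${skipped}"
}

run_cases dcmp thrd fhzero
```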

PR template update:

  • The PR template was updated to what is shown here. Removed the Intel/GNU sections from Testing Log.

What I have tested:

  • Full RT Suite on hera
  • ORT with regional_control
  • rt_weekly (./rt.sh -e -w -l rt_weekly.conf)
  • AutoRT on Hera/Cheyenne
  • Jenkins-CI
  • new -a option in rt.sh and opnReqTest

What I need to test:

  • all testing completed

Input data additions/changes

  • No changes are expected to input data.
  • Changes are expected to input data:
    • New input data.
    • Updated input data.

Anticipated changes to regression tests:

  • No changes are expected to any regression test.
  • Changes are expected to the following tests:

This should not change answers for any test. However, names are changing throughout the regression testing system, so all baselines will need to be regenerated with the new code and new baseline directories will need to be created.

Subcomponents involved:

  • AQM
  • CDEPS
  • CICE
  • CMEPS
  • CMakeModules
  • FV3
  • GOCART
  • HYCOM
  • MOM6
  • NOAHMP
  • WW3
  • stochastic_physics
  • none

Combined with PR's (If Applicable):

Commit Queue Checklist:

  • Link PR's from all sub-components involved in section below
  • Confirm reviews completed in ALL sub-component PR's
  • Add all appropriate labels to this PR.
  • Run full RT suite on either Hera/Cheyenne AND attach log to a PR comment.
  • Add list of any failed regression tests to "Anticipated changes to regression tests" section.

Linked PR's and Issues:

Testing Day Checklist:

  • This PR is up-to-date with the top of all sub-component repositories except for those sub-components which are the subject of this PR.
  • Move new/updated input data on RDHPCS Hera and propagate input data changes to all supported systems.

Testing Log (for CM's):

  • RDHPCS
    • Hera
    • Orion
    • Jet
    • Gaea
    • Cheyenne
  • WCOSS2
    • Dogwood/Cactus
    • Acorn
  • CI
    • Completed
  • opnReqTest
    • N/A
    • Log attached to comment

@BrianCurtis-NOAA BrianCurtis-NOAA mentioned this pull request Apr 20, 2023
36 tasks
@BrianCurtis-NOAA
Collaborator Author

BrianCurtis-NOAA commented Apr 20, 2023

Proving that this PR does not change baselines has to be done manually, because the baseline directory layout is changing. I used the following script to compare baselines:

#!/bin/bash
# Compare new-layout run directories (<test>_<compiler>) against the
# old-layout baselines (<COMPILER>/<test>). Requires bash 4+ for ${var^^}.

TESTDIR=/scratch1/NCEPDEV/stmp4/Brian.Curtis/FV3_RT/REGRESSION_TEST_GOOD/
TESTS=$(ls "${TESTDIR}")
BLDIR=/scratch1/NCEPDEV/nems/emc.nemspara/RT/NEMSfv3gfs/develop-20230413/

for i in $TESTS
do
  TEST=${i%_*}     # test name: everything before the last underscore
  CMPLR=${i##*_}   # compiler: everything after the last underscore
  CMPDIR=${BLDIR}${CMPLR^^}/${TEST}
  echo "Comparing files for ${CMPLR^^} test: ${TEST}"
  for j in $(ls "${CMPDIR}")
  do
    if [[ ${j} == "RESTART" ]]; then
      # RESTART is a directory: compare each file inside it
      for k in $(ls "${CMPDIR}/${j}")
      do
        # NetCDF files (extension starting with "nc") go through nccmp
        if [[ ${k##*.} == nc* ]]; then
          echo "--> USING NCCMP ON FILE: ${j}/${k}"
          nccmp -d -f -g -B --Attribute=checksum --warn=format "${TESTDIR}/${i}/${j}/${k}" "${CMPDIR}/${j}/${k}" || echo -e "\n"
        else
          echo "--> USING CMP ON FILE: ${j}/${k}"
          cmp "${TESTDIR}/${i}/${j}/${k}" "${CMPDIR}/${j}/${k}" || echo -e "\n"
        fi
      done
    elif [[ ${j##*.} == nc* ]]; then
      echo "--> USING NCCMP ON FILE: ${j}"
      nccmp -d -f -g -B --Attribute=checksum --warn=format "${TESTDIR}/${i}/${j}" "${CMPDIR}/${j}" || echo -e "\n"
    else
      echo "--> USING CMP ON FILE: ${j}"
      cmp "${TESTDIR}/${i}/${j}" "${CMPDIR}/${j}" || echo -e "\n"
    fi
  done
done

And here are its results:
comparisons.txt
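
The comparison script above chooses between nccmp and cmp from the file extension. A minimal sketch of that dispatch is shown below; compare_tool and the sample file names are hypothetical. One detail worth noting: inside [[ ]], the =~ operator treats nc* as a regular expression ("n" followed by zero or more "c"), which matches any extension containing an "n", so a glob comparison with == is the safer way to express "starts with nc".

```shell
#!/bin/bash
# Sketch: pick the comparison tool from the file extension.
compare_tool() {
  local file=$1
  local ext=${file##*.}          # text after the last dot
  if [[ $ext == nc* ]]; then     # glob match: extension starts with "nc"
    echo "nccmp"
  else
    echo "cmp"
  fi
}

compare_tool sfcf000.nc     # prints nccmp
compare_tool input.nml      # prints cmp
```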

@BrianCurtis-NOAA BrianCurtis-NOAA added the Baseline Updates Current baselines will be updated. label Apr 20, 2023
@FernandoAndrade-NOAA
Collaborator

@BrianCurtis-NOAA I'm going to get started with jenkins-ci on this PR

@FernandoAndrade-NOAA FernandoAndrade-NOAA added the jenkins-ci Jenkins CI: ORT build/test on docker container label Apr 20, 2023
@FernandoAndrade-NOAA
Collaborator

Attached are the jenkins ci logs; the tests failed and we're looking into this.
ufs-weather-model » ort-docker-pipeline » PR-1718 #1 Console [Jenkins].pdf

@BrianCurtis-NOAA
Collaborator Author

@FernandoAndrade-NOAA

  • The BL_DIR was (develop-YYYYMMDD/<compiler>/<test_names>) and is now just (develop-YYYYMMDD/<test_names>).
  • Test dirs were <test_name>/files and are now <test_name>_${RT_COMPILER}/files, where ${RT_COMPILER} is either intel or gnu (you probably don't need to worry about GNU, since it's unsupported for WCOSS2 operations).

If you need me to look through what you're doing, I'd be happy to.

@zach1221
Collaborator

zach1221 commented Apr 21, 2023

I'm trying to test ORT manually against this PR on Hera and am receiving an unbound variable error for "TEST_NAME". I get this error with both Intel and GNU.
(screenshot of the error)

Edit: I should add that this occurs on std_base.
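
For context, "unbound variable" is what bash reports when a script running under set -u expands a variable that was never set. The sketch below reproduces the message in a child shell and shows a defensive default. It illustrates the failure mode in general and is not the fix applied in this PR; the fallback value unknown_test is a placeholder.

```shell
#!/bin/bash
# Reproduce: under `set -u`, expanding an unset variable aborts the shell.
msg=$(bash -c 'set -u; unset TEST_NAME; echo "${TEST_NAME}"' 2>&1) || true
echo "${msg}"

# Guard: ${VAR:-default} expands safely even when VAR is unset.
safe=$(bash -c 'set -u; unset TEST_NAME; echo "${TEST_NAME:-unknown_test}"')
echo "${safe}"              # prints unknown_test
```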

@zach1221
Collaborator

@FernandoAndrade-NOAA ORTs passed when I ran them manually against this PR, following Brian's changes. I think you can try jenkins-ci again, when you're able.

@zach1221
Collaborator

@BrianCurtis-NOAA Update on Orion: all the RTs are complete except for a failure in the regional_atmaq_faster case; the rest are good. However, Orion appears to be down currently, so I'm unable to troubleshoot the failed case at the moment. The baselines were created and copied over fine on Orion, so it may be a good idea to skip the logs for this HPC if Orion remains down for the rest of the day. @jkbk2004 FYI.

@jkbk2004
Collaborator

@zach1221 Sounds like an unexpected Orion network issue. All Orion baselines were created OK, right? I agree we can move on to merge unless Orion is back online in the next few hours.

@zach1221
Collaborator

Yes, the baselines created fine. I'm not certain why that last regional_atmaq case failed in the RT/matching. As soon as I started to dig through the logs I was kicked out.

@BrianCurtis-NOAA
Collaborator Author

I think it's justified to skip Orion, but I recommend creating an issue to check why once Orion is back, and addressing anything found in that issue.

@jkbk2004 jkbk2004 self-requested a review May 31, 2023 15:24
@zach1221
Collaborator

@BrianCurtis-NOAA I agree. I'll get the issue created now, so that when Orion is back up I can ensure everything is tracked and cleaned up.

jkbk2004
jkbk2004 previously approved these changes May 31, 2023
@zach1221
Collaborator

Ok, we have approvals. I'm proceeding with merging.

@zach1221
Collaborator

Looks like Orion just came back up. Checking that last atmaq test.

@zach1221
Collaborator

regional_atmaq_faster passed on Orion after extending the wallclock, so it looks like it was just timing out. I'll push the Orion RT logs and then merge.

@jkbk2004 jkbk2004 requested review from FernandoAndrade-NOAA and removed request for FernandoAndrade-NOAA May 31, 2023 19:10
@zach1221 zach1221 merged commit 5d47ea8 into ufs-community:develop May 31, 2023