Restore error checking in regression test system. #2335

Open
wants to merge 10 commits into
base: develop

Conversation

SamuelTrahanNOAA
Collaborator

@SamuelTrahanNOAA SamuelTrahanNOAA commented Jun 21, 2024

Commit Queue Requirements:

  • Fill out all sections of this template.
  • N/A No subcomponents. All sub component pull requests have been reviewed by their code managers.
  • See description. Run the full Intel+GNU RT suite (compared to current baselines) on either Hera/Derecho/Hercules
  • See description. Commit 'test_changes.list' from previous step

Description:

The regression test system ignores all errors in all jobs. A job where fv3.exe crashed is treated the same as a job where it ran and produced different results. Compilation job errors are ignored as well. This leads to several problems:

  1. Test jobs are executed even if the compile job fails.
  2. Jobs with prerequisites (such as restart tests) run even if their prerequisite fails.
  3. A job that failed to copy input data or had syntax errors won't be caught until the entire workflow completes.
  4. Temporary system issues require rerunning the entire workflow instead of only affected jobs.

In this new version of the regression test system:

  1. Errors are caught and cause the metascheduler to mark the job as failed.
  2. Dependencies are honored; if a job fails, anything that depends on it won't run.
  3. Jobs that run to completion but have changed results are considered to have succeeded. This behavior is unchanged.
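
The three rules above can be sketched in a few lines of Python. This is only an illustration of the intended behavior, not the code in this PR: the job names, the `run_workflow` function, and the `exit_codes` stand-in for launching real jobs are all hypothetical, and the actual bookkeeping is done by the metascheduler.

```python
from enum import Enum


class Status(Enum):
    SUCCEEDED = "succeeded"  # ran to completion (results may differ from baseline)
    FAILED = "failed"        # the job itself reported an error
    SKIPPED = "skipped"      # never ran because a prerequisite did not succeed


def run_workflow(jobs, deps, exit_codes):
    """Return a Status for every job.

    jobs       -- job names, ordered so that prerequisites come first (assumption)
    deps       -- dict mapping a job to the list of jobs it depends on
    exit_codes -- dict mapping a job to the exit code it would return if run;
                  a hypothetical stand-in for actually launching the job
    """
    status = {}
    for job in jobs:
        prereqs = deps.get(job, [])
        if any(status[p] is not Status.SUCCEEDED for p in prereqs):
            # Rule 2: a failed or skipped prerequisite blocks this job.
            status[job] = Status.SKIPPED
        elif exit_codes.get(job, 0) != 0:
            # Rule 1: a non-zero exit code marks the job as failed.
            status[job] = Status.FAILED
        else:
            # Rule 3: completion counts as success, even if results changed.
            status[job] = Status.SUCCEEDED
    return status


# Hypothetical workflow: compile_b fails, so its tests never run.
jobs = ["compile_a", "test_a", "test_a_restart",
        "compile_b", "test_b", "test_b_restart"]
deps = {
    "test_a": ["compile_a"],
    "test_a_restart": ["test_a"],
    "test_b": ["compile_b"],
    "test_b_restart": ["test_b"],
}
exit_codes = {"compile_b": 1}

result = run_workflow(jobs, deps, exit_codes)
for job in jobs:
    print(job, result[job].value)
```

In this sketch a skipped job also blocks its own dependents, so a single compile failure prevents the whole chain behind it from running while unrelated jobs proceed normally.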

Temporary Changes to Make Some Tests Fail

To test this feature, I've modified a few jobs so they break or have changed results. This should be reverted before merging to develop:

  1. rrfs_v1beta_failing - This new test will always fail at runtime.
  2. compile_atm_faster_dyn32_intel - Removed the -DFASTER=ON flag. Its tests will succeed, but their results will change. The test_changes.list will contain control_wam intel and control_wam_debug intel.
  3. compile_hafsw_intel - Will always fail at runtime due to the new --invalid-argument argument. Tests that require this compilation will never run.

IMPORTANT: These changes are marked with # FIXME and should be reverted before merging to develop.

Commit Message:

* UFSWM - restore error checking to regression test system

Priority:

  • Normal

Git Tracking

UFSWM:

Sub component Pull Requests:

N/A

UFSWM Blocking Dependencies:

N/A


Changes

Regression Test Changes (Please commit test_changes.list):

  • No Baseline Changes.

Some deliberate errors must be removed before this PR will run to completion without changing baselines. They are marked with # FIXME.

Input data Changes:

  • None.

Library Changes/Upgrades:

  • No Updates

Testing Log:

  • RDHPCS
    • Hera
    • Orion
    • Hercules
    • Jet
    • Gaea
    • Derecho
  • WCOSS2
    • Dogwood/Cactus
    • Acorn
  • CI
  • opnReqTest (complete task if unnecessary)

@SamuelTrahanNOAA
Collaborator Author

Pinging @DeniseWorthen, who authored the relevant issue, and @DusanJovic-NOAA, who authored the original regression test system.

@SamuelTrahanNOAA
Collaborator Author

SamuelTrahanNOAA commented Jun 21, 2024

I've been testing this on top of #2326. The regression test system has proven completely unusable for development due to the lack of error checking. Updating UPP and modulefiles required many changes that caused subsets of the tests to fail. A regression test system that cannot differentiate between a test with changed results and a test that could not run at all is not useful for development.

@jkbk2004
Collaborator

@SamuelTrahanNOAA Can you follow up to clean up the super-linter complaint?

@SamuelTrahanNOAA
Collaborator Author

After updating this branch, I'm getting out-of-memory errors from some jobs on Hera when using Rocoto.

@SamuelTrahanNOAA
Collaborator Author

The jobs succeeded on the second attempt. This may have been a temporary system issue.

Development

Successfully merging this pull request may close these issues.

modify report of test failures to clearly indicate when a test failed to compare because it did not run
2 participants