Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[develop] Update Jenkinsfile to run post-build functional tests #847

Merged
merged 18 commits into from
Jul 21, 2023

Conversation

BruceKropp-Raytheon
Copy link
Contributor

DESCRIPTION OF CHANGES:

This is a new CI gate to ensure that code changes still allow E2E tests to run on supported platforms.
It uses the existing CI functional test script, .cicd/scripts/srw_ftest.sh , to setup and try the first few workflow tasks.
If these succeed, we can be confident that E2E tests can still be launched as usual.

Type of change

  • New feature (non-breaking change which adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to not work as expected)

TESTS CONDUCTED:

The CI script, srw_ftest.sh, was run on each of the Teir1 on-prem platforms, and succeeds on those that ALREADY succeed in running E2E tests via CI. Namely Cheyenne and Jet. Other platforms have open issues preventing E2E tests from running to completion vi CI.

  • hera.intel
  • orion.intel
  • cheyenne.intel
  • cheyenne.gnu
  • gaea.intel
  • jet.intel
  • wcoss2.intel
  • NOAA Cloud (indicate which platform)
  • Jenkins
  • fundamental test suite
  • comprehensive tests (specify which if a subset was used)

DEPENDENCIES:

DOCUMENTATION:

ISSUE:

EPIC PLATFORM-635

CHECKLIST

  • My code follows the style guidelines in the Contributor's Guide
  • I have performed a self-review of my own code using the Code Reviewer's Guide
  • I have commented my code, particularly in hard-to-understand areas
  • My changes need updates to the documentation. I have made corresponding changes to the documentation
  • My changes do not require updates to the documentation (It is documented in the Jenkinsfile itself).
  • My changes generate no new warnings
  • New and existing tests pass with my changes
  • Any dependent changes have been merged and published

LABELS (optional):

A Code Manager needs to add the following labels to this PR:

  • Work In Progress
  • bug
  • enhancement
  • documentation
  • release
  • high priority
  • run_ci
  • run_we2e_fundamental_tests
  • run_we2e_comprehensive_tests
  • Needs Cheyenne test
  • Needs Jet test
  • Needs Hera test
  • Needs Orion test
  • help wanted

CONTRIBUTORS (optional):

@MichaelLueken
Copy link
Collaborator

MichaelLueken commented Jun 30, 2023

@BruceKropp-Raytheon A quick question regarding the Needs Cheyenne Test and Needs Jet Test labels that you requested - since you noted in the TESTS CONDUCTED section that the .cicd/scripts/srw_ftest.sh script was ran on Tier-1 platforms, is this more a request to add the run_we2e_coverage_tests label and ensure that the modifications to the Jenkinsfile allows the new Functional WorkflowTaskTests automated testing to also pass on Cheyenne, Hera, and Jet? Thanks!

@MichaelLueken MichaelLueken added run_we2e_coverage_tests Run the coverage set of SRW end-to-end tests run_we2e_jenkins_coverage_tests SRW App automated CI testing with modified Jenkinsfile labels Jun 30, 2023
@MichaelLueken
Copy link
Collaborator

@BruceKropp-Raytheon I had to kill the CI tests because they weren't using your updated Jenkinsfile. I have created a temporary sandbox for testing this PR and the progress can be seen here:

https://jenkins.epic.oarcloud.noaa.gov/blue/organizations/jenkins/sandbox%2FSRW_App_Jenkinsfile_test/detail/PR-847/1/pipeline/

I'll let you know if I encounter any failures with the new Functional WorkflowTaskTests.

Copy link
Collaborator

@MichaelLueken MichaelLueken left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@BruceKropp-Raytheon I'm finding that the new Functional WorkflowTaskTests are failing with the following message:

+ bash --login 'EXPTDIR=/glade/scratch/epicufsrt/jenkins/workspace/_SRW_App_Jenkinsfile_test_PR-847/expt_dirs/test_community /glade/scratch/epicufsrt/jenkins/workspace/_SRW_App_Jenkinsfile_test_PR-847/.cicd/scripts/srw_ftest.sh'

bash: EXPTDIR=/glade/scratch/epicufsrt/jenkins/workspace/_SRW_App_Jenkinsfile_test_PR-847/expt_dirs/test_community /glade/scratch/epicufsrt/jenkins/workspace/_SRW_App_Jenkinsfile_test_PR-847/.cicd/scripts/srw_ftest.sh: No such file or directory

script returned exit code 127

I think that you will need to remove the EXPTDIR entry from the Jenkinsfile. This is apparently causing issues and this variable is already being set in srw_ftest.sh itself.

.cicd/Jenkinsfile Outdated Show resolved Hide resolved
@MichaelLueken
Copy link
Collaborator

@BruceKropp-Raytheon The new Functional WorkflowTaskTests have passed on Cheyenne Intel and Cheyenne GNU. Will continue to monitor Hera Intel, Hera GNU, and Jet results.

@MichaelLueken
Copy link
Collaborator

@BruceKropp-Raytheon The new srw_ftest.sh script has been successfully tested on Jet and Orion via Jenkins. I've resubmitted the Jenkins tests on Hera. Once complete, I will approve these changes.

Copy link
Collaborator

@MichaelLueken MichaelLueken left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@BruceKropp-Raytheon The srw_ftest.sh successfully completes on Hera, Jet, and Orion (though the test is skipped on Hera due to issues with ulimit -s unlimited) using the Jenkins pipeline.

Approving this PR now.

@natalie-perlin
Copy link
Collaborator

@BruceKropp-Raytheon @MichaelLueken
setting "ulimit -S -s unlimited" on Hera could also do the trick, increasing the stack to the max (soft) limit. Could this be coded in to be a solution for Hera?
ulimitSsHera

@BruceKropp-Raytheon
Copy link
Contributor Author

@natalie-perlin I think that would work.
It requires a change to ./ush/machine/Hera.yaml (line 26):

from: PRE_TASK_CMDS: '{ ulimit -s unlimited; ulimit -a; }'

to: PRE_TASK_CMDS: '{ ulimit -S -s unlimited; ulimit -a; }'

@MichaelLueken
Copy link
Collaborator

@BruceKropp-Raytheon @natalie-perlin The Functional WorkflowTaskTests are still failing on Hera due to using ulimit to attempt to modify the stack size:

/scratch1/NCEPDEV/stmp2/role.epic/jenkins/workspace/_SRW_App_Jenkinsfile_test_PR-847/scripts/exregional_make_grid.sh: line 63: ulimit: stack size: cannot modify limit: Operation not permitted

For whatever reason, it looks like Hera tests will fail if you attempt to change the stack size via ulimit outside of the WE2E tests.

@BruceKropp-Raytheon
Copy link
Contributor Author

okay. So lets revert to the previous, cancel this last commit?

@MichaelLueken
Copy link
Collaborator

@BruceKropp-Raytheon Sure, if you use the following command, you will undo the last commit:

git reset --hard HEAD^1

Please note that Git might not allow you to push the update (it will complain about your local version being behind what is in the repository and to update it using git pull). If this happens, then you will need to add the --force flag to the push command.

@natalie-perlin
Copy link
Collaborator

natalie-perlin commented Jul 12, 2023

Successfully tested after the following changes to the ./.cicd/scripts/srw_ftest.sh (lines 128-130)

# Limit to machines that are fully ready
deny_machines=( gaea )
sed "s|ulimit -s unlimited;|ulimit -S -s unlimited;|" -i ${workspace}/ush/machine/hera.yaml

and the following in ./scripts/exregional_make_grid.sh:

echo "in exregional_make_grid.sh: PRE_TASK_CMDS=$PRE_TASK_CMDS"
#eval ${PRE_TASK_CMDS}
eval $PRE_TASK_CMDS

By some reason, removing curly brackets used with eval command made a difference.

The log file for make_grid task, run_make_grid-log.txt, showed the following:

This is the ex-script for the task that generates grid files.
========================================================================
in exregional_make_grid.sh: PRE_TASK_CMDS={ ulimit -S -s unlimited; ulimit -a; }
core file size          (blocks, -c) 0
data seg size           (kbytes, -d) unlimited
scheduling priority             (-e) 0
file size               (blocks, -f) unlimited
pending signals                 (-i) 766586
max locked memory       (kbytes, -l) unlimited
max memory size         (kbytes, -m) unlimited
open files                      (-n) 6400
pipe size            (512 bytes, -p) 8
POSIX message queues     (bytes, -q) 819200
real-time priority              (-r) 0
stack size              (kbytes, -s) 12000000
cpu time               (seconds, -t) 6000
max user processes              (-u) 512
virtual memory          (kbytes, -v) unlimited
file locks                      (-x) 64

  All executables will be submitted with command \'time\'.
FATAL ERROR:
ERROR:
  From script:  "exregional_make_grid.sh"
  Full path to script:  "/scratch1/NCEPDEV/stmp2/Natalie.Perlin/SRW/srw-dev-pr847/scripts/exregional_make_grid.sh"
The executable (exec_fp) for generating the grid file does not exist:
  exec_fp = "/scratch1/NCEPDEV/stmp2/Natalie.Perlin/SRW/srw-dev-pr847/install_intel/exec/regional_esg_grid"
Please ensure that you've built this executable.
Exiting with nonzero status.
End exregional_make_grid.sh at Wed Jul 12 19:25:10 UTC 2023 with error code 1 (time elapsed: 00:00:00)
FATAL ERROR:
ERROR:
  From script:  "JREGIONAL_MAKE_GRID"
  Full path to script:  "/scratch1/NCEPDEV/stmp2/Natalie.Perlin/SRW/srw-dev-pr847/jobs/JREGIONAL_MAKE_GRID"
Call to ex-script corresponding to J-job "JREGIONAL_MAKE_GRID" failed.
Exiting with nonzero status.
End JREGIONAL_MAKE_GRID at Wed Jul 12 19:25:10 UTC 2023 with error code 1 (time elapsed: 00:00:00)

In other words, the script successfully changed the stack size to max allowed (soft) limit. Execution stopped only because it was looking for a binary executable under ./install_intel/exec/, whereas in my build it was located under ./exec/

@MichaelLueken
Copy link
Collaborator

@natalie-perlin @BruceKropp-Raytheon

Removing the {} from PRE_TASK_CMDS, in exregional_make_grid.sh, allowed the modified ulimit -S -s unlimited to run on Hera. The get_ics and get_lbcs tasks succeeded, due to no eval ${PRE_TASK_CMDS} line in exregional_get_extrn_mdl_files.sh. Interestingly, even the make_orog task succeeded. This was surprising because eval ${PRE_TASK_CMDS} is also used in exregional_make_orog.sh, but it didn't complain about changing the stack size for this task. Do you have any ideas why this would cause a failure in make_grid, but not make_orog?

I'd like to include this test for all machines, but I'd like to know why there is an issue only with the make_grid task and only on Hera before removing Hera from the deny_machines list.


# Limit to machines that are fully ready
deny_machines=( gaea hera )
#sed "s|ulimit -s unlimited;|ulimit -S -s unlimited;|" -i ${workspace}/ush/machine/hera.yaml
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Change lines 129-130 to include hera:

deny_machines=( gaea )
sed "s|ulimit -s unlimited;|ulimit -S -s unlimited;|" -i ${workspace}/ush/machine/hera.yaml

and line 63 in ./scripts/exregional_make_grid.sh to the following (remove curly brackets):

eval $PRE_TASK_CMDS

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@natalie-perlin and @BruceKropp-Raytheon -

I've found another way to make the srw_ftest.sh script run on Hera that doesn't require removing the curly brackets from ${PRE_TASK_CMDS} in scripts/exregional_make_grid.sh.

When I moved line 130:

sed "s|ulimit -s unlimited;|ulimit -S -s unlimited;|" -i ${workspace}/ush/machine/hera.yaml

to line 89 (this line can be moved to any line before the .generate_FV3LAM_wflow.py line), the wrapper script successfully ran on Hera.

@@ -86,6 +86,8 @@ set -e -u
 
 export PYTHONPATH=${workspace}/ush/python_utils/workflow-tools:${workspace}/ush/python_utils/workflow-tools/src
 
+sed "s|ulimit -s unlimited;|ulimit -S -s unlimited;|" -i ${workspace}/ush/machine/hera.yaml
+
 cd ${workspace}/ush
         # Consistency check ...
         #./config_utils.py -c ./config.yaml -v ./config_defaults.yaml -k "(\!rocoto\b)"

Changes not staged for commit:
  (use "git add <file>..." to update what will be committed)
  (use "git checkout -- <file>..." to discard changes in working directory)

	modified:   .cicd/scripts/srw_ftest.sh

Untracked files:
  (use "git add <file>..." to include in what will be committed)

	build_intel/
	tests/build_test244350.out

no changes added to commit (use "git add" and/or "git commit -a")

# Try hera with the first few simple SRW tasks ...
./run_make_grid.sh ... COMPLETE
./run_get_ics.sh ... COMPLETE
./run_get_lbcs.sh ... COMPLETE
./run_make_orog.sh ... COMPLETE

I think that this modification would be better than removing the curly brackets in an ex-script. If you could try making this modification and double-checking that it will run manually, I would be greatly appreciative. Thanks!

@MichaelLueken
Copy link
Collaborator

@BruceKropp-Raytheon @natalie-perlin The srw_ftest.sh script is failing on Hera with the following error message:

TypeError: expected str, bytes or os.PathLike object, not Namespace

This is seen both in the pipeline and while manually testing the script. The traceback for the error is:

Traceback (most recent call last):
  File "/scratch2/NAGAPE/epic/Michael.Lueken/PR_847/ush/./generate_FV3LAM_wflow.py", line 852, in <module>
    expt_dir = generate_FV3LAM_wflow(USHdir, wflow_logfile)
  File "/scratch2/NAGAPE/epic/Michael.Lueken/PR_847/ush/./generate_FV3LAM_wflow.py", line 120, in generate_FV3LAM_wflow
    set_template(args)
  File "/scratch2/NAGAPE/epic/Michael.Lueken/PR_847/ush/python_utils/workflow-tools/scripts/templater.py", line 108, in set_template
    log = cli_helpers.setup_logging(user_args, log_name=name)
  File "/scratch1/NCEPDEV/stmp2/role.epic/jenkins/workspace/workflow-tools_pipeline_develop/src/uwtools/utils/cli_helpers.py", line 61, in setup_logging
    log = Logger(
  File "/scratch1/NCEPDEV/stmp2/role.epic/jenkins/workspace/workflow-tools_pipeline_develop/src/uwtools/logger.py", line 94, in __init__
    self.get_file_handler(log_file, level=self.level, fmt=self.fmt)
  File "/scratch1/NCEPDEV/stmp2/role.epic/jenkins/workspace/workflow-tools_pipeline_develop/src/uwtools/logger.py", line 128, in get_file_handler
    logfile_path = Path(logfile_path)
  File "/scratch1/NCEPDEV/nems/role.epic/miniconda3/4.12.0/envs/workflow_tools/lib/python3.9/pathlib.py", line 1082, in __new__
    self = cls._from_parts(args, init=False)
  File "/scratch1/NCEPDEV/nems/role.epic/miniconda3/4.12.0/envs/workflow_tools/lib/python3.9/pathlib.py", line 707, in _from_parts
    drv, root, parts = self._parse_args(args)
  File "/scratch1/NCEPDEV/nems/role.epic/miniconda3/4.12.0/envs/workflow_tools/lib/python3.9/pathlib.py", line 691, in _parse_args
    a = os.fspath(a)

It should be noted that I'm able to make the srw_test.sh test works without issue, which needs to run ./generate_FV3LAM_wflow.py to actually generate and kick off the workflow. Additionally, the script still works without issue on Orion, but failing with the exact same failure on Jet. Did something happen on Hera and Jet on Friday that would adversely affect workflow generation?

It's not currently clear to me what is causing the issue. I'll work some on this tomorrow and hopefully figure something else (unless you are able to figure out the issue sooner).

@MichaelLueken
Copy link
Collaborator

@BruceKropp-Raytheon @natalie-perlin -

Looking more closely at the traceback, I'm seeing something that I don't understand - The ush/generate_FV3LAM_wflow.py script is going into ush/python_utils/workflow-tools/scripts/templater.py. From this point, on both Hera and Jet, instead of going into ush/python_utils/workflow-tools/src/uwtools/utils/cli_helpers.py, it is going into jenkins/workspace/workflow-tools_pipeline_develop/src/uwtools/utils/cli_helpers.py. Interestingly, the last merge to the workflow-tools develop branch marks the last time that the .cicd/scripts/srw_ftest.sh script successfully ran.

@christinaholtNOAA Is this the right behavior, or should the work stay in ush/python_utils/workflow-tools/src?

I'll do some more digging to see why the .cicd/scripts/srw_ftest.sh script is changing to the Jenkins workspace and not staying in the local copy.

@BruceKropp-Raytheon
Copy link
Contributor Author

@MichaelLueken I have gone back and rerun manually from a previous commit that we believed was working correctly last week (July 12), 4bcfa8e . This commit builds/tests fine on Orion, but is now failing on Jet, as you have pointed out above. It seems unrelated to the latest changes in this PR, so perhaps there was an environment setting change on Jet (and Hera)?

@MichaelLueken
Copy link
Collaborator

@BruceKropp-Raytheon -

You are correct, I don't think that this PR is causing the issue, at least directly. The issue is that the srw_ftest.sh script is starting with the ush/python_utils/workflow-tools/scripts/templater.py, but then jumps to a newer version of workflow-tools in the Jenkins workspace. The newer version no longer works with the older templater script, leading to the failure we are seeing now. Once we figure out how to keep workflow generation out of the Jenkins workspace location, this PR will once again run correctly. I'm just surprised that the WE2E coverage tests are correctly running on both Hera and Jet. However the srw_test.sh script is doing this, we will need to mimic in srw_ftest.sh so that this issue is nipped.

@BruceKropp-Raytheon
Copy link
Contributor Author

okay. We currently have PYTHONPATH defined in the script. Does this need adjusting?

conda activate workflow_tools
set -e -u

export PYTHONPATH=${workspace}/ush/python_utils/workflow-tools

cd ${workspace}/ush
        # Consistency check ...
        #./config_utils.py -c ./config.yaml -v ./config_defaults.yaml -k "(\!rocoto\b)"
        # Generate workflow files ...
        ./generate_FV3LAM_wflow.py
cd ${workspace}

@MichaelLueken
Copy link
Collaborator

The PYTHONPATH entry needs to point to that path before calling ush/generate_FV3LAM_wflow.py, otherwise the script will fail to find scripts/templater.py in ush/python_utils/workflow-tools. The issue is, after finding scripts/templater.py, the PYTHONPATH needs to be updated to point to ush/python_utils/workflow-tools/src. In srw_ftest.sh, you are loading wflow_${machine}, which is loading a special modulefile, set_pythonpath. This modulefile should already load the necessary PYTHONPATH entries, but this is obviously not happening on Hera and Jet.

@BruceKropp-Raytheon
Copy link
Contributor Author

Understood. The motivation for this PR was to exercise the User Documentation described in 5.4.2 Running the Workflow Using Stand-Alone Scripts . With the setup instructions coming from 5.3 Generate the Forecast Experiment . Obviously we have had to make some adjustments in the script to produce success for this PR. Ultimately the User documentation may need adjusting to reflect the correct way to perform these tasks on any supported tier1 platform.

Copy link
Collaborator

@MichaelLueken MichaelLueken left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@BruceKropp-Raytheon Appending the following path to the PYTHONPATH entry will allow the srw_ftest.sh to run on Jet:

${workspace}/ush/python_utils/workflow-tools/src

I will be able to resubmit the Jet and Orion Jenkins tests after this change has been made to ensure that the pipeline is looking good, then I can run the Hera tests tomorrow once the machine returns.

.cicd/scripts/srw_ftest.sh Outdated Show resolved Hide resolved
@MichaelLueken
Copy link
Collaborator

@BruceKropp-Raytheon @natalie-perlin The Functional WorkflowTaskTests have successfully passed on both Hera Intel and Hera GNU in the Jenkins CI pipeline! I think we should be good to go now.

@MichaelLueken
Copy link
Collaborator

Automated Jenkins tests have completed for Hera. All tests on Hera GNU successfully passed. For Hera Intel, grid_RRFS_CONUS_25km_ics_FV3GFS_lbcs_FV3GFS_suite_GFS_v15p2 and grid_RRFS_CONUS_3km_ics_FV3GFS_lbcs_FV3GFS_suite_GFS_v15p2 tests failed. A rerun shows that all tests passed:

----------------------------------------------------------------------------------------------------
Experiment name                                                  | Status    | Core hours used 
----------------------------------------------------------------------------------------------------
get_from_HPSS_ics_FV3GFS_lbcs_FV3GFS_fmt_grib2_2019061200          COMPLETE               6.42
get_from_HPSS_ics_GDAS_lbcs_GDAS_fmt_netcdf_2022040400_ensemble_2  COMPLETE             790.55
get_from_HPSS_ics_HRRR_lbcs_RAP                                    COMPLETE              15.04
grid_RRFS_CONUS_25km_ics_FV3GFS_lbcs_FV3GFS_suite_GFS_2017_gfdlmp  COMPLETE              11.73
grid_RRFS_CONUS_25km_ics_FV3GFS_lbcs_FV3GFS_suite_GFS_v15p2        COMPLETE              10.22
grid_RRFS_CONUS_25km_ics_FV3GFS_lbcs_FV3GFS_suite_GFS_v16          COMPLETE              15.80
grid_RRFS_CONUS_25km_ics_FV3GFS_lbcs_RAP_suite_RAP                 COMPLETE              10.77
grid_RRFS_CONUS_25km_ics_GSMGFS_lbcs_GSMGFS_suite_GFS_v15p2        COMPLETE               7.57
grid_RRFS_CONUS_3km_ics_FV3GFS_lbcs_FV3GFS_suite_GFS_v15p2         COMPLETE             255.45
grid_RRFS_CONUS_3km_ics_FV3GFS_lbcs_FV3GFS_suite_GFS_v16           COMPLETE             311.94
grid_RRFS_CONUScompact_3km_ics_HRRR_lbcs_RAP_suite_HRRR            COMPLETE             336.43
pregen_grid_orog_sfc_climo                                         COMPLETE              10.29
----------------------------------------------------------------------------------------------------
Total                                                              COMPLETE            1782.21

The automated Jenkins tests have been submitted on Jet. Once again, the Functional WorkflowTaskTests have successfully passed on the machine. Once the tests complete and @natalie-perlin has given her approval, this work can be merged.

Copy link
Collaborator

@natalie-perlin natalie-perlin left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Excellent news! Thank you for testing the recent changes. Approving now

@MichaelLueken MichaelLueken merged commit e565518 into ufs-community:develop Jul 21, 2023
4 checks passed
@BruceKropp-Raytheon BruceKropp-Raytheon deleted the feature/add_ftest_gate branch September 19, 2023 17:09
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or request run_we2e_coverage_tests Run the coverage set of SRW end-to-end tests run_we2e_jenkins_coverage_tests SRW App automated CI testing with modified Jenkinsfile Tested on Cheyenne Successfully tested on Cheyenne machine
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants