Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add NLDAS-2 netCDF-4 LIS metforcing reader #1578

Merged
merged 12 commits into from
Jul 25, 2024

Conversation

dmocko
Copy link
Contributor

@dmocko dmocko commented Jul 16, 2024

This pull request adds the NLDAS-2 v020 netCDF-4 LIS metforcing reader into the LIS code.

As the v002 GRIB-1 NLDAS-2 data and the v020 netCDF-4 NLDAS-2 data are nearly identical (within round-off error) - the two metforcing readers in LIS also produce nearly identical results.

Note that as part of this pull request, the "Met forcing sources:" option for the NLDAS-2 v002 GRIB-1 data has changed from "NLDAS2" to "NLDAS2 grib". To use the new v020 netCDF-4 metforcing reader, please use "NLDAS2 netcdf" for "Met forcing sources:".

Resolves: #1577

This pull request adds the NLDAS-2 v020 netCDF-4 LIS metforcing
reader into the LIS code.

As the v002 GRIB-1 NLDAS-2 data and the v020 netCDF-4 NLDAS-2 data
are nearly identical (within round-off error) - the two metforcing
readers in LIS also produce nearly identical results.

Note that as part of this pull request, the "Met forcing sources:"
option for the NLDAS-2 v002 GRIB-1 data has changed from "NLDAS2"
to "NLDAS2 grib".  To use the new v020 netCDF-4 metforcing reader,
please use "NLDAS2 netcdf" for "Met forcing sources:".

Resolves: NASA-LIS#1577
@dmocko dmocko added enhancement New feature or request HighPriority Documentation Update to documentation (User's Guide or within the code) metforcing labels Jul 16, 2024
@dmocko
Copy link
Contributor Author

dmocko commented Jul 16, 2024

Testcase directory forthcoming.

v020 netCDF-4 FORA data: https://doi.org/10.5067/THUF4J1RLSYG

v020 netCDF-4 FORB data: https://doi.org/10.5067/96S0R3LFOBTU

@dmocko
Copy link
Contributor Author

dmocko commented Jul 16, 2024

Testcase available here: /discover/nobackup/projects/lis/tools/pull_requests/1578

@jvgeiger jvgeiger self-assigned this Jul 18, 2024
@dmocko dmocko requested review from jvgeiger and emkemp July 18, 2024 16:51
@jvgeiger
Copy link
Contributor

I am reviewing it.

@jvgeiger
Copy link
Contributor

Hello David. Would you please rework the "REVISION HISTORY" of the new files?

For example, in timeinterp_nldas20.F90, from

! !REVISION HISTORY:
! 02 Feb 2004: Sujay Kumar;  Initial Specification
! 24 Aug 2007: Chuck Alonge; Modified for use with NLDAS-2 data
! 14 Mar 2014: David Mocko:  Added CAPE and PET forcing from NLDAS-2
! 11 Jul 2024: David Mocko,  Modified for use with netCDF-4 format

to

! !REVISION HISTORY:
! 11 Jul 2024: David Mocko,  Initial Specification

or to

! !REVISION HISTORY:
! 11 Jul 2024: David Mocko,  Initial Specification (derived from timeinterp_nldas2.F90)

Thanks.

@dmocko
Copy link
Contributor Author

dmocko commented Jul 19, 2024

Will do

Hello David. Would you please rework the "REVISION HISTORY" of the new files?

Will do.

dmocko and others added 5 commits July 19, 2024 11:55
These are not strictly speaking required, but it removes warning messages
from ifort when strict checking is turned on.
This is to address warning messages from ifort when run with strict checks.
@emkemp
Copy link
Contributor

emkemp commented Jul 22, 2024

Hi @dmocko:

I'm trying to run your code, but I'm confused by the multiple config files. (What is FORB?) Also, I notice the lis.config.nldas.netcdf* files point to /css/nldas/NLDAS2. Do you have a batch script with the SLURM settings to access that directory from the compute nodes?

@emkemp
Copy link
Contributor

emkemp commented Jul 22, 2024

@dmocko Disregard the question about FORB, you answered in in the README file. (Note to all: RTFM.)

@emkemp
Copy link
Contributor

emkemp commented Jul 22, 2024

@dmocko You can ignore my other question about /css access. LIS can see it just fine on the Milan nodes. (I was probably thinking about a special node to write to /css, but that is not needed for these test cases.)

Notable finding: LIS will typically crash when launched with mpirun with a MPI error:

Abort(1090063) on node 0 (rank 0 in comm 0): Fatal error in PMPI_Init_thread: Unknown error class, error stack:
MPIR_Init_thread(189)...:
MPID_Init(1547).........:
MPIDU_Init_shm_init(180): unable to allocate shared memory

But I don't see that when I use srun. Not sure what the problem is (the domain is 468x228).

@emkemp
Copy link
Contributor

emkemp commented Jul 22, 2024

I get bit-wise identical results for the netCDF target output, both w/ and w/o FORB secondary forcing.

Identical results are produced.
No changes in results.
@dmocko
Copy link
Contributor Author

dmocko commented Jul 23, 2024

Hi @emkemp

I was traveling yesterday, and wasn't able to respond to your questions.

I added a SLURM script to the pull request directory:
/discover/nobackup/projects/lis/tools/pull_requests/1578/nldas.job

It does include this line: #SBATCH --constraint="mil&cssro"

https://www.nccs.nasa.gov/nccs-users/instructional/using-discover/file-system-storage/discover-css

Based on the link above, the "cssro" constraint is listed as being needed, but it's possible by now that by default, all Milan SCU17 nodes have CSS read-only access by default. I had run this job initially without adding the "cssro" constraint.

The "cssrw" constraint is only available for "datamove" nodes.

Are you still getting that crash with mpirun? Can you try the above SLURM script as well?

@dmocko
Copy link
Contributor Author

dmocko commented Jul 23, 2024

But I don't see that when I use srun. Not sure what the problem is (the domain is 468x228).

Minor point, but the NLDAS-2 domain is 464 by 224.

@dmocko
Copy link
Contributor Author

dmocko commented Jul 23, 2024

What is the reason for changing the indentation in the code in commit "0993d3c" and the follow-on commit "303dfe1"?

Is it just personal preference, or do these changes add something to the code?

If it's just personal preference - I would prefer that these commits are backed-out, as I wrote this code, and will maintain it going forward.

@emkemp
Copy link
Contributor

emkemp commented Jul 23, 2024

Hi @dmocko

I admit "303dfe1" was probably me being too opinionated. I have no objection to reverting that commit (though I'll need Jim's help on doing that for both GitHub and my local repo on Discover).

But I'm digging in on "0993d3c". Fixed format wastes space, and interferes with seeing how code is nested within the larger file. This is particularly true in nldas20_forcingMod.F90, where most of the code inside each subroutine has the same indentation as the subroutine declaration, which in turn have the same indentation as the parent module declaration. Also, member variables in the nldas20_type_dec type have the same indentation as the type declaration (this specific code confused me when I first looked at it).

Fixed format dates back to the punch card era, and it's a compromise that is no longer needed. Certainly not with brand new code written in-house.

@dmocko
Copy link
Contributor Author

dmocko commented Jul 23, 2024

@emkemp Nope - if you insist on "0993d3c", then I also like "303dfe1" as well. So keep them both.

[ERR] usually means LIS cannot continue, so messages for less serious
events have been changed to [WARN].

Also changed "Error:" to "[ERR]" for consistency.
@jvgeiger
Copy link
Contributor

Would you please give me the ldt.config file that you used to make your lis_input.nldas.noahmp401.nc file?

@dmocko
Copy link
Contributor Author

dmocko commented Jul 24, 2024

Would you please give me the ldt.config file that you used to make your lis_input.nldas.noahmp401.nc file?

Also placed in: /discover/nobackup/projects/lis/tools/pull_requests/1578

In the same directory, I also placed a "data" directory with a required mask file. Please compile LDT big_endian for the mask.

@dmocko
Copy link
Contributor Author

dmocko commented Jul 24, 2024

@dmocko You can ignore my other question about /css access. LIS can see it just fine on the Milan nodes. (I was probably thinking about a special node to write to /css, but that is not needed for these test cases.)

Notable finding: LIS will typically crash when launched with mpirun with a MPI error:

Abort(1090063) on node 0 (rank 0 in comm 0): Fatal error in PMPI_Init_thread: Unknown error class, error stack: MPIR_Init_thread(189)...: MPID_Init(1547).........: MPIDU_Init_shm_init(180): unable to allocate shared memory

But I don't see that when I use srun. Not sure what the problem is (the domain is 468x228).

Hi @emkemp - Are you still getting this error when using mpirun? Have you tried the nldas.job SLURM script I shared?

@emkemp
Copy link
Contributor

emkemp commented Jul 24, 2024

Hi @dmocko

It was still crashing with mpirun. I didn't try your job script (no modules were defined), but I did tweak the SBATCH settings.

@emkemp
Copy link
Contributor

emkemp commented Jul 24, 2024

@dmocko

I tried your nldas.job (tweaking the lis.config filename, and commenting out the call to dmm_get_sbus.bash), and it again crashes with mpirun.

Abort(1090063) on node 0 (rank 0 in comm 0): Fatal error in PMPI_Init_thread: Un
known error class, error stack:
MPIR_Init_thread(189)...:
MPID_Init(1547).........:
MPIDU_Init_shm_init(180): unable to allocate shared memory
Abort(1090063) on node 0 (rank 0 in comm 0): Fatal error in PMPI_Init_thread: Un
known error class, error stack:
MPIR_Init_thread(189)...:
MPID_Init(1547).........:
MPIDU_Init_shm_init(180): unable to allocate shared memory
Abort(1090063) on node 0 (rank 0 in comm 0): Fatal error in PMPI_Init_thread: Unknown error class, error stack:
MPIR_Init_thread(189)...:
MPID_Init(1547).........:
MPIDU_Init_shm_init(180): unable to allocate shared memory
Abort(1090063) on node 0 (rank 0 in comm 0): Fatal error in PMPI_Init_thread: Unknown error class, error stack:
MPIR_Init_thread(189)...:
MPID_Init(1547).........:
MPIDU_Init_shm_init(180): unable to allocate shared memory
Abort(1090063) on node 0 (rank 0 in comm 0): Fatal error in PMPI_Init_thread: Unknown error class, error stack:
MPIR_Init_thread(189)...:
MPID_Init(1547).........:
MPIDU_Init_shm_init(180): unable to allocate shared memory

@jvgeiger
Copy link
Contributor

@dmocko Thanks for the LDT config file. I am unable to reproduce your domain and parameter file. Do you still have the LDT log files?

@jvgeiger
Copy link
Contributor

@emkemp I was able to run this pull request last week. I will pull the latest commits and try again.

@dmocko
Copy link
Contributor Author

dmocko commented Jul 24, 2024

@emkemp I compiled and ran using these modules:

Currently Loaded Modules:
  1) comp/gcc/13.2.0   2) comp/intel/2023.2.1   3) mpi/impi/2021.11   4) git/2.42.0   5) lisf_7.5_intel_2023.2.1

I also copied my configure.lis file into the pull_requests/1578 directory.

@dmocko
Copy link
Contributor Author

dmocko commented Jul 24, 2024

@dmocko Thanks for the LDT config file. I am unable to reproduce your domain and parameter file. Do you still have the LDT log files?

Amazingly, I do! I also put them in: /discover/nobackup/projects/lis/tools/pull_requests/1578

Obviously, my lis_input and ldt.config files are 5+ years old. I will try to generate the lis_input file again,
and then run the new NLDAS-2.0 forcing code again with it to confirm that it still works.

@dmocko
Copy link
Contributor Author

dmocko commented Jul 24, 2024

Obviously, my lis_input and ldt.config files are 5+ years old. I will try to generate the lis_input file again, and then run the new NLDAS-2.0 forcing code again with it to confirm that it still works.

Actually, I just ran the ldt.config file again on Milan, and got the same exact ldtlog file. And the lis_input file is identical, other than some points in the "HYMAP_grid_elevation" field. But since I'm not running HYMAP for this testcase, it's not an issue. Do you see other fields that are different?

I put the new lis_input file here: /discover/nobackup/projects/lis/tools/pull_requests/1578/newldt

@dmocko
Copy link
Contributor Author

dmocko commented Jul 24, 2024

I just pulled the latest commits, and was able to run without any issue using "mpirun" and my SLURM script.

@emkemp
Copy link
Contributor

emkemp commented Jul 25, 2024

Huh. When I recompile from scratch with my older SLES15 environment, nldas.job runs w/o issue.

The module list is:

  1. comp/gcc/13.2.0 5) lisf_7.5_intel_2023.2.1 9) xpdf/4.04
  2. comp/intel/2023.2.1 6) nccmp/1.8.6.5 10) xxdiff/3.2.0
  3. mpi/impi/2021.11 7) nco/5.1.7
  4. git/2.42.0 8) ncview/2.1.7

I'll experiment more with my newer SLES15 environment.

@emkemp
Copy link
Contributor

emkemp commented Jul 25, 2024

Huh part 2. I recompiled in my new SLES15 environment:

  1. comp/gcc/13.2.0 5) ImageMagick/7.1.1-15 9) ncview/2.1.7
  2. comp/intel/2023.2.1 6) git/2.45.0 10) xpdf/4.04
  3. mpi/impi/2021.11 7) nccmp/1.8.6.5 11) xxdiff/3.2.0
  4. lisf_7.6_intel_2023.2.1_emk 8) nco/5.1.7

And nldas.job ran successfully 3 times. Maybe I was getting a bad node a few days ago?

@emkemp
Copy link
Contributor

emkemp commented Jul 25, 2024

...aaand I get a crash on test 4.

But, the 3 successful tests with my new environment executable were launched (sbatched) in my old environment. This latest test was launched in the new environment.

I wonder if there is a problem introduced by the ImageMagick/7.1.1-15 module. I will investigate.

@jvgeiger
Copy link
Contributor

And the lis_input file is identical, other than some points in the "HYMAP_grid_elevation" field.

That is what I saw as well. Thanks.

@emkemp
Copy link
Contributor

emkemp commented Jul 25, 2024

Grrr. Removing ImageMagick/7.1.1-15 doesn't solve the problem. Worse, the most recent failure doesn't have any MPI error message, it just stops with an error code. (I changed the execution line to "mpirun -np 252 ./LIS -f lis.config || exit 1".) There is no output or lislog, the SLURM sjobexitmod command clearly indicates a runtime failure.

[emkemp@discover32 /discover/nobackup/projects/usaf_lis/emkemp/pr/1578/LISF/lis]$ sjobexitmod -l 36405759
JobID Account NNodes NodeList State ExitCode DerivedExitCode Comment


36405759 s1189 2 borgl[103,105] FAILED 1:0 0:0

@dmocko
Copy link
Contributor Author

dmocko commented Jul 25, 2024

@emkemp - But is this a reason to not accept this pull request? It runs with a standard environment.

We need this PR merged into master so it's available as soon as possible for operational users.

Sometime in August, the NLDAS-2 GRIB-1 stream will be turned off at the GES DISC, and users need to transition to the netCDF-4 stream.

@emkemp
Copy link
Contributor

emkemp commented Jul 25, 2024

@dmocko Since I'm the only one experiencing this problem, I will say 'no' on denying the pull request. It's still possible there is something in my environment that is interfering. I just would like to know why this is happening. @jvgeiger and I will discuss at 2 pm today.

@emkemp
Copy link
Contributor

emkemp commented Jul 25, 2024

@dmocko Jim and I are satisfied. Pull request APPROVED.

@emkemp emkemp merged commit b36e3e7 into NASA-LIS:master Jul 25, 2024
jvgeiger added a commit to jvgeiger/LISF that referenced this pull request Jul 25, 2024
jvgeiger added a commit to jvgeiger/LISF that referenced this pull request Jul 25, 2024
@dmocko dmocko deleted the feature/nldas2-lismetforcing-netcdf branch July 30, 2024 14:57
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Documentation Update to documentation (User's Guide or within the code) enhancement New feature or request HighPriority metforcing
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Need to use NLDAS-2 netCDF-4 format data as LIS metforcing
3 participants