
Debug rc5 issues #475

Closed (wants to merge 4 commits)

Conversation

@forsyth2 (Collaborator) commented Aug 8, 2023

Debug rc5 issues mentioned in #474.

This pull request will probably not be merged; I'm making it to share the debugging process easily.

@forsyth2 added the "semver: bug" label (Bug fix; will increment patch version) Aug 8, 2023
@forsyth2 self-assigned this Aug 8, 2023
@forsyth2 (Collaborator, Author) commented Aug 8, 2023

Commit 1: Chrysalis -- latest zppy dev (conda activate zppy_dev_pre_rc6), using rc9 packages
Code appears to work fine

$ cd /lcrc/group/e3sm/ac.forsyth2/zppy_test_complete_run_output/v2.LR.historical_0201/post/scripts/
$ grep -v "OK" *status
# No failures
$ grep -n "Segmentation" *
# No results
$ grep -n "BrokenProcessPool" *
# No results

Output:

$ mv /lcrc/group/e3sm/public_html/diagnostic_output/ac.forsyth2/zppy_test_complete_run_www/v2.LR.historical_0201 /lcrc/group/e3sm/public_html/diagnostic_output/ac.forsyth2/zppy_test_complete_run_www/v2.LR.historical_0201_20230808v3_zppy_dev_rc9_packages
$ mv /lcrc/group/e3sm/ac.forsyth2/zppy_test_complete_run_output/v2.LR.historical_0201/post /lcrc/group/e3sm/ac.forsyth2/zppy_test_complete_run_output/v2.LR.historical_0201/post_20230808v3_zppy_dev_rc9_packages

@forsyth2 (Collaborator, Author) commented Aug 8, 2023

Commit 2: Chrysalis -- latest zppy dev (conda activate zppy_dev_pre_rc6), using rc10 packages
Code fails with the errors described in #474

$ cd /lcrc/group/e3sm/ac.forsyth2/zppy_test_complete_run_output/v2.LR.historical_0201/post/scripts/
$ grep -v "OK" *status
# No failures
$ grep -n "Segmentation" *
e3sm_diags_atm_monthly_180x360_aave_model_vs_obs_1850-1851.o371015:329:[chr-0244:646185:0:646330] Caught signal 11 (Segmentation fault: address not mapped to object at address 0x25008053d6)
e3sm_diags_atm_monthly_180x360_aave_model_vs_obs_1850-1851.o371015:331:[chr-0244:646185:0:646347] Caught signal 11 (Segmentation fault: address not mapped to object at address 0x1f9a6)
e3sm_diags_atm_monthly_180x360_aave_model_vs_obs_1850-1853.o371017:369:[chr-0250:776161:0:776314] Caught signal 11 (Segmentation fault: address not mapped to object at address 0x15772c6e63f8)
e3sm_diags_atm_monthly_180x360_aave_model_vs_obs_1850-1853.o371017:423:[chr-0250:776161:0:776331] Caught signal 11 (Segmentation fault: address not mapped to object at address 0x8)
e3sm_diags_atm_monthly_180x360_aave_model_vs_obs_1852-1853.o371016:326:[chr-0249:798444:0:798619] Caught signal 11 (Segmentation fault: address not mapped to object at address 0x23f732a2e8)
e3sm_diags_atm_monthly_180x360_aave_model_vs_obs_1852-1853.o371016:372:[chr-0249:798444:0:798636] Caught signal 11 (Segmentation fault: address not mapped to object at address 0x24ef84bb2a)
$ grep -n "BrokenProcessPool" *
e3sm_diags_atm_monthly_180x360_aave_model_vs_obs_1850-1851.o371015:687:concurrent.futures.process.BrokenProcessPool: A process in the process pool was terminated abruptly while the future was running or pending.
e3sm_diags_atm_monthly_180x360_aave_model_vs_obs_1850-1853.o371017:614:concurrent.futures.process.BrokenProcessPool: A process in the process pool was terminated abruptly while the future was running or pending.
e3sm_diags_atm_monthly_180x360_aave_model_vs_obs_1852-1853.o371016:583:concurrent.futures.process.BrokenProcessPool: A process in the process pool was terminated abruptly while the future was running or pending.

Output:

$ mv /lcrc/group/e3sm/public_html/diagnostic_output/ac.forsyth2/zppy_test_complete_run_www/v2.LR.historical_0201 /lcrc/group/e3sm/public_html/diagnostic_output/ac.forsyth2/zppy_test_complete_run_www/v2.LR.historical_0201_20230808v4_zppy_dev_rc10_packages
$ mv /lcrc/group/e3sm/ac.forsyth2/zppy_test_complete_run_output/v2.LR.historical_0201/post /lcrc/group/e3sm/ac.forsyth2/zppy_test_complete_run_output/v2.LR.historical_0201/post_20230808v4_zppy_dev_rc10_packages

@mahf708 commented Aug 8, 2023

do you have a log of versions changed between rc9 and rc10 in the packages? I can probably produce it if not...

@forsyth2 (Collaborator, Author) commented Aug 8, 2023

do you have a log of versions changed between rc9 and rc10 in the packages? I can probably produce it if not...

@mahf708 The difference is between Unified rc9 and Unified rc10, so I think the version lists for each are probably something @xylar would have.

@mahf708 commented Aug 8, 2023

1c1
< # packages in environment at /lcrc/soft/climate/e3sm-unified/base/envs/e3sm_unified_1.9.0rc9_login:
---
> # packages in environment at /lcrc/soft/climate/e3sm-unified/base/envs/e3sm_unified_1.9.0rc10_login:
16c16
< async-lru                 2.0.3              pyhd8ed1ab_0    conda-forge
---
> async-lru                 2.0.4              pyhd8ed1ab_0    conda-forge
59c59
< cdms2                     3.1.5           py310heeafeea_20    conda-forge
---
> cdms2                     3.1.5           py310heeafeea_21    conda-forge
67d66
< cfitsio                   4.2.0                hd9d235c_0    conda-forge
78c77
< comm                      0.1.3              pyhd8ed1ab_0    conda-forge
---
> comm                      0.1.4              pyhd8ed1ab_0    conda-forge
90c89
< debugpy                   1.6.7           py310heca2aa9_0    conda-forge
---
> debugpy                   1.6.8           py310hc6cd4ac_0    conda-forge
96c95
< e3sm-unified              1.9.0rc9        mpi_mpich_py310_hdc99501_0    e3sm/label/e3sm_dev
---
> e3sm-unified              1.9.0rc10       mpi_mpich_py310_hdc99501_0    e3sm/label/e3sm_dev
98c97
< e3sm_to_cmip              1.10.0rc1          pyhe9a6732_0    conda-forge/label/e3sm_to_cmip_dev
---
> e3sm_to_cmip              1.10.0rc2          pyhe9a6732_0    conda-forge/label/e3sm_to_cmip_dev
119c118
< fonttools                 4.41.1          py310h2372a71_0    conda-forge
---
> fonttools                 4.42.0          py310h2372a71_0    conda-forge
156c155
< imagecodecs               2023.7.10       py310h4c4fb95_0    conda-forge
---
> imagecodecs               2023.7.10       py310hc929067_2    conda-forge
167c166
< ipywidgets                8.0.7              pyhd8ed1ab_0    conda-forge
---
> ipywidgets                8.1.0              pyhd8ed1ab_0    conda-forge
170c169
< jedi                      0.18.2             pyhd8ed1ab_0    conda-forge
---
> jedi                      0.19.0             pyhd8ed1ab_0    conda-forge
179c178
< jsonschema                4.18.4             pyhd8ed1ab_0    conda-forge
---
> jsonschema                4.18.6             pyhd8ed1ab_0    conda-forge
181c180
< jsonschema-with-format-nongpl 4.18.4             pyhd8ed1ab_0    conda-forge
---
> jsonschema-with-format-nongpl 4.18.6             pyhd8ed1ab_0    conda-forge
187c186
< jupyter_events            0.6.3              pyhd8ed1ab_1    conda-forge
---
> jupyter_events            0.7.0              pyhd8ed1ab_1    conda-forge
190c189
< jupyterlab                4.0.3              pyhd8ed1ab_0    conda-forge
---
> jupyterlab                4.0.4              pyhd8ed1ab_0    conda-forge
207c206
< libarrow                  12.0.1           h657c46f_6_cpu    conda-forge
---
> libarrow                  12.0.1           h657c46f_7_cpu    conda-forge
214c213
< libcap                    2.67                 he9d0100_0    conda-forge
---
> libcap                    2.69                 h0f662aa_0    conda-forge
219,220c218,219
< libclang                  15.0.7          default_h7634d5b_2    conda-forge
< libclang13                15.0.7          default_h9986a30_2    conda-forge
---
> libclang                  15.0.7          default_h7634d5b_3    conda-forge
> libclang13                15.0.7          default_h9986a30_3    conda-forge
248c247
< libllvm14                 14.0.6               hcd5def8_3    conda-forge
---
> libllvm14                 14.0.6               hcd5def8_4    conda-forge
262c261
< librsvg                   2.56.1               h98fae49_0    conda-forge
---
> librsvg                   2.56.3               h98fae49_0    conda-forge
268c267
< libsystemd0               253                  h8c4010b_1    conda-forge
---
> libsystemd0               254                  h3516f8a_0    conda-forge
272c271
< libudev1                  253                  h0b41bf4_1    conda-forge
---
> libudev1                  254                  h3f72095_0    conda-forge
293c292
< mache                     1.17.0rc1          pyh4bc9f2b_0    conda-forge/label/mache_dev
---
> mache                     1.17.0rc2          pyh4bc9f2b_0    conda-forge/label/mache_dev
302c301
< mpas-analysis             1.9.0rc3           pyh320ef33_0    conda-forge/label/mpas_analysis_dev
---
> mpas-analysis             1.9.0rc4           pyh320ef33_0    conda-forge/label/mpas_analysis_dev
309c308
< mpich                     4.1.1              h846660c_100    conda-forge
---
> mpich                     4.1.2              h846660c_100    conda-forge
322c321
< nbformat                  5.9.1              pyhd8ed1ab_0    conda-forge
---
> nbformat                  5.9.2              pyhd8ed1ab_0    conda-forge
343c342
< openssl                   3.1.1                hd590300_1    conda-forge
---
> openssl                   3.1.2                hd590300_0    conda-forge
368c367
< platformdirs              3.9.1              pyhd8ed1ab_0    conda-forge
---
> platformdirs              3.10.0             pyhd8ed1ab_0    conda-forge
385c384
< pyarrow                   12.0.1          py310h0576679_6_cpu    conda-forge
---
> pyarrow                   12.0.1          py310h0576679_7_cpu    conda-forge
391c390
< pyparsing                 3.1.0              pyhd8ed1ab_0    conda-forge
---
> pyparsing                 3.1.1              pyhd8ed1ab_0    conda-forge
405c404
< python-utils              3.7.0              pyhd8ed1ab_0    conda-forge
---
> python-utils              3.7.0              pyhd8ed1ab_1    conda-forge
418c417
< referencing               0.30.0             pyhd8ed1ab_0    conda-forge
---
> referencing               0.30.1             pyhd8ed1ab_0    conda-forge
433c432
< sip                       6.7.10          py310hc6cd4ac_0    conda-forge
---
> sip                       6.7.11          py310hc6cd4ac_0    conda-forge
535c534
< zppy                      2.3.0rc3           pyh51c0ceb_0    conda-forge/label/zppy_dev
---
> zppy                      2.3.0rc5           pyh51c0ceb_0    conda-forge/label/zppy_dev

@mahf708 commented Aug 8, 2023

Okay, do you have any guesses where exactly the segfault is originating? I cannot quite figure it out from the logs. Also, is dask being used to trigger additional jobs or just in-node jobs? (Not sure if you know the latter)

@mahf708 commented Aug 8, 2023

The nodes associated with your seg faults are: chr-0249, chr-0244, chr-0250

@mahf708 commented Aug 8, 2023

Has anyone ever seen BFD: Dwarf Error: ...? That's new to me.

BFD: Dwarf Error: Can't find .debug_ranges section.
(the line above is repeated many times in the log)

@chengzhuzhang (Collaborator):

@mahf708 thanks for helping troubleshoot. The listed packages are for the login node. I think we should look at the compute-node versions of the packages?

@mahf708 commented Aug 8, 2023

They're almost the same. I will post the other list here.

@mahf708 commented Aug 8, 2023

1c1
< # packages in environment at /lcrc/soft/climate/e3sm-unified/base/envs/e3sm_unified_1.9.0rc9_chrysalis:
---
> # packages in environment at /lcrc/soft/climate/e3sm-unified/base/envs/e3sm_unified_1.9.0rc10_chrysalis:
16c16
< async-lru                 2.0.3              pyhd8ed1ab_0    conda-forge
---
> async-lru                 2.0.4              pyhd8ed1ab_0    conda-forge
59c59
< cdms2                     3.1.5           py310heeafeea_20    conda-forge
---
> cdms2                     3.1.5           py310heeafeea_21    conda-forge
67d66
< cfitsio                   4.2.0                hd9d235c_0    conda-forge
78c77
< comm                      0.1.3              pyhd8ed1ab_0    conda-forge
---
> comm                      0.1.4              pyhd8ed1ab_0    conda-forge
90c89
< debugpy                   1.6.7           py310heca2aa9_0    conda-forge
---
> debugpy                   1.6.8           py310hc6cd4ac_0    conda-forge
96c95
< e3sm-unified              1.9.0rc9        hpc_py310_hd6e50ed_0    e3sm/label/e3sm_dev
---
> e3sm-unified              1.9.0rc10       hpc_py310_hd6e50ed_0    e3sm/label/e3sm_dev
98c97
< e3sm_to_cmip              1.10.0rc1          pyhe9a6732_0    conda-forge/label/e3sm_to_cmip_dev
---
> e3sm_to_cmip              1.10.0rc2          pyhe9a6732_0    conda-forge/label/e3sm_to_cmip_dev
119c118
< fonttools                 4.41.1          py310h2372a71_0    conda-forge
---
> fonttools                 4.42.0          py310h2372a71_0    conda-forge
156c155
< imagecodecs               2023.7.10       py310h4c4fb95_0    conda-forge
---
> imagecodecs               2023.7.10       py310hc929067_2    conda-forge
167c166
< ipywidgets                8.0.7              pyhd8ed1ab_0    conda-forge
---
> ipywidgets                8.1.0              pyhd8ed1ab_0    conda-forge
170c169
< jedi                      0.18.2             pyhd8ed1ab_0    conda-forge
---
> jedi                      0.19.0             pyhd8ed1ab_0    conda-forge
179c178
< jsonschema                4.18.4             pyhd8ed1ab_0    conda-forge
---
> jsonschema                4.18.6             pyhd8ed1ab_0    conda-forge
181c180
< jsonschema-with-format-nongpl 4.18.4             pyhd8ed1ab_0    conda-forge
---
> jsonschema-with-format-nongpl 4.18.6             pyhd8ed1ab_0    conda-forge
187c186
< jupyter_events            0.6.3              pyhd8ed1ab_1    conda-forge
---
> jupyter_events            0.7.0              pyhd8ed1ab_1    conda-forge
190c189
< jupyterlab                4.0.3              pyhd8ed1ab_0    conda-forge
---
> jupyterlab                4.0.4              pyhd8ed1ab_0    conda-forge
207c206
< libarrow                  12.0.1           h657c46f_6_cpu    conda-forge
---
> libarrow                  12.0.1           h657c46f_7_cpu    conda-forge
214c213
< libcap                    2.67                 he9d0100_0    conda-forge
---
> libcap                    2.69                 h0f662aa_0    conda-forge
218,219c217,218
< libclang                  15.0.7          default_h7634d5b_2    conda-forge
< libclang13                15.0.7          default_h9986a30_2    conda-forge
---
> libclang                  15.0.7          default_h7634d5b_3    conda-forge
> libclang13                15.0.7          default_h9986a30_3    conda-forge
246c245
< libllvm14                 14.0.6               hcd5def8_3    conda-forge
---
> libllvm14                 14.0.6               hcd5def8_4    conda-forge
259c258
< librsvg                   2.56.1               h98fae49_0    conda-forge
---
> librsvg                   2.56.3               h98fae49_0    conda-forge
265c264
< libsystemd0               253                  h8c4010b_1    conda-forge
---
> libsystemd0               254                  h3516f8a_0    conda-forge
289c288
< mache                     1.17.0rc1          pyh4bc9f2b_0    conda-forge/label/mache_dev
---
> mache                     1.17.0rc2          pyh4bc9f2b_0    conda-forge/label/mache_dev
297c296
< mpas-analysis             1.9.0rc3           pyh320ef33_0    conda-forge/label/mpas_analysis_dev
---
> mpas-analysis             1.9.0rc4           pyh320ef33_0    conda-forge/label/mpas_analysis_dev
315c314
< nbformat                  5.9.1              pyhd8ed1ab_0    conda-forge
---
> nbformat                  5.9.2              pyhd8ed1ab_0    conda-forge
334c333
< openssl                   3.1.1                hd590300_1    conda-forge
---
> openssl                   3.1.2                hd590300_0    conda-forge
357c356
< platformdirs              3.9.1              pyhd8ed1ab_0    conda-forge
---
> platformdirs              3.10.0             pyhd8ed1ab_0    conda-forge
374c373
< pyarrow                   12.0.1          py310h0576679_6_cpu    conda-forge
---
> pyarrow                   12.0.1          py310h0576679_7_cpu    conda-forge
380c379
< pyparsing                 3.1.0              pyhd8ed1ab_0    conda-forge
---
> pyparsing                 3.1.1              pyhd8ed1ab_0    conda-forge
394c393
< python-utils              3.7.0              pyhd8ed1ab_0    conda-forge
---
> python-utils              3.7.0              pyhd8ed1ab_1    conda-forge
407c406
< referencing               0.30.0             pyhd8ed1ab_0    conda-forge
---
> referencing               0.30.1             pyhd8ed1ab_0    conda-forge
422c421
< sip                       6.7.10          py310hc6cd4ac_0    conda-forge
---
> sip                       6.7.11          py310hc6cd4ac_0    conda-forge
521c520
< zppy                      2.3.0rc3           pyh51c0ceb_0    conda-forge/label/zppy_dev
---
> zppy                      2.3.0rc5           pyh51c0ceb_0    conda-forge/label/zppy_dev

@mahf708 commented Aug 8, 2023

diff of diffs...

2c2
< < # packages in environment at /lcrc/soft/climate/e3sm-unified/base/envs/e3sm_unified_1.9.0rc9_chrysalis:
---
> < # packages in environment at /lcrc/soft/climate/e3sm-unified/base/envs/e3sm_unified_1.9.0rc9_login:
4c4
< > # packages in environment at /lcrc/soft/climate/e3sm-unified/base/envs/e3sm_unified_1.9.0rc10_chrysalis:
---
> > # packages in environment at /lcrc/soft/climate/e3sm-unified/base/envs/e3sm_unified_1.9.0rc10_login:
24c24
< < e3sm-unified              1.9.0rc9        hpc_py310_hd6e50ed_0    e3sm/label/e3sm_dev
---
> < e3sm-unified              1.9.0rc9        mpi_mpich_py310_hdc99501_0    e3sm/label/e3sm_dev
26c26
< > e3sm-unified              1.9.0rc10       hpc_py310_hd6e50ed_0    e3sm/label/e3sm_dev
---
> > e3sm-unified              1.9.0rc10       mpi_mpich_py310_hdc99501_0    e3sm/label/e3sm_dev
71c71
< 218,219c217,218
---
> 219,220c218,219
77c77
< 246c245
---
> 248c247
81c81
< 259c258
---
> 262c261
85c85
< 265c264
---
> 268c267
89c89,93
< 289c288
---
> 272c271
> < libudev1                  253                  h0b41bf4_1    conda-forge
> ---
> > libudev1                  254                  h3f72095_0    conda-forge
> 293c292
93c97
< 297c296
---
> 302c301
97c101,105
< 315c314
---
> 309c308
> < mpich                     4.1.1              h846660c_100    conda-forge
> ---
> > mpich                     4.1.2              h846660c_100    conda-forge
> 322c321
101c109
< 334c333
---
> 343c342
105c113
< 357c356
---
> 368c367
109c117
< 374c373
---
> 385c384
113c121
< 380c379
---
> 391c390
117c125
< 394c393
---
> 405c404
121c129
< 407c406
---
> 418c417
125c133
< 422c421
---
> 433c432
129c137
< 521c520
---
> 535c534

@forsyth2 (Collaborator, Author) commented Aug 8, 2023

Okay, do you have any guesses where exactly the segfault is originating? I cannot quite figure it out from the logs.

It looks like you've managed to look at the output files. In any case, running grep -B 5 -n "Segmentation" * from /lcrc/group/e3sm/ac.forsyth2/zppy_test_complete_run_output/v2.LR.historical_0201/post_20230808v4_zppy_dev_rc10_packages/scripts seems to show the segfault occurring after processing SST or albedo in most cases.
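
For reference, that check as a runnable snippet (using the directory named above):

$ cd /lcrc/group/e3sm/ac.forsyth2/zppy_test_complete_run_output/v2.LR.historical_0201/post_20230808v4_zppy_dev_rc10_packages/scripts
$ grep -B 5 -n "Segmentation" *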

Also, is dask being used to trigger additional jobs or just in-node jobs? (Not sure if you know the latter)

I unfortunately have no idea. I don't work with dask directly.

@chengzhuzhang (Collaborator):

In a standalone test of e3sm_diags using Unified rc9 vs rc10 on Perlmutter, upon initiating the run I see unexpected MPI messages in rc10 only (not in rc9):

PE 0: MPICH processor detected:
PE 0:   AMD Milan (25:1:1) (family:model:stepping)
MPI VERSION    : CRAY MPICH version 8.1.24.16 (ANL base 3.4a2)
MPI BUILD INFO : Wed Jan 18 17:36 2023 (git hash 11b1c78) (CH4)
PE 0: MPICH environment settings =====================================
PE 0:   MPICH_ENV_DISPLAY                              = 1
PE 0:   MPICH_VERSION_DISPLAY                          = 1
PE 0:   MPICH_ABORT_ON_ERROR                           = 0
PE 0:   MPICH_CPUMASK_DISPLAY                          = 0
PE 0:   MPICH_STATS_DISPLAY                            = 0
PE 0:   MPICH_RANK_REORDER_METHOD                      = 1
PE 0:   MPICH_RANK_REORDER_DISPLAY                     = 0
PE 0:   MPICH_MEMCPY_MEM_CHECK                         = 0
PE 0:   MPICH_USE_SYSTEM_MEMCPY                        = 0
PE 0:   MPICH_OPTIMIZED_MEMCPY                         = 1
PE 0:   MPICH_ALLOC_MEM_PG_SZ                          = 4096
PE 0:   MPICH_ALLOC_MEM_POLICY                         = PREFERRED
PE 0:   MPICH_ALLOC_MEM_AFFINITY                       = SYS_DEFAULT
PE 0:   MPICH_MALLOC_FALLBACK                          = 0
PE 0:   MPICH_MEM_DEBUG_FNAME                          = 
PE 0:   MPICH_INTERNAL_MEM_AFFINITY                    = SYS_DEFAULT
PE 0:   MPICH_NO_BUFFER_ALIAS_CHECK                    = 0
PE 0:   MPICH_COLL_SYNC                                = 0
PE 0:   MPICH_SINGLE_HOST_ENABLED                        = 1
PE 0: MPICH/RMA environment settings =================================
PE 0:   MPICH_RMA_MAX_PENDING                          = 128
PE 0:   MPICH_RMA_SHM_ACCUMULATE                       = 0
PE 0: MPICH/Dynamic Process Management environment settings ==========
PE 0:   MPICH_DPM_DIR                                  = 
PE 0:   MPICH_LOCAL_SPAWN_SERVER                       = 0
PE 0:   MPICH_SPAWN_USE_RANKPOOL                       = 1
PE 0: MPICH/SMP environment settings =================================
PE 0:   MPICH_SMP_SINGLE_COPY_MODE                     = XPMEM
PE 0:   MPICH_SMP_SINGLE_COPY_SIZE                     = 8192
PE 0:   MPICH_SHM_PROGRESS_MAX_BATCH_SIZE              = 8
PE 0: MPICH/COLLECTIVE environment settings ==========================
PE 0:   MPICH_COLL_OPT_OFF                             = 0
PE 0:   MPICH_BCAST_ONLY_TREE                          = 1
PE 0:   MPICH_BCAST_INTERNODE_RADIX                    = 4
PE 0:   MPICH_BCAST_INTRANODE_RADIX                    = 4
PE 0:   MPICH_ALLTOALL_SHORT_MSG                       = 64-512
PE 0:   MPICH_ALLTOALL_SYNC_FREQ                       = 1-24
PE 0:   MPICH_ALLTOALLV_THROTTLE                       = 8
PE 0:   MPICH_ALLGATHER_VSHORT_MSG                     = 1024-4096
PE 0:   MPICH_ALLGATHERV_VSHORT_MSG                    = 1024-4096
PE 0:   MPICH_GATHERV_SHORT_MSG                        = 131072
PE 0:   MPICH_GATHERV_MIN_COMM_SIZE                    = 64
PE 0:   MPICH_GATHERV_MAX_TMP_SIZE                     = 536870912
PE 0:   MPICH_GATHERV_SYNC_FREQ                        = 16
PE 0:   MPICH_IGATHERV_RAND_COMMSIZE                   = 2048
PE 0:   MPICH_IGATHERV_RAND_RECVLIST                   = 0
PE 0:   MPICH_SCATTERV_SHORT_MSG                       = 2048-8192
PE 0:   MPICH_SCATTERV_MIN_COMM_SIZE                   = 64
PE 0:   MPICH_SCATTERV_MAX_TMP_SIZE                    = 536870912
PE 0:   MPICH_SCATTERV_SYNC_FREQ                       = 16
PE 0:   MPICH_SCATTERV_SYNCHRONOUS                     = 0
PE 0:   MPICH_ALLREDUCE_MAX_SMP_SIZE                   = 262144
PE 0:   MPICH_ALLREDUCE_BLK_SIZE                       = 716800
PE 0:   MPICH_GPU_ALLREDUCE_USE_KERNEL                 = 0
PE 0:   MPICH_GPU_COLL_STAGING_BUF_SIZE                = 1048576
PE 0:   MPICH_GPU_ALLREDUCE_STAGING_THRESHOLD          = 256
PE 0:   MPICH_ALLREDUCE_NO_SMP                         = 0
PE 0:   MPICH_REDUCE_NO_SMP                            = 0
PE 0:   MPICH_REDUCE_SCATTER_COMMUTATIVE_LONG_MSG_SIZE = 524288
PE 0:   MPICH_REDUCE_SCATTER_MAX_COMMSIZE              = 1000
PE 0:   MPICH_SHARED_MEM_COLL_OPT                      = 1
PE 0:   MPICH_SHARED_MEM_COLL_NCELLS                   = 8
PE 0:   MPICH_SHARED_MEM_COLL_CELLSZ                   = 256
PE 0: MPICH MPIIO environment settings ===============================
PE 0:   MPICH_MPIIO_HINTS_DISPLAY                      = 0
PE 0:   MPICH_MPIIO_HINTS                              = NULL
PE 0:   MPICH_MPIIO_ABORT_ON_RW_ERROR                  = disable
PE 0:   MPICH_MPIIO_CB_ALIGN                           = 2
PE 0:   MPICH_MPIIO_DVS_MAXNODES                       = 24
PE 0:   MPICH_MPIIO_AGGREGATOR_PLACEMENT_DISPLAY       = 0
PE 0:   MPICH_MPIIO_AGGREGATOR_PLACEMENT_STRIDE        = -1
PE 0:   MPICH_MPIIO_MAX_NUM_IRECV                      = 50
PE 0:   MPICH_MPIIO_MAX_NUM_ISEND                      = 50
PE 0:   MPICH_MPIIO_MAX_SIZE_ISEND                     = 10485760
PE 0:   MPICH_MPIIO_OFI_STARTUP_CONNECT                = disable
PE 0:   MPICH_MPIIO_OFI_STARTUP_NODES_AGGREGATOR        = 2
PE 0: MPICH MPIIO statistics environment settings ====================
PE 0:   MPICH_MPIIO_STATS                              = 0
PE 0:   MPICH_MPIIO_TIMERS                             = 0
PE 0:   MPICH_MPIIO_WRITE_EXIT_BARRIER                 = 1
PE 0: MPICH Thread Safety settings ===================================
PE 0:   MPICH_ASYNC_PROGRESS                           = 0
PE 0:   MPICH_OPT_THREAD_SYNC                          = 1
PE 0:   rank 0 required = multiple, was provided = multiple

This is followed by a traceback:

Traceback (most recent call last):
  File "/global/u2/c/chengzhu/e3sm_diags/examples/run_v2_9_0_all_sets_E3SM_machines.py", line 277, in <module>
    run_all_sets()
  File "/global/u2/c/chengzhu/e3sm_diags/examples/run_v2_9_0_all_sets_E3SM_machines.py", line 177, in run_all_sets
    runner.run_diags(
  File "/global/common/software/e3sm/anaconda_envs/base/envs/e3sm_unified_1.9.0rc10_pm-cpu/lib/python3.10/site-packages/e3sm_diags/run.py", line 34, in run_diags
    main(final_params)
  File "/global/common/software/e3sm/anaconda_envs/base/envs/e3sm_unified_1.9.0rc10_pm-cpu/lib/python3.10/site-packages/e3sm_diags/e3sm_diags_driver.py", line 414, in main
    save_provenance(parameters[0].results_dir, parser)
  File "/global/common/software/e3sm/anaconda_envs/base/envs/e3sm_unified_1.9.0rc10_pm-cpu/lib/python3.10/site-packages/e3sm_diags/e3sm_diags_driver.py", line 166, in save_provenance
    _save_env_yml(results_dir)
  File "/global/common/software/e3sm/anaconda_envs/base/envs/e3sm_unified_1.9.0rc10_pm-cpu/lib/python3.10/site-packages/e3sm_diags/e3sm_diags_driver.py", line 50, in _save_env_yml
    output, err = p.communicate()
  File "/global/common/software/e3sm/anaconda_envs/base/envs/e3sm_unified_1.9.0rc10_pm-cpu/lib/python3.10/subprocess.py", line 1154, in communicate
    stdout, stderr = self._communicate(input, endtime, timeout)
  File "/global/common/software/e3sm/anaconda_envs/base/envs/e3sm_unified_1.9.0rc10_pm-cpu/lib/python3.10/subprocess.py", line 2021, in _communicate
    ready = selector.select(timeout)
  File "/global/common/software/e3sm/anaconda_envs/base/envs/e3sm_unified_1.9.0rc10_pm-cpu/lib/python3.10/selectors.py", line 416, in select
    fd_event_list = self._selector.poll(timeout)
KeyboardInterrupt

@forsyth2 (Collaborator, Author) commented Aug 8, 2023

Commit 3: Compy -- latest zppy dev (conda activate zppy_dev_pre_rc6), using rc9 packages
Code appears to work fine

$ cd /compyfs/fors729/zppy_test_complete_run_output/v2.LR.historical_0201/post/scripts
$ grep -v "OK" *status
# No failures
$ grep -n "Segmentation" *
# No results
$ grep -n "BrokenProcessPool" *
# No results

Output:

$ mv /compyfs/www/fors729/zppy_test_complete_run_www/v2.LR.historical_0201 /compyfs/www/fors729/zppy_test_complete_run_www/v2.LR.historical_0201_20230808v2_zppy_dev_rc9_packages
$ mv /compyfs/fors729/zppy_test_complete_run_output/v2.LR.historical_0201/post /compyfs/fors729/zppy_test_complete_run_output/v2.LR.historical_0201/post_20230808v2_zppy_dev_rc9_packages

@mahf708 (this comment was marked as outdated)

@forsyth2 (Collaborator, Author) commented Aug 9, 2023

Commit 4: Compy -- latest zppy dev (conda activate zppy_dev_pre_rc6), using rc10 packages
Code fails with the errors described in #474

$ cd /compyfs/fors729/zppy_test_complete_run_output/v2.LR.historical_0201/post/scripts/
$ grep -v "OK" *status
climo_atm_monthly_180x360_aave_1850-1851.status:ERROR (3)
climo_atm_monthly_180x360_aave_1850-1853.status:ERROR (3)
climo_atm_monthly_180x360_aave_1852-1853.status:ERROR (3)
climo_atm_monthly_diurnal_8xdaily_180x360_aave_1850-1851.status:ERROR (3)
climo_atm_monthly_diurnal_8xdaily_180x360_aave_1850-1853.status:ERROR (3)
climo_atm_monthly_diurnal_8xdaily_180x360_aave_1852-1853.status:ERROR (3)
e3sm_diags_atm_monthly_180x360_aave_model_vs_obs_1850-1851.status:WAITING 551181
e3sm_diags_atm_monthly_180x360_aave_model_vs_obs_1850-1853.status:WAITING 551183
e3sm_diags_atm_monthly_180x360_aave_model_vs_obs_1852-1853.status:WAITING 551182

E.g., from climo_atm_monthly_180x360_aave_1850-1851.o551169:

[[email protected]] [[email protected]] [[email protected]] [[email protected]] match_arg (../../../../../src/pm/i_hydra/libhydra/arg/hydra_arg.c:91): match_arg (../../../../../src/pm/i_hydra/libhydra/arg/hydra_arg.c:91): match_arg (../../../../../src/pm/i_hydra/libhydra/arg/hydra_arg.c:91): match_arg (../../../../../src/pm/i_hydra/libhydra/arg/hydra_arg.c:91): unrecognized argument H
ncclimo: ERROR monthly climo cmd_clm[1] failed. Debug this:
mpirun -H n0017 -n 1 ncra --clm_bnd=1850,1851,1,1,0 -O --no_tmp_fl --hdr_pad=10000 --gaa climo_script=ncclimo --gaa climo_command="'/share/apps/E3SM/conda_envs/base/envs/e3sm_unified_1.9.0rc10_compy/bin/ncclimo --case=v2.LR.historical_0201 --jobs=4 --thr=1 --parallel=mpi --yr_srt=1850 --yr_end=1851 --input=/compyfs/fors729//E3SMv2/v2.LR.historical_0201/archive/atm/hist --map=/compyfs/diagnostics/maps/map_ne30pg2_to_cmip6_180x360_aave.20200201.nc --output=trash --regrid=output --prc_typ=eam'" --gaa climo_hostname=n0017 --gaa climo_version=5.1.7 --gaa yrs_averaged=1850-1851 -p /compyfs/fors729//E3SMv2/v2.LR.historical_0201/archive/atm/hist  v2.LR.historical_0201.eam.h0.1850-01.nc v2.LR.historical_0201.eam.h0.1851-01.nc trash/v2.LR.historical_0201_01_185001_185101_climo.nc

Output:

$ mv /compyfs/www/fors729/zppy_test_complete_run_www/v2.LR.historical_0201 /compyfs/www/fors729/zppy_test_complete_run_www/v2.LR.historical_0201_20230808v3_zppy_dev_rc10_packages
mv: cannot stat ‘/compyfs/www/fors729/zppy_test_complete_run_www/v2.LR.historical_0201’: No such file or directory
$ mv /compyfs/fors729/zppy_test_complete_run_output/v2.LR.historical_0201/post /compyfs/fors729/zppy_test_complete_run_output/v2.LR.historical_0201/post_20230808v3_zppy_dev_rc10_packages

@forsyth2 (Collaborator, Author) commented Aug 9, 2023

@chengzhuzhang Thank you for confirming stand-alone E3SM Diags results on Perlmutter.

@xylar (Contributor) commented Aug 9, 2023

Regarding Compy, I feel really stuck.

  1. ESMF_RegridWeightGen doesn't work on Compy with Gnu compilers (what I used in rc9). Since ESMF_RegridWeightGen is a critical tool for E3SM-Unified to support, this seems like a clear no-go.
  2. NCO doesn't compile with Intel compilers.
  3. So on Compy compute nodes in rc10, NCO comes from conda-forge instead of being built from source in Spack. Everywhere else, it is built from source with Gnu compilers. It could be that this just doesn't work right.

I will see today if I can get ESMF_RegridWeightGen to build successfully with some combination of Gnu and MPI modules on Compy. That seems like the only plausible way forward at the moment.

@forsyth2 (Collaborator, Author) commented Aug 9, 2023

I will see today if I can get ESMF_RegridWeightGen to build successfully with some combination of Gnu and MPI modules on Compy. That seems like the only plausible way forward at the moment.

Thanks Xylar!

@xylar (Contributor) commented Aug 9, 2023

I'm still working on building an rc11 on Compy with Gnu and MVAPICH2. Spack build is taking hours because it's Compy...

@xylar (Contributor) commented Aug 9, 2023

@forsyth2, the outcome is that I built Spack packages for about 8 hours today (first with Intel by mistake and then with Gnu and MVAPICH2), and TempestRemap just failed to build:

Cannot find a BLACS library for the given MPI

So that seems to be a dead end. Please focus on the machines other than Compy for now while I try to come up with a plan C. Or is it plan H? plan Z?

@mahf708 commented Aug 9, 2023

What needs multi-node MPI functionality besides ncclimo?

@xylar, I am sorry 😢 this is a total maintenance nightmare. We gotta think of the pros and cons of having these spack packages...

@mahf708 commented Aug 10, 2023

Skimming the shapely repo, it seems like there was a geos bug that was fixed for 3.12.

@tomvothecoder (Collaborator) commented Aug 10, 2023

I narrowed my stacktrace in the comment above to the shapely.predicates.has_z() function.

Fatal Python error: Segmentation fault

Current thread 0x00007fc2c5be3740 (most recent call first):
  File "/home/vo13/mambaforge/envs/e3sm_diags_dev/lib/python3.10/site-packages/shapely/predicates.py", line 69 in has_z
  File "/home/vo13/mambaforge/envs/e3sm_diags_dev/lib/python3.10/site-packages/shapely/decorators.py", line 77 in wrapped
  File "/home/vo13/mambaforge/envs/e3sm_diags_dev/lib/python3.10/site-packages/shapely/geometry/base.py", line 607 in has_z
  File "/home/vo13/mambaforge/envs/e3sm_diags_dev/lib/python3.10/site-packages/shapely/geometry/base.py", line 206 in coord

The difference between shapely=1.8.5 and shapely>=2.0.0 is that >=2.0.0 implements has_z() with a decorator called multithreading_enabled. This decorator might be introducing issues somehow (the GIL needs to be released), at least in my test cases.

shapely=1.8.5 vs. shapely=2.0.1: (implementation snippets were attached in the original comment)

@xylar (Contributor) commented Aug 10, 2023

@tomvothecoder, thanks for narrowing things down so much!

What I don't understand is that nothing related to shapely (or geos) changed between rc9 and rc10. You can see #475 (comment) for Chrysalis and here is Perlmutter:

1c1
< # packages in environment at /global/common/software/e3sm/anaconda_envs/base/envs/e3sm_unified_1.9.0rc9_pm-cpu:
---
> # packages in environment at /global/common/software/e3sm/anaconda_envs/base/envs/e3sm_unified_1.9.0rc10_pm-cpu:
16c16
< async-lru                 2.0.3              pyhd8ed1ab_0    conda-forge
---
> async-lru                 2.0.4              pyhd8ed1ab_0    conda-forge
59c59
< cdms2                     3.1.5           py310heeafeea_20    conda-forge
---
> cdms2                     3.1.5           py310heeafeea_21    conda-forge
67d66
< cfitsio                   4.2.0                hd9d235c_0    conda-forge
78c77
< comm                      0.1.3              pyhd8ed1ab_0    conda-forge
---
> comm                      0.1.4              pyhd8ed1ab_0    conda-forge
90c89
< debugpy                   1.6.7           py310heca2aa9_0    conda-forge
---
> debugpy                   1.6.8           py310hc6cd4ac_0    conda-forge
96c95
< e3sm-unified              1.9.0rc9        hpc_py310_hd6e50ed_0    e3sm/label/e3sm_dev
---
> e3sm-unified              1.9.0rc10       hpc_py310_hd6e50ed_0    e3sm/label/e3sm_dev
98c97
< e3sm_to_cmip              1.10.0rc1          pyhe9a6732_0    conda-forge/label/e3sm_to_cmip_dev
---
> e3sm_to_cmip              1.10.0rc2          pyhe9a6732_0    conda-forge/label/e3sm_to_cmip_dev
119c118
< fonttools                 4.41.1          py310h2372a71_0    conda-forge
---
> fonttools                 4.42.0          py310h2372a71_0    conda-forge
156c155
< imagecodecs               2023.7.10       py310h4c4fb95_0    conda-forge
---
> imagecodecs               2023.7.10       py310hc929067_2    conda-forge
167c166
< ipywidgets                8.0.7              pyhd8ed1ab_0    conda-forge
---
> ipywidgets                8.1.0              pyhd8ed1ab_0    conda-forge
170c169
< jedi                      0.18.2             pyhd8ed1ab_0    conda-forge
---
> jedi                      0.19.0             pyhd8ed1ab_0    conda-forge
179c178
< jsonschema                4.18.4             pyhd8ed1ab_0    conda-forge
---
> jsonschema                4.18.6             pyhd8ed1ab_0    conda-forge
181c180
< jsonschema-with-format-nongpl 4.18.4             pyhd8ed1ab_0    conda-forge
---
> jsonschema-with-format-nongpl 4.18.6             pyhd8ed1ab_0    conda-forge
187c186
< jupyter_events            0.6.3              pyhd8ed1ab_1    conda-forge
---
> jupyter_events            0.7.0              pyhd8ed1ab_1    conda-forge
190c189
< jupyterlab                4.0.3              pyhd8ed1ab_0    conda-forge
---
> jupyterlab                4.0.4              pyhd8ed1ab_0    conda-forge
207c206
< libarrow                  12.0.1           h657c46f_6_cpu    conda-forge
---
> libarrow                  12.0.1           h657c46f_7_cpu    conda-forge
214c213
< libcap                    2.67                 he9d0100_0    conda-forge
---
> libcap                    2.69                 h0f662aa_0    conda-forge
218,219c217,218
< libclang                  15.0.7          default_h7634d5b_2    conda-forge
< libclang13                15.0.7          default_h9986a30_2    conda-forge
---
> libclang                  15.0.7          default_h7634d5b_3    conda-forge
> libclang13                15.0.7          default_h9986a30_3    conda-forge
246c245
< libllvm14                 14.0.6               hcd5def8_3    conda-forge
---
> libllvm14                 14.0.6               hcd5def8_4    conda-forge
259c258
< librsvg                   2.56.1               h98fae49_0    conda-forge
---
> librsvg                   2.56.3               h98fae49_0    conda-forge
265c264
< libsystemd0               253                  h8c4010b_1    conda-forge
---
> libsystemd0               254                  h3516f8a_0    conda-forge
289c288
< mache                     1.17.0rc1          pyh4bc9f2b_0    conda-forge/label/mache_dev
---
> mache                     1.17.0rc3          pyh4bc9f2b_0    conda-forge/label/mache_dev
297c296
< mpas-analysis             1.9.0rc3           pyh320ef33_0    conda-forge/label/mpas_analysis_dev
---
> mpas-analysis             1.9.0rc4           pyh320ef33_0    conda-forge/label/mpas_analysis_dev
315c314
< nbformat                  5.9.1              pyhd8ed1ab_0    conda-forge
---
> nbformat                  5.9.2              pyhd8ed1ab_0    conda-forge
334c333
< openssl                   3.1.1                hd590300_1    conda-forge
---
> openssl                   3.1.2                hd590300_0    conda-forge
357c356
< platformdirs              3.9.1              pyhd8ed1ab_0    conda-forge
---
> platformdirs              3.10.0             pyhd8ed1ab_0    conda-forge
374c373
< pyarrow                   12.0.1          py310h0576679_6_cpu    conda-forge
---
> pyarrow                   12.0.1          py310h0576679_7_cpu    conda-forge
380c379
< pyparsing                 3.1.0              pyhd8ed1ab_0    conda-forge
---
> pyparsing                 3.1.1              pyhd8ed1ab_0    conda-forge
394c393
< python-utils              3.7.0              pyhd8ed1ab_0    conda-forge
---
> python-utils              3.7.0              pyhd8ed1ab_1    conda-forge
407c406
< referencing               0.30.0             pyhd8ed1ab_0    conda-forge
---
> referencing               0.30.1             pyhd8ed1ab_0    conda-forge
422c421
< sip                       6.7.10          py310hc6cd4ac_0    conda-forge
---
> sip                       6.7.11          py310hc6cd4ac_0    conda-forge
466c465
< wheel                     0.41.0             pyhd8ed1ab_0    conda-forge
---
> wheel                     0.41.1             pyhd8ed1ab_0    conda-forge
521c520
< zppy                      2.3.0rc3           pyh51c0ceb_0    conda-forge/label/zppy_dev
---
> zppy                      2.3.0rc5           pyh51c0ceb_0    conda-forge/label/zppy_dev

One of these changes must be related.

@xylar (Contributor) commented Aug 10, 2023

I tried creating an e3sm_diags CI environment and running. I see the same segfault there:

mamba env create -n e3sm_diags_dev -f conda-env/ci.yml
mamba activate e3sm_diags_dev
python -m pip install .

Then, on a compute node:

source /global/homes/x/xylar/mambaforge/etc/profile.d/conda.sh
source /global/homes/x/xylar/mambaforge/etc/profile.d/mamba.sh
mamba activate e3sm_diags_dev

e3sm_diags lat_lon --no_viewer \
   --reference_data_path '/global/cfs/cdirs/e3sm/diagnostics/observations/Atm/climatology' \
   --test_data_path '/global/cfs/cdirs/e3sm/e3sm_diags/postprocessed_e3sm_v2_data_for_e3sm_diags/20210528.v2rc3e.piControl.ne30pg2_EC30to60E2r2.chrysalis/climatology/rgr/' \
   --results_dir '/global/homes/x/xylar/tmp/results_conda' --case_id 'GPCP_v3.2' \
   --run_type 'model_vs_obs' --sets 'lat_lon' --variables 'PRECT' --seasons 'ANN' \
   --ref_name 'GPCP_v3.2'
2023-08-10 03:07:56,879 [INFO]: lat_lon_driver.py(run_diag:146) >> Variable: PRECT
2023-08-10 03:07:56,953 [INFO]: lat_lon_driver.py(run_diag:225) >> Selected region: global
2023-08-10 03:08:02,205 [INFO]: lat_lon_driver.py(create_and_save_data_and_metrics:51) >> Metrics saved in /global/homes/x/xylar/tmp/results_conda/lat_lon/GPCP_v3.2/GPCP_v3.2-PRECT-ANN-global.json
./job_conda.sh: line 7: 2356478 Segmentation fault      e3sm_diags lat_lon --no_viewer --reference_data_path '/global/cfs/cdirs/e3sm/diagnostics/observations/Atm/climatology' --test_data_path '/global/cfs/cdirs/e3sm/e3sm_diags/postprocessed_e3sm_v2_data_for_e3sm_diags/20210528.v2rc3e.piControl.ne30pg2_EC30to60E2r2.chrysalis/climatology/rgr/' --results_dir '/global/homes/x/xylar/tmp/results_conda' --case_id 'GPCP_v3.2' --run_type 'model_vs_obs' --sets 'lat_lon' --variables 'PRECT' --seasons 'ANN' --ref_name 'GPCP_v3.2'

@xylar (Contributor) commented Aug 10, 2023

I think that will help a lot with debugging because it suggests it's nothing specific to Spack or E3SM-Unified itself. It's some conda package or combination thereof.

Update: I see that @tomvothecoder had already been testing with just e3sm_diags on its own. Sorry, I missed that.

@xylar (Contributor) commented Aug 10, 2023

@chengzhuzhang, bad news, I've been able to narrow it down to the cdms2 patch I did in conda-forge/cdms2-feedstock#89.

If I create an environment with:

mamba create -y -n e3sm_diags_dev -c conda-forge/label/e3sm_diags_dev e3sm_diags=2.9.0rc2 python=3.10 "hdf5=*=mpi_mpich*"

I get

cdms2                     3.1.5           py310heeafeea_21

and the segfault.

If I do:

mamba create -y -n e3sm_diags_dev -c conda-forge/label/e3sm_diags_dev e3sm_diags=2.9.0rc2 python=3.10 cdms2=3.1.5=py310heeafeea_20 "hdf5=*=mpi_mpich*"

I get:

cdms2                     3.1.5           py310heeafeea_20

and no segfault.

So it seems like CDMS2 can't be used with the latest ESMF/ESMPy. This is a pretty giant setback because we've been working hard here to move from ESMF v8.2.0 to v8.4.2, which would be a big jump forward for us.

@mahf708 and I have also put quite a bit of work into supporting ESMF without MPI, and that work only applies to v8.4.2, not preceding versions.

I need to think about what the options are.

Update: So far the only thing I've come up with is to install development versions of both e3sm_diags and cdms2 on Perlmutter or Chrysalis and debug some more there. It feels like there's not a good option besides trying to patch cdms2 to work with the latest ESMF/ESMPy. Anything else is a nightmare of trying to take all the other packages back in time.
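
For anyone wanting to reproduce that setup, a sketch of installing a development copy of e3sm_diags (the clone URL is inferred from the repository referenced in this thread, and the environment steps mirror the CI-environment commands shown earlier; a development cdms2 would need a separate source build, not shown here):

git clone https://github.com/E3SM-Project/e3sm_diags.git
cd e3sm_diags
mamba env create -n e3sm_diags_dev -f conda-env/ci.yml
mamba activate e3sm_diags_dev
python -m pip install .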

@forsyth2 (Collaborator, Author):

@xylar Thanks for narrowing down the issue! I'm sorry this has become such an ordeal.

@chengzhuzhang (Collaborator):

@xylar Thank you for the continued troubleshooting effort. In my standalone e3sm_diags environment, I have the following:
esmf 8.4.2 nompi_ha7f9e30_1 conda-forge
cdms2 3.1.5 py310heeafeea_20 conda-forge
hdf5 1.14.1 nompi_h4f84152_100 conda-forge

This environment works okay. So my understanding is that we need to patch cdms2 to work with the MPI version of esmf?

@xylar (Contributor) commented Aug 10, 2023

@chengzhuzhang, I don't think MPI has anything to do with it. I just happened to get the MPI version in one case and the nompi version in the other. I tested build 20 with both, and it was fine either way.

So I think the issue is entirely with my patch and unrelated to mpi_mpich vs. nompi.

@tomvothecoder (Collaborator):

I tested Xylar's e3sm_diags environments in his comment here and confirmed that e3sm_diags works with the older patch version 20 of cdms2=3.1.5.

In my comment here, the latest patch version 21 of cdms2=3.1.5 might be running into issues with shapely>2.0.0 (maybe multi-threading GIL issues?). This might explain why downgrading to shapely=1.8.5 works with the latest patched version 21 of cdms2.

In any case, as Xylar mentioned, fixing cdms2 is the way to go since it is a direct dependency of e3sm_diags.

@chengzhuzhang (Collaborator) commented Aug 10, 2023

Oh, I understand now! It is cdms2 build 21 that caused the problem. I failed to notice the build number. Thanks @xylar and @tomvothecoder for further confirming!

@forsyth2 linked an issue Aug 11, 2023 that may be closed by this pull request
@xylar (Contributor) commented Aug 14, 2023

Unfortunately, I didn't have time to debug this today. I will try to figure it out tomorrow.

@chengzhuzhang (Collaborator):

Thanks for the heads-up. Let us know if there is anything the team can do to help further troubleshoot or solve this issue.

@tomvothecoder (Collaborator) commented Aug 14, 2023

Thanks for the heads-up. Let us know if there is anything the team can do to help further troubleshoot or solve this issue.

I created stand-alone environments with different versions of cdms2 and shapely and ran this minimal reproducible example based on the stack trace above. The results are recorded below.

Test code

import cdms2
from shapely.geometry.polygon import LinearRing

geo = LinearRing(((-180, -90), (-180, 90), (180, 90), (180, -90), (-180, -90)))

geo.has_z

Environment test cases

  1. cdms2 patch 20 and shapely=1.8.5
    • Env: mamba create -n cdms_shapely_1 -c conda-forge cdms2=3.1.5=py310heeafeea_20 shapely=1.8.5
    • Test code output: successful
  2. cdms2 patch 20 and shapely=2.0.1
    • Env: mamba create -n cdms_shapely_2 -c conda-forge cdms2=3.1.5=py310heeafeea_20 shapely=2.0.1
    • Test code output: successful
  3. cdms2 patch 21 and shapely=1.8.5
    • Env: mamba create -n cdms_shapely_3 -c conda-forge cdms2=3.1.5=py310heeafeea_21 shapely=1.8.5
    • Test code output: successful and [WARNING] yaksa: 10 leaked handle pool objects
  4. cdms2 patch 21 and shapely=2.0.1
    • Env: mamba create -n cdms_shapely_4 -c conda-forge cdms2=3.1.5=py310heeafeea_21 shapely=2.0.1
    • Test code output: Segmentation fault (core dumped)

Result

Test case 4 with cdms2 patch 21 and shapely=2.0.1 is the only one that breaks. These versions of cdms2 and shapely are being installed in the latest e3sm_diags=2.9.0rc2 and e3sm_unified RC 10.
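
To check whether a given environment has this combination installed, something like the following can be used (the environment name is a placeholder):

conda list -n <env_name> | grep -E "cdms2|esmf|shapely"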

Possible reasons

As mentioned in this comment, shapely>=2.0 now wraps the shapely.predicates.has_z() function with the new multithreading_enabled decorator. This decorator requires the GIL to be released so that it can be used for whatever function is being wrapped (shapely.predicates.has_z() in this case).

My findings lead me to believe that the multithreading_enabled decorator can't be executed because the GIL is occupied by a thread from another Python process, resulting in the segmentation fault. There is more info on Python's GIL and segmentation faults in this Stack Overflow post.

The question is: what was introduced in cdms2 patch 21 (and probably esmf=8.4.2) that might be conflicting with shapely>=2.0?

@xylar (Contributor) commented Aug 15, 2023

@tomvothecoder, your investigation is very helpful indeed!

@xylar (Contributor) commented Aug 15, 2023

Some further clues.

Conda env

mamba create -y -n test "cdms2=3.1.5=*_21" shapely=2.0.1 esmf=8.4.2

(so the same as @tomvothecoder's cdms_shapely_4 as far as I can tell).

Tom's script:

#!/usr/bin/env python3

import cdms2
from shapely.geometry.polygon import LinearRing

geo = LinearRing(((-180, -90), (-180, 90), (180, 90), (180, -90), (-180, -90)))

geo.has_z

Result: segfault.

Replace cdms2 with esmpy

#!/usr/bin/env python3

import esmpy
from shapely.geometry.polygon import LinearRing

geo = LinearRing(((-180, -90), (-180, 90), (180, 90), (180, -90), (-180, -90)))

geo.has_z

Result: segfault.

Import shapely first, then cdms2 or esmpy

#!/usr/bin/env python3

from shapely.geometry.polygon import LinearRing
import esmpy

geo = LinearRing(((-180, -90), (-180, 90), (180, 90), (180, -90), (-180, -90)))

geo.has_z

Result: success!

@xylar (Contributor) commented Aug 15, 2023

My proposed fix is E3SM-Project/e3sm_diags#715. Please have a look and also test this out more extensively.
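
For readers skimming the thread, the essence of the fix is the import ordering demonstrated in the experiments above: load shapely (and therefore GEOS) before anything that imports esmpy or cdms2. A minimal sketch of applying that at the top of an entry-point script (an illustration under that assumption, not necessarily the exact change made in E3SM-Project/e3sm_diags#715):

#!/usr/bin/env python3

# Load shapely (which pulls in GEOS) first, before any module that imports
# esmpy or cdms2.
import shapely  # noqa: F401

# Only afterwards import the ESMF-backed packages.
import esmpy  # noqa: F401

from shapely.geometry.polygon import LinearRing

geo = LinearRing(((-180, -90), (-180, 90), (180, 90), (180, -90), (-180, -90)))
print(geo.has_z)  # this ordering avoided the segfault in the experiments above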

@tomvothecoder (Collaborator):

My proposed fix is E3SM-Project/e3sm_diags#715. Please have a look and also test this out more extensively.

Thank you @xylar! I'm glad you found a quick solution to get e3sm_diags working.

@chengzhuzhang (Collaborator):

I tested standalone e3sm_diags on both Perlmutter and Compy. Both runs are okay. @xylar I'm going to create an e3sm_diags rc3. I'm not sure if there is a viable path forward for the Compy issue with building NCO? It seems like that is the last issue that needs to be resolved...

@forsyth2 (Collaborator, Author):

Commit 4: Compy -- latest zppy dev (conda activate zppy_dev_pre_rc6), using rc10 packages
Code fails with the errors described in #474

Using rc12, I'm not actually encountering this error anymore. (Note: I am running zppy from e3sm_unified_1.9.0rc12_login rather than the zppy dev environment here. But the failing ncclimo call was always made from a Unified RC environment, so I don't see how that would have changed anything.)

@xylar (Contributor) commented Aug 17, 2023

@forsyth2, in rc12 I undid the changes I made between rc9 and rc10. I am building the various Spack packages with Gnu and OpenMPI again, rather than building with Intel and installing NCO from conda even on compute nodes. So to me it is unsurprising that your issues are fixed.

The problem is that my issues with ESMF_RegridWeightGen in MPAS-Analysis likely remain, so I have to live with certain configurations being broken for now.

@forsyth2 (Collaborator, Author):

For reference, this is how to create the version lists/diffs @mahf708 shows in #475 (comment), #475 (comment), #475 (comment):

# On login node:
conda list > login_versions.txt
# On compute node
conda list > compute_versions.txt
# On either
diff login_versions.txt compute_versions.txt > diff.txt
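
The rc9 vs rc10 package diffs shown earlier in the thread can be produced the same way by pointing conda list at the two Unified environments (environment names taken from the diff headers above; this assumes the E3SM-Unified conda base is available):

conda list -n e3sm_unified_1.9.0rc9_login > rc9_login.txt
conda list -n e3sm_unified_1.9.0rc10_login > rc10_login.txt
diff rc9_login.txt rc10_login.txt > rc9_vs_rc10_login.txt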

@forsyth2 (Collaborator, Author):

Closing this pull request. Resolved by Unified rc12.

Labels: semver: bug (Bug fix; will increment patch version)
Projects: None yet
Successfully merging this pull request may close these issues: Errors on v2.3.0rc5 (E3SM Unified 1.9.0rc10)
5 participants