Load balancing on multiple nodes causing crash #709

Open
cw646 opened this issue Mar 12, 2021 · 1 comment

@cw646
Contributor

cw646 commented Mar 12, 2021

Describe the bug
When running on more than one MPI node, the load-balancing step crashes the simulation.

To Reproduce
Steps to reproduce the behavior:

  1. Set the number of MPI nodes to > 1.
  2. Set the number of load-balancing steps to less than the total number of steps.
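
To illustrate why the second condition matters, here is a minimal sketch of how a periodic load-balancing trigger might behave. It is not taken from the code base: the helper `rebalance_steps`, the parameter name `nload_balance_steps`, the 0-based step loop, and the choice to skip step 0 are all assumptions made for illustration.

```cpp
#include <cstdint>
#include <vector>

// Hypothetical helper (not from the code base): lists the steps at which a
// rebalance would fire, ASSUMING it triggers on every positive multiple of
// nload_balance_steps and only when more than one MPI rank is in use.
std::vector<std::uint64_t> rebalance_steps(std::uint64_t nsteps,
                                           std::uint64_t nload_balance_steps,
                                           int mpi_size) {
  std::vector<std::uint64_t> fired;
  // Nothing to rebalance on a single rank; a zero interval is treated as off
  if (mpi_size <= 1 || nload_balance_steps == 0) return fired;
  for (std::uint64_t step = 0; step < nsteps; ++step) {
    // Assumed trigger condition: skip step 0, fire on each multiple
    if (step > 0 && step % nload_balance_steps == 0) fired.push_back(step);
  }
  return fired;
}
```

Under these assumptions, an interval equal to or larger than the total number of steps never triggers a rebalance, so the crash path would only be reached when the interval is smaller than the run length and more than one rank is used.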

Expected behavior
Load balancing should not cause the simulation to halt.

Runtime environment (please complete the following information):

  • Develop Branch
  • Error occurs on TACC: Frontera and Cambridge HPC: CSD3 (Tested on CCLAKE and SKYLAKE).
    • Both compiled with Intel compilers, vtk, and OpenMP 5.0/GCC > 9.
@kks32
Contributor

kks32 commented Mar 17, 2021

I noticed that with load balancing enabled, we get the following error at the end of the MPM iteration:

 Caught signal 11 (Segmentation fault: address not mapped to object at address 0x7fa790bbc038)
==== backtrace (tid: 695966) ====
 0  /lib64/libucs.so.0(ucs_handle_error+0x2a4) [0x7f6f69703524]
 1  /lib64/libucs.so.0(+0x290cd) [0x7f6f697060cd]
 2  /lib64/libucs.so.0(+0x292aa) [0x7f6f697062aa]
==== backtrace (tid: 695965) ====
 0  /lib64/libucs.so.0(ucs_handle_error+0x2a4) [0x7fb40ebfa524]
 1  /lib64/libucs.so.0(+0x290cd) [0x7fb40ebfd0cd]
 2  /lib64/libucs.so.0(+0x292aa) [0x7fb40ebfd2aa]
 3  /lib64/libpthread.so.0(+0x141e0) [0x7fb4b3a2a1e0]
 4  /home/user/intel/compilers_and_libraries_2020.2.254/linux/mpi/intel64/lib/release/libmpi.so.12(+0x132313) [0x7fb4b3b75313]
 5  /home/user/intel/compilers_and_libraries_2020.2.254/linux/mpi/intel64/lib/release/libmpi.so.12(PMPI_Buffer_detach+0x8c) [0x7fb4b3b7867c]
 6  ./mpm() [0x414a74]
 7  /lib64/libc.so.6(__libc_start_main+0xf2) [0x7fb4a3f6b1e2]
 8  ./mpm() [0x417b1e]
=================================
 3  /lib64/libpthread.so.0(+0x141e0) [0x7f700e5331e0]
 4  /home/user/intel/compilers_and_libraries_2020.2.254/linux/mpi/intel64/lib/release/libmpi.so.12(+0x132313) [0x7f700e67e313]
 5  /home/user/intel/compilers_and_libraries_2020.2.254/linux/mpi/intel64/lib/release/libmpi.so.12(PMPI_Buffer_detach+0x8c) [0x7f700e68167c]
 6  ./mpm() [0x414a74]
 7  /lib64/libc.so.6(__libc_start_main+0xf2) [0x7f6ffea741e2]
==== backtrace (tid: 695967) ====
 0  /lib64/libucs.so.0(ucs_handle_error+0x2a4) [0x7fa808fa2524]
 1  /lib64/libucs.so.0(+0x290cd) [0x7fa808fa50cd]
 2  /lib64/libucs.so.0(+0x292aa) [0x7fa808fa52aa]
 3  /lib64/libpthread.so.0(+0x141e0) [0x7fa8addd21e0]
 4  /home/user/intel/compilers_and_libraries_2020.2.254/linux/mpi/intel64/lib/release/libmpi.so.12(+0x132313) [0x7fa8adf1d313]
 5  /home/user/intel/compilers_and_libraries_2020.2.254/linux/mpi/intel64/lib/release/libmpi.so.12(PMPI_Buffer_detach+0x8c) [0x7fa8adf2067c]
 6  ./mpm() [0x414a74]
 7  /lib64/libc.so.6(__libc_start_main+0xf2) [0x7fa89e3131e2]
 8  ./mpm() [0x417b1e]
=================================
 8  ./mpm() [0x417b1e]
=================================
[caee-userk:695964:0:695964] Caught signal 11 (Segmentation fault: address not mapped to object at address 0x7f0112f42038)
==== backtrace (tid: 695964) ====
 0  /lib64/libucs.so.0(ucs_handle_error+0x2a4) [0x7f018b328524]
 1  /lib64/libucs.so.0(+0x290cd) [0x7f018b32b0cd]
 2  /lib64/libucs.so.0(+0x292aa) [0x7f018b32b2aa]
 3  /lib64/libpthread.so.0(+0x141e0) [0x7f02301581e0]
 4  /home/user/intel/compilers_and_libraries_2020.2.254/linux/mpi/intel64/lib/release/libmpi.so.12(+0x132313) [0x7f02302a3313]
 5  /home/user/intel/compilers_and_libraries_2020.2.254/linux/mpi/intel64/lib/release/libmpi.so.12(PMPI_Buffer_detach+0x8c) [0x7f02302a667c]
 6  ./mpm() [0x414a74]
 7  /lib64/libc.so.6(__libc_start_main+0xf2) [0x7f02206991e2]
 8  ./mpm() [0x417b1e]
=================================

===================================================================================
=   BAD TERMINATION OF ONE OF YOUR APPLICATION PROCESSES
=   RANK 0 PID 695964 RUNNING AT caee-userk
=   KILLED BY SIGNAL: 9 (Killed)
===================================================================================

===================================================================================
=   BAD TERMINATION OF ONE OF YOUR APPLICATION PROCESSES
=   RANK 1 PID 695965 RUNNING AT caee-userk
=   KILLED BY SIGNAL: 9 (Killed)
===================================================================================

===================================================================================
=   BAD TERMINATION OF ONE OF YOUR APPLICATION PROCESSES
=   RANK 2 PID 695966 RUNNING AT caee-userk
=   KILLED BY SIGNAL: 11 (Segmentation fault)
===================================================================================

===================================================================================
=   BAD TERMINATION OF ONE OF YOUR APPLICATION PROCESSES
=   RANK 3 PID 695967 RUNNING AT caee-userk
=   KILLED BY SIGNAL: 9 (Killed)
===================================================================================

This may be the cause or a side effect. However, the result looks good in 2D and doesn't crash:
Step 1:
image
Step 2:
image
