
Segmentation Fault relion_refine_mpi #1154

Open

KrisJanssen opened this issue Jun 25, 2024 · 24 comments
@KrisJanssen

KrisJanssen commented Jun 25, 2024

Describe your problem

I created a Docker image to benchmark RELION 4.0.1-commit-ec417f on multiple hosts in our organization.

For the benchmark, I use a standard dataset: ftp://ftp.mrc-lmb.cam.ac.uk/pub/scheres/relion_benchmark.tar.gz

The Dockerfile is here: https://gist.github.com/KrisJanssen/7ff75ad91926e46daa767d71c48f7ced

So far, the resulting container ran fine on any system I threw it at, whether on-premises or on some of our Azure VMs.

Today, I tested the same image and job on a new on-premises system, which ultimately resulted in a segmentation fault.

Environment:

  • OS: Red Hat Enterprise Linux release 8.7 (Ootpa) - 4.18.0-513.24.1.el8_9.x86_64
  • Docker: Docker version 26.0.1, build d260a54
  • Container OS: Ubuntu 20.04.6 LTS (focal)
  • MPI runtime: mpirun (Open MPI) 4.0.3
  • RELION version 4.0.1-commit-ec417f
  • Memory: 2000 GB
  • CPU: 8x AMD EPYC 7542 32-Core Processor
  • GPU: 8x GA100 [A100 SXM4 80GB] (rev a1)
  • GPU Driver: 550.54.15
  • Cuda: Build cuda_12.0.r12.0/compiler.32267302_0

Dataset:

  • ftp://ftp.mrc-lmb.cam.ac.uk/pub/scheres/relion_benchmark.tar.gz

Job options:

Error message:


Starting the job:

INFO: MPS server daemon started
INFO: 1436  GB of system memory free, pre-reading images
INFO: Running RELION with:
  8 GPUs
  17 MPI processes total
  2 MPI processes per GPU
  6 threads per worker process
+ mpirun --allow-run-as-root -n 17 --oversubscribe relion_refine_mpi --gpu --i Particles/shiny_2sets.star --ref emd_2660.map:mrc --firstiter_cc --ini_high 60 --ctf --ctf_corrected_ref --tau2_fudge 4 --K 6 --flatten_solvent --healpix_order 2 --sym C1 --iter 25 --particle_diameter 360 --zero_mask --oversampling 1 --offset_range 5 --offset_step 2 --norm --scale --random_seed 0 --pool 100 --dont_combine_weights_via_disc --o /host_pwd/run.2024.06.25.22.19 --j 6 --preread_images
+ tee /host_pwd/run.2024.06.25.22.19/log.txt
--------------------------------------------------------------------------
No OpenFabrics connection schemes reported that they were able to be
used on a specific port.  As such, the openib BTL (OpenFabrics
support) will be disabled for this port.

  Local host:           XXX
  Local device:         qedr0
  Local port:           1
  CPCs attempted:       udcm
--------------------------------------------------------------------------
[qelr_create_qp:746]create qp: failed on ibv_cmd_create_qp with 22
[qelr_create_qp:746]create qp: failed on ibv_cmd_create_qp with 22
[qelr_create_qp:746]create qp: failed on ibv_cmd_create_qp with 22
[qelr_create_qp:746]create qp: failed on ibv_cmd_create_qp with 22

<tons more of the same 'failed' messages>
 === RELION MPI setup ===
 + Number of MPI processes             = 17
 + Number of threads per MPI process   = 6
 + Total number of threads therefore   = 102
 + Leader  (0) runs on host            = XXX
 + Follower     1 runs on host            = XXX
 + Follower     2 runs on host            = XXX
 + Follower     3 runs on host            = XXX
 + Follower     4 runs on host            = XXX
 + Follower     5 runs on host            = XXX
 + Follower     6 runs on host            = XXX
 + Follower     7 runs on host            = XXX
 + Follower     8 runs on host            = XXX
 + Follower     9 runs on host            = XXX
 + Follower    10 runs on host            = XXX
 + Follower    11 runs on host            = XXX
 + Follower    12 runs on host            = XXX
 + Follower    13 runs on host            = XXX
 + Follower    14 runs on host            = XXX
 + Follower    15 runs on host            = XXX
 + Follower    16 runs on host            = XXX
 =================
[XXX.dir.ucb-group.com:00235] 33 more processes have sent help message help-mpi-btl-openib-cpc-base.txt / no cpcs for port
[XXX.dir.ucb-group.com:00235] Set MCA parameter "orte_base_help_aggregate" to 0 to see all help / error messages
 uniqueHost XXX has 16 ranks.
GPU-ids not specified for this rank, threads will automatically be mapped to available devices.
 Thread 0 on follower 1 mapped to device 0
 Thread 1 on follower 1 mapped to device 0
 Thread 2 on follower 1 mapped to device 0
 Thread 3 on follower 1 mapped to device 0
 Thread 4 on follower 1 mapped to device 0
 Thread 5 on follower 1 mapped to device 0
GPU-ids not specified for this rank, threads will automatically be mapped to available devices.
 Thread 0 on follower 2 mapped to device 0
 Thread 1 on follower 2 mapped to device 0
 Thread 2 on follower 2 mapped to device 0
 Thread 3 on follower 2 mapped to device 0
 Thread 4 on follower 2 mapped to device 0
 Thread 5 on follower 2 mapped to device 0
GPU-ids not specified for this rank, threads will automatically be mapped to available devices.
 Thread 0 on follower 3 mapped to device 1
 Thread 1 on follower 3 mapped to device 1

<a bunch more MPI messages>

Then finally, it all goes pear-shaped:

The following warnings were encountered upon command-line parsing:
WARNING: Option --ctf_corrected_ref     is not a valid RELION argument
 Running CPU instructions in double precision.
WARNING: Particles/shiny_2sets.star seems to be from a previous version of Relion. Attempting conversion...
         You should make sure metadata in the optics group table after conversion is correct.
 Estimating initial noise spectra from 1000 particles
   2/   2 sec ............................................................~~(,_,">
[XXX:00419] *** Process received signal ***
[XXX:00419] Signal: Segmentation fault (11)
[XXX:00419] Signal code: Address not mapped (1)
[XXX:00419] Failing at address: (nil)
[XXX:00419] [ 0] /usr/lib/x86_64-linux-gnu/libpthread.so.0(+0x14420)[0x7f46dda7e420]
[XXX:00419] [ 1] /usr/lib/x86_64-linux-gnu/libibverbs.so.1(ibv_dereg_mr+0xe)[0x7f46dc23003e]
[XXX:00419] [ 2] /usr/lib/x86_64-linux-gnu/libfabric.so.1(+0x536c7)[0x7f46d6dbe6c7]
[XXX:00419] [ 3] /usr/lib/x86_64-linux-gnu/libfabric.so.1(+0x758d9)[0x7f46d6de08d9]
[XXX:00419] [ 4] /usr/lib/x86_64-linux-gnu/libfabric.so.1(+0x759c1)[0x7f46d6de09c1]
[XXX:00419] [ 5] /usr/lib/x86_64-linux-gnu/libfabric.so.1(+0x792b5)[0x7f46d6de42b5]
[XXX:00419] [ 6] /usr/lib/x86_64-linux-gnu/libfabric.so.1(+0x752d3)[0x7f46d6de02d3]
[XXX:00419] [ 7] /usr/lib/x86_64-linux-gnu/openmpi/lib/openmpi3/mca_mtl_ofi.so(+0x4c8e)[0x7f46dc248c8e]
[XXX:00419] [ 8] /usr/lib/x86_64-linux-gnu/openmpi/lib/openmpi3/mca_pml_cm.so(+0x2958)[0x7f46dc26c958]
[XXX:00419] [ 9] /usr/lib/x86_64-linux-gnu/libmpi.so.40(ompi_coll_base_bcast_intra_generic+0x3c0)[0x7f46dddb9c50]
[XXX:00419] [10] /usr/lib/x86_64-linux-gnu/libmpi.so.40(ompi_coll_base_bcast_intra_pipeline+0xd1)[0x7f46dddba061]
[XXX:00419] [11] /usr/lib/x86_64-linux-gnu/openmpi/lib/openmpi3/mca_coll_tuned.so(ompi_coll_tuned_bcast_intra_dec_fixed+0x12e)[0x7f46d6c23dae]
[XXX:00419] [12] /usr/lib/x86_64-linux-gnu/libmpi.so.40(MPI_Bcast+0x120)[0x7f46ddd7cb10]
[XXX:00419] [13] relion_refine_mpi(_ZN7MpiNode16relion_MPI_BcastEPvlP15ompi_datatype_tiP19ompi_communicator_t+0x176)[0x55f6416e5566]
[XXX:00419] [14] relion_refine_mpi(_ZN14MlOptimiserMpi10initialiseEv+0x178)[0x55f6416cbfa8]
[XXX:00419] [15] relion_refine_mpi(main+0x71)[0x55f641683d11]
[XXX:00419] [16] /usr/lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0xf3)[0x7f46dd4fc083]
[XXX:00419] [17] relion_refine_mpi(_start+0x2e)[0x55f64168758e]
[XXX:00419] *** End of error message ***
[qelr_poll_cq_req:2103]Error: POLL CQ with ROCE_CQE_REQ_STS_WORK_REQUEST_FLUSHED_ERR. QP icid=0x11b1
relion_refine_mpi: prov/verbs/src/verbs_cq.c:404: fi_ibv_poll_cq: Assertion `wre && (wre->ep || wre->srq)' failed.
[XXX:00420] *** Process received signal ***
[XXX:00420] Signal: Aborted (6)
[XXX:00420] Signal code:  (-6)
[qelr_poll_cq_req:2103]Error: POLL CQ with ROCE_CQE_REQ_STS_WORK_REQUEST_FLUSHED_ERR. QP icid=0x11e1
relion_refine_mpi: prov/verbs/src/verbs_cq.c:404: fi_ibv_poll_cq: Assertion `wre && (wre->ep || wre->srq)' failed.
[XXX:00417] *** Process received signal ***
[XXX:00417] Signal: Aborted (6)
[XXX:00417] Signal code:  (-6)
[qelr_poll_cq_req:2103]Error: POLL CQ with ROCE_CQE_REQ_STS_WORK_REQUEST_FLUSHED_ERR. QP icid=0x121d
relion_refine_mpi: prov/verbs/src/verbs_cq.c:404: fi_ibv_poll_cq: Assertion `wre && (wre->ep || wre->srq)' failed.
[XXX:00418] *** Process received signal ***
[XXX:00418] Signal: Aborted (6)
[XXX:00418] Signal code:  (-6)
[qelr_poll_cq_req:2103]Error: POLL CQ with ROCE_CQE_REQ_STS_WORK_REQUEST_FLUSHED_ERR. QP icid=0x126d
relion_refine_mpi: prov/verbs/src/verbs_cq.c:404: fi_ibv_poll_cq: Assertion `wre && (wre->ep || wre->srq)' failed.
[qelr_poll_cq_req:2103]Error: POLL CQ with ROCE_CQE_REQ_STS_WORK_REQUEST_FLUSHED_ERR. QP icid=0x129b
relion_refine_mpi: prov/verbs/src/verbs_cq.c:404: fi_ibv_poll_cq: Assertion `wre && (wre->ep || wre->srq)' failed.
[XXX:00447] *** Process received signal ***
[XXX:00447] Signal: Aborted (6)
[XXX:00447] Signal code:  (-6)
[XXX:00435] *** Process received signal ***
[XXX:00435] Signal: Aborted (6)
[XXX:00435] Signal code:  (-6)
[XXX:00420] [ 0] /usr/lib/x86_64-linux-gnu/libpthread.so.0(+0x14420)[0x7f7e23b9f420]
[XXX:00420] [XXX:00417] [ 0] /usr/lib/x86_64-linux-gnu/libpthread.so.0(+0x14420)[0x7f10c7de6420]
[XXX:00417] [ 1] [ 1] /usr/lib/x86_64-linux-gnu/libc.so.6(gsignal+0xcb)[0x7f10c788300b]
[XXX:00417] [ 2] /usr/lib/x86_64-linux-gnu/libc.so.6(gsignal+0xcb)[0x7f7e2363c00b]
[XXX:00420] [ 2] /usr/lib/x86_64-linux-gnu/libc.so.6(abort+0x12b)[0x7f7e2361b859]
[XXX:00420] [ 3] /usr/lib/x86_64-linux-gnu/libc.so.6(+0x22729)[0x7f7e2361b729]
[XXX:00420] [ 4] /usr/lib/x86_64-linux-gnu/libc.so.6(abort+0x12b)[0x7f10c7862859]
[XXX:00417] [ 3] /usr/lib/x86_64-linux-gnu/libc.so.6(+0x22729)[0x7f10c7862729]
[XXX:00417] [ 4] /usr/lib/x86_64-linux-gnu/libc.so.6(+0x33fd6)[0x7f10c7873fd6]
[XXX:00417] [ 5] [XXX:00418] [ 0] /usr/lib/x86_64-linux-gnu/libpthread.so.0(+0x14420)[0x7fc06a39b420]
[XXX:00418] [ 1] /usr/lib/x86_64-linux-gnu/libc.so.6(+0x33fd6)[0x7f7e2362cfd6]
[XXX:00420] [ 5] /usr/lib/x86_64-linux-gnu/libfabric.so.1(+0x51029)[0x7f7e20f08029]
[XXX:00420] [ 6] /usr/lib/x86_64-linux-gnu/libfabric.so.1(+0x5144d)[0x7f7e20f0844d]
[XXX:00420] [ 7] /usr/lib/x86_64-linux-gnu/libfabric.so.1(+0x7af59)[0x7f7e20f31f59]
[XXX:00420] [ 8] /usr/lib/x86_64-linux-gnu/libfabric.so.1(+0x28475)[0x7f7e20edf475]
[XXX:00420] [ 9] /usr/lib/x86_64-linux-gnu/libfabric.so.1(+0x51029)[0x7f10c512d029]
[XXX:00417] [ 6] /usr/lib/x86_64-linux-gnu/libfabric.so.1(+0x5144d)[0x7f10c512d44d]
[XXX:00417] [ 7] /usr/lib/x86_64-linux-gnu/libfabric.so.1(+0x7af59)[0x7f10c5156f59]
[XXX:00417] [ 8] /usr/lib/x86_64-linux-gnu/libfabric.so.1(+0x28475)[0x7f10c5104475]
[XXX:00417] [ 9] /usr/lib/x86_64-linux-gnu/libfabric.so.1(+0x27b02)[0x7f10c5103b02]
[XXX:00417] [10] /usr/lib/x86_64-linux-gnu/openmpi/lib/openmpi3/mca_mtl_ofi.so(ompi_mtl_ofi_progress_no_inline+0xba)[0x7f10c59900da]
[XXX:00417] [11] /usr/lib/x86_64-linux-gnu/libfabric.so.1(+0x27b02)[0x7f7e20edeb02]
[XXX:00420] [10] /usr/lib/x86_64-linux-gnu/openmpi/lib/openmpi3/mca_mtl_ofi.so(ompi_mtl_ofi_progress_no_inline+0xba)[0x7f7e2176b0da]
[XXX:00420] [11] /usr/lib/x86_64-linux-gnu/libopen-pal.so.40(opal_progress+0x34)[0x7f7e234b9854]
/usr/lib/x86_64-linux-gnu/libc.so.6(gsignal+0xcb)[0x7fc069e3800b]
[XXX:00418] [ 2] /usr/lib/x86_64-linux-gnu/libc.so.6(abort+0x12b)[0x7fc069e17859]
[XXX:00418] [ 3] /usr/lib/x86_64-linux-gnu/libc.so.6(+0x22729)[0x7fc069e17729]
[XXX:00418] [ 4] /usr/lib/x86_64-linux-gnu/libopen-pal.so.40(opal_progress+0x34)[0x7f10c7700854]
[XXX:00417] [12] [XXX:00420] [12] /usr/lib/x86_64-linux-gnu/libmpi.so.40(ompi_request_default_wait+0x135)[0x7f7e23e85905]
[XXX:00420] [13] /usr/lib/x86_64-linux-gnu/libmpi.so.40(ompi_request_default_wait+0x135)[0x7f10c80cc905]
[XXX:00417] [13] /usr/lib/x86_64-linux-gnu/libc.so.6(+0x33fd6)[0x7fc069e28fd6]
[XXX:00418] [ 5] /usr/lib/x86_64-linux-gnu/libfabric.so.1(+0x51029)[0x7fc06363e029]
[XXX:00418] [ 6] /usr/lib/x86_64-linux-gnu/libfabric.so.1(+0x5144d)[0x7fc06363e44d]
[XXX:00418] [ 7] /usr/lib/x86_64-linux-gnu/libfabric.so.1(+0x7af59)[0x7fc063667f59]
[XXX:00418] [ 8] /usr/lib/x86_64-linux-gnu/libfabric.so.1(+0x28475)[0x7fc063615475]
[XXX:00418] [ 9] /usr/lib/x86_64-linux-gnu/libfabric.so.1(+0x27b02)[0x7fc063614b02]
[XXX:00418] [10] /usr/lib/x86_64-linux-gnu/openmpi/lib/openmpi3/mca_mtl_ofi.so(ompi_mtl_ofi_progress_no_inline+0xba)[0x7fc0683450da]
[XXX:00418] [11] [XXX:00447] [ 0] /usr/lib/x86_64-linux-gnu/libmpi.so.40(ompi_coll_base_bcast_intra_generic+0x3e4)[0x7f7e23edac74]
[XXX:00420] [14] /usr/lib/x86_64-linux-gnu/libmpi.so.40(ompi_coll_base_bcast_intra_pipeline+0xd1)[0x7f7e23edb061]
[XXX:00420] [15] /usr/lib/x86_64-linux-gnu/libmpi.so.40(ompi_coll_base_bcast_intra_generic+0x26f)[0x7f10c8121aff]
[XXX:00417] [14] /usr/lib/x86_64-linux-gnu/libmpi.so.40(ompi_coll_base_bcast_intra_pipeline+0xd1)[0x7f10c8122061]
[XXX:00417] /usr/lib/x86_64-linux-gnu/libopen-pal.so.40(opal_progress+0x34)[0x7fc069cb5854]
[XXX:00418] [12] /usr/lib/x86_64-linux-gnu/libpthread.so.0(+0x14420)[0x7f561b095420]
[XXX:00447] [ 1] [XXX:00435] [ 0] /usr/lib/x86_64-linux-gnu/libpthread.so.0(+0x14420)[0x7faedc1db420]
[XXX:00435] [ 1] /usr/lib/x86_64-linux-gnu/openmpi/lib/openmpi3/mca_coll_tuned.so(ompi_coll_tuned_bcast_intra_dec_fixed+0x12e)[0x7f7e20d47dae]
[XXX:00420] [16] [15] /usr/lib/x86_64-linux-gnu/openmpi/lib/openmpi3/mca_coll_tuned.so(ompi_coll_tuned_bcast_intra_dec_fixed+0x12e)[0x7f10c4f6cdae]
[XXX:00417] [16] /usr/lib/x86_64-linux-gnu/libc.so.6(gsignal+0xcb)[0x7faedbc7800b]
[XXX:00435] [ 2] /usr/lib/x86_64-linux-gnu/libmpi.so.40(ompi_request_default_wait_all+0xe5)[0x7fc06a681e25]
[XXX:00418] [13] /usr/lib/x86_64-linux-gnu/libmpi.so.40(ompi_coll_base_bcast_intra_generic+0x473)[0x7fc06a6d6d03]
[XXX:00418] /usr/lib/x86_64-linux-gnu/libmpi.so.40(MPI_Bcast+0x120)[0x7f10c80e4b10]
[XXX:00417] [17] /usr/lib/x86_64-linux-gnu/libmpi.so.40(MPI_Bcast+0x120)[0x7f7e23e9db10]
[XXX:00420] [17] [14] /usr/lib/x86_64-linux-gnu/libmpi.so.40(ompi_coll_base_bcast_intra_pipeline+0xd1)[0x7fc06a6d7061]
[XXX:00418] [15] /usr/lib/x86_64-linux-gnu/openmpi/lib/openmpi3/mca_coll_tuned.so(ompi_coll_tuned_bcast_intra_dec_fixed+0x12e)[0x7fc063522dae]
[XXX:00418] [16] /usr/lib/x86_64-linux-gnu/libc.so.6(gsignal+0xcb)[0x7f561ab3200b]
[XXX:00447] [ 2] /usr/lib/x86_64-linux-gnu/libc.so.6(abort+0x12b)[0x7f561ab11859]
[XXX:00447] [ 3] /usr/lib/x86_64-linux-gnu/libc.so.6(+0x22729)[0x7f561ab11729]
[XXX:00447] [ 4] /usr/lib/x86_64-linux-gnu/libc.so.6(abort+0x12b)[0x7faedbc57859]
[XXX:00435] [ 3] /usr/lib/x86_64-linux-gnu/libc.so.6(+0x22729)[0x7faedbc57729]
[XXX:00435] [ 4] /usr/lib/x86_64-linux-gnu/libc.so.6(+0x33fd6)[0x7faedbc68fd6]
[XXX:00435] [ 5] /usr/lib/x86_64-linux-gnu/libfabric.so.1(+0x51029)[0x7faed9522029]
[XXX:00435] [ 6] /usr/lib/x86_64-linux-gnu/libfabric.so.1(+0x5144d)[0x7faed952244d]
relion_refine_mpi(_ZN7MpiNode16relion_MPI_BcastEPvlP15ompi_datatype_tiP19ompi_communicator_t+0x176)[0x55e1ef47b566]
[XXX:00420] [18] relion_refine_mpi(_ZN7MpiNode16relion_MPI_BcastEPvlP15ompi_datatype_tiP19ompi_communicator_t+0x176)[0x55dcd5e5e566]
[XXX:00417] [18] [XXX:00435] [ 7] /usr/lib/x86_64-linux-gnu/libfabric.so.1(+0x7af59)[0x7faed954bf59]
[XXX:00435] [ 8] /usr/lib/x86_64-linux-gnu/libfabric.so.1(+0x28475)[0x7faed94f9475]
[XXX:00435] [ 9] /usr/lib/x86_64-linux-gnu/libfabric.so.1(+0x27b02)[0x7faed94f8b02]
[XXX:00435] [10] /usr/lib/x86_64-linux-gnu/openmpi/lib/openmpi3/mca_mtl_ofi.so(ompi_mtl_ofi_progress_no_inline+0xba)[0x7faed9d850da]
[XXX:00435] [11] /usr/lib/x86_64-linux-gnu/libc.so.6(+0x33fd6)[0x7f561ab22fd6]
[XXX:00447] [ 5] /usr/lib/x86_64-linux-gnu/libfabric.so.1(+0x51029)[0x7f56183dc029]
[XXX:00447] [ 6] /usr/lib/x86_64-linux-gnu/libfabric.so.1(+0x5144d)[0x7f56183dc44d]
[XXX:00447] [ 7] /usr/lib/x86_64-linux-gnu/libfabric.so.1(+0x7af59)[0x7f5618405f59]
[XXX:00447] [ 8] /usr/lib/x86_64-linux-gnu/libfabric.so.1(+0x28475)[0x7f56183b3475]
[XXX:00447] [ 9] /usr/lib/x86_64-linux-gnu/libfabric.so.1(+0x27b02)[0x7f56183b2b02]
[XXX:00447] [10] relion_refine_mpi(_ZN14MlOptimiserMpi10initialiseEv+0x178)[0x55e1ef461fa8]
[XXX:00420] [19] /usr/lib/x86_64-linux-gnu/openmpi/lib/openmpi3/mca_mtl_ofi.so(ompi_mtl_ofi_progress_no_inline+0xba)[0x7f5618c3f0da]
[XXX:00447] [11] /usr/lib/x86_64-linux-gnu/libopen-pal.so.40(opal_progress+0x34)[0x7f561a9af854]
/usr/lib/x86_64-linux-gnu/libmpi.so.40(MPI_Bcast+0x120)[0x7fc06a699b10]
[XXX:00418] [17] /usr/lib/x86_64-linux-gnu/libopen-pal.so.40(opal_progress+0x34)[0x7faedbaf5854]
[XXX:00435] [12] /usr/lib/x86_64-linux-gnu/libmpi.so.40(ompi_request_default_wait+0x135)[0x7faedc4c1905]
[XXX:00435] [13] relion_refine_mpi(_ZN14MlOptimiserMpi10initialiseEv+0x178)[0x55dcd5e44fa8]
[XXX:00417] [19] [XXX:00447] [12] /usr/lib/x86_64-linux-gnu/libmpi.so.40(ompi_request_default_wait+0x135)[0x7f561b37b905]
[XXX:00447] [13] relion_refine_mpi(_ZN7MpiNode16relion_MPI_BcastEPvlP15ompi_datatype_tiP19ompi_communicator_t+0x176)[0x55c57ef5e566]
[XXX:00418] [18] relion_refine_mpi(main+0x71)[0x55e1ef419d11]
[XXX:00420] [20] relion_refine_mpi(main+0x71)[0x55dcd5dfcd11]
[XXX:00417] [20] /usr/lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0xf3)[0x7f7e2361d083]
[XXX:00420] [21] /usr/lib/x86_64-linux-gnu/libmpi.so.40(ompi_coll_base_bcast_intra_generic+0x3e4)[0x7faedc516c74]
[XXX:00435] [14] /usr/lib/x86_64-linux-gnu/libmpi.so.40(ompi_coll_base_bcast_intra_pipeline+0xd1)[0x7faedc517061]
[XXX:00435] [15] /usr/lib/x86_64-linux-gnu/openmpi/lib/openmpi3/mca_coll_tuned.so(ompi_coll_tuned_bcast_intra_dec_fixed+0x12e)[0x7faed9361dae]
[XXX:00435] [16] relion_refine_mpi(_ZN14MlOptimiserMpi10initialiseEv+0x178)[0x55c57ef44fa8]
[XXX:00418] [19] relion_refine_mpi(main+0x71)[0x55c57eefcd11]
/usr/lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0xf3)[0x7f10c7864083]
[XXX:00417] [21] relion_refine_mpi(_start+0x2e)[0x55dcd5e0058e]
[XXX:00417] *** End of error message ***
[XXX:00418] [20] /usr/lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0xf3)[0x7fc069e19083]
[XXX:00418] [21] /usr/lib/x86_64-linux-gnu/libmpi.so.40(ompi_coll_base_bcast_intra_generic+0x3e4)[0x7f561b3d0c74]
[XXX:00447] [14] /usr/lib/x86_64-linux-gnu/libmpi.so.40(ompi_coll_base_bcast_intra_pipeline+0xd1)[0x7f561b3d1061]
[XXX:00447] [15] /usr/lib/x86_64-linux-gnu/openmpi/lib/openmpi3/mca_coll_tuned.so(ompi_coll_tuned_bcast_intra_dec_fixed+0x12e)[0x7f561821bdae]
[XXX:00447] [16] relion_refine_mpi(_start+0x2e)[0x55e1ef41d58e]
[XXX:00420] *** End of error message ***
/usr/lib/x86_64-linux-gnu/libmpi.so.40(MPI_Bcast+0x120)[0x7faedc4d9b10]
[XXX:00435] [17] relion_refine_mpi(_start+0x2e)[0x55c57ef0058e]
[XXX:00418] *** End of error message ***
/usr/lib/x86_64-linux-gnu/libmpi.so.40(MPI_Bcast+0x120)[0x7f561b393b10]
[XXX:00447] [17] relion_refine_mpi(_ZN7MpiNode16relion_MPI_BcastEPvlP15ompi_datatype_tiP19ompi_communicator_t+0x176)[0x562017e91566]
[XXX:00435] [18] relion_refine_mpi(_ZN14MlOptimiserMpi10initialiseEv+0x178)[0x562017e77fa8]
[XXX:00435] [19] relion_refine_mpi(main+0x71)[0x562017e2fd11]
[XXX:00435] [20] relion_refine_mpi(_ZN7MpiNode16relion_MPI_BcastEPvlP15ompi_datatype_tiP19ompi_communicator_t+0x176)[0x557e5d590566]
[XXX:00447] [18] relion_refine_mpi(_ZN14MlOptimiserMpi10initialiseEv+0x178)[0x557e5d576fa8]
[XXX:00447] [19] /usr/lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0xf3)[0x7faedbc59083]
[XXX:00435] [21] relion_refine_mpi(_start+0x2e)[0x562017e3358e]
[XXX:00435] *** End of error message ***
relion_refine_mpi(main+0x71)[0x557e5d52ed11]
[XXX:00447] [20] /usr/lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0xf3)[0x7f561ab13083]
[XXX:00447] [21] relion_refine_mpi(_start+0x2e)[0x557e5d53258e]
[XXX:00447] *** End of error message ***
@KrisJanssen
Author

No ideas from the community?

@rahelwoldeyes

rahelwoldeyes commented Jul 25, 2024

Hi @KrisJanssen, I am seeing the same error message. Were you able to fix it?
RELION version: 5.0-beta-3-commit-7a062e

@KrisJanssen
Author

@rahelwoldeyes: unfortunately not. I'm really hoping the devs or other folks with knowledge of CUDA and MPI might be able to shed some light…

@rahelwoldeyes

I hope so too. It's interesting that the problem started with commit fa8dce3. Commit 90d239e works as expected. Unfortunately, I couldn't determine the root cause.

@rahelwoldeyes

To further support this issue: the error consistently appears on our cluster regardless of whether RELION is containerized or installed directly.

Describe your problem

Refine3D consistently crashes with a segmentation fault at the beginning of the first iteration, regardless of MPI or GPU use. This occurs for datasets larger than trivially small ones (~500 particles), indicating a potential memory management issue. The dataset appears to be read six times. The problem was introduced in commit fa8dce3, as commit 90d239e works correctly.

Environment:
OS: RHEL 8.6
MPI runtime: mpirun (Open MPI) 4.1.6rc4
RELION version: 5.0-beta-3-commit-d08e4d
Memory: 11264MiB
GPU: NVIDIA GeForce RTX 2080 Ti
GPU Driver: 535.161.07
CUDA: 12.2

Dataset:
Box size: 48 px
Pixel size: 8.53 Å/px (binned 4x)
Number of particles: 4,000

Job options:
Type of job: Refine3D
Number of MPI processes: 5
Number of threads: 4
Full command:

  mpirun --mca btl self,vader,tcp --mca pml ob1 `which relion_refine_mpi` --o Refine3D/job001/run --auto_refine --split_random_halves --ios optimisation_set.star --ref init/intModel.mrc --firstiter_cc --trust_ref_size --ini_high 60 --dont_combine_weights_via_disc --pool 8 --pad 2 --ctf --particle_diameter 340 --flatten_solvent --zero_mask --oversampling 1 --healpix_order 2 --auto_local_healpix_order 4 --offset_range 5 --offset_step 2 --sym C1 --low_resol_join_halves 40 --norm --scale --j 4 --gpu "" --pipeline_control Refine3D/job001/

Error message:


 WARNING: tomogram S65-TS_002.tomostar has relion-4 definition of projection matrices; converting them now...
 WARNING: tomogram S39-TS_001.tomostar has relion-4 definition of projection matrices; converting them now...
 WARNING: tomogram S26-TS_004.tomostar has relion-4 definition of projection matrices; converting them now...
 WARNING: tomogram S26-TS_002.tomostar has relion-4 definition of projection matrices; converting them now...
 WARNING: tomogram S65-TS_002.tomostar has relion-4 definition of projection matrices; converting them now...
 WARNING: tomogram S39-TS_001.tomostar has relion-4 definition of projection matrices; converting them now...
 WARNING: tomogram S26-TS_006.tomostar has relion-4 definition of projection matrices; converting them now...
 WARNING: tomogram S26-TS_004.tomostar has relion-4 definition of projection matrices; converting them now...
 WARNING: tomogram S39-TS_005.tomostar has relion-4 definition of projection matrices; converting them now...
 WARNING: tomogram S26-TS_002.tomostar has relion-4 definition of projection matrices; converting them now...
 WARNING: tomogram S65-TS_002.tomostar has relion-4 definition of projection matrices; converting them now...
 WARNING: tomogram S39-TS_001.tomostar has relion-4 definition of projection matrices; converting them now...
 WARNING: tomogram S65-TS_002.tomostar has relion-4 definition of projection matrices; converting them now...
 WARNING: tomogram S39-TS_001.tomostar has relion-4 definition of projection matrices; converting them now...
 WARNING: tomogram S26-TS_006.tomostar has relion-4 definition of projection matrices; converting them now...
 WARNING: tomogram S26-TS_004.tomostar has relion-4 definition of projection matrices; converting them now...
 WARNING: tomogram S26-TS_004.tomostar has relion-4 definition of projection matrices; converting them now...
 WARNING: tomogram S26-TS_002.tomostar has relion-4 definition of projection matrices; converting them now...
 WARNING: tomogram S26-TS_003.tomostar has relion-4 definition of projection matrices; converting them now...
 WARNING: tomogram S39-TS_005.tomostar has relion-4 definition of projection matrices; converting them now...
 WARNING: tomogram S26-TS_002.tomostar has relion-4 definition of projection matrices; converting them now...
 WARNING: tomogram S39-TS_002.tomostar has relion-4 definition of projection matrices; converting them now...
 WARNING: tomogram S26-TS_006.tomostar has relion-4 definition of projection matrices; converting them now...
 WARNING: tomogram S65-TS_001.tomostar has relion-4 definition of projection matrices; converting them now...
 WARNING: tomogram S26-TS_006.tomostar has relion-4 definition of projection matrices; converting them now...
 WARNING: tomogram S39-TS_006.tomostar has relion-4 definition of projection matrices; converting them now...
 WARNING: tomogram S26-TS_001.tomostar has relion-4 definition of projection matrices; converting them now...
 WARNING: tomogram S39-TS_005.tomostar has relion-4 definition of projection matrices; converting them now...
 WARNING: tomogram S26-TS_003.tomostar has relion-4 definition of projection matrices; converting them now...
 WARNING: tomogram S39-TS_005.tomostar has relion-4 definition of projection matrices; converting them now...
 WARNING: tomogram S39-TS_002.tomostar has relion-4 definition of projection matrices; converting them now...
 WARNING: tomogram S26-TS_005.tomostar has relion-4 definition of projection matrices; converting them now...
 WARNING: tomogram S65-TS_001.tomostar has relion-4 definition of projection matrices; converting them now...
 WARNING: tomogram S39-TS_006.tomostar has relion-4 definition of projection matrices; converting them now...
 WARNING: tomogram S26-TS_003.tomostar has relion-4 definition of projection matrices; converting them now...
 WARNING: tomogram S26-TS_001.tomostar has relion-4 definition of projection matrices; converting them now...
 WARNING: tomogram S26-TS_003.tomostar has relion-4 definition of projection matrices; converting them now...
 WARNING: tomogram S39-TS_002.tomostar has relion-4 definition of projection matrices; converting them now...
 WARNING: tomogram S39-TS_002.tomostar has relion-4 definition of projection matrices; converting them now...
 WARNING: tomogram S65-TS_001.tomostar has relion-4 definition of projection matrices; converting them now...
 WARNING: tomogram S65-TS_001.tomostar has relion-4 definition of projection matrices; converting them now...
 WARNING: tomogram S26-TS_005.tomostar has relion-4 definition of projection matrices; converting them now...
 WARNING: tomogram S39-TS_006.tomostar has relion-4 definition of projection matrices; converting them now...
 WARNING: tomogram S39-TS_006.tomostar has relion-4 definition of projection matrices; converting them now...
 WARNING: tomogram S26-TS_001.tomostar has relion-4 definition of projection matrices; converting them now...
 WARNING: tomogram S26-TS_001.tomostar has relion-4 definition of projection matrices; converting them now...
 WARNING: tomogram S26-TS_005.tomostar has relion-4 definition of projection matrices; converting them now...
 WARNING: tomogram S26-TS_005.tomostar has relion-4 definition of projection matrices; converting them now...
 WARNING: tomogram S65-TS_002.tomostar has relion-4 definition of projection matrices; converting them now...
 WARNING: tomogram S39-TS_001.tomostar has relion-4 definition of projection matrices; converting them now...
 WARNING: tomogram S26-TS_004.tomostar has relion-4 definition of projection matrices; converting them now...
 WARNING: tomogram S26-TS_002.tomostar has relion-4 definition of projection matrices; converting them now...
 WARNING: tomogram S26-TS_006.tomostar has relion-4 definition of projection matrices; converting them now...
 WARNING: tomogram S39-TS_005.tomostar has relion-4 definition of projection matrices; converting them now...
 WARNING: tomogram S26-TS_003.tomostar has relion-4 definition of projection matrices; converting them now...
 WARNING: tomogram S39-TS_002.tomostar has relion-4 definition of projection matrices; converting them now...
 WARNING: tomogram S65-TS_001.tomostar has relion-4 definition of projection matrices; converting them now...
 WARNING: tomogram S39-TS_006.tomostar has relion-4 definition of projection matrices; converting them now...
 WARNING: tomogram S26-TS_001.tomostar has relion-4 definition of projection matrices; converting them now...
 WARNING: tomogram S26-TS_005.tomostar has relion-4 definition of projection matrices; converting them now...
 WARNING: tomogram S65-TS_002.tomostar has relion-4 definition of projection matrices; converting them now...
 WARNING: tomogram S39-TS_001.tomostar has relion-4 definition of projection matrices; converting them now...
 WARNING: tomogram S26-TS_004.tomostar has relion-4 definition of projection matrices; converting them now...
 WARNING: tomogram S26-TS_002.tomostar has relion-4 definition of projection matrices; converting them now...
 WARNING: tomogram S26-TS_006.tomostar has relion-4 definition of projection matrices; converting them now...
 WARNING: tomogram S39-TS_005.tomostar has relion-4 definition of projection matrices; converting them now...
 WARNING: tomogram S26-TS_003.tomostar has relion-4 definition of projection matrices; converting them now...
 WARNING: tomogram S39-TS_002.tomostar has relion-4 definition of projection matrices; converting them now...
 WARNING: tomogram S65-TS_001.tomostar has relion-4 definition of projection matrices; converting them now...
 WARNING: tomogram S39-TS_006.tomostar has relion-4 definition of projection matrices; converting them now...
 WARNING: tomogram S26-TS_001.tomostar has relion-4 definition of projection matrices; converting them now...
 WARNING: tomogram S26-TS_005.tomostar has relion-4 definition of projection matrices; converting them now...
[turing023:2658793] *** Process received signal ***
[turing023:2658793] Signal: Segmentation fault (11)
[turing023:2658793] Signal code: Address not mapped (1)
[turing023:2658793] Failing at address: 0x80
[turing023:2658793] [ 0] /lib/x86_64-linux-gnu/libc.so.6(+0x42520)[0x7efe733c8520]
[turing023:2658793] [ 1] /opt/relion/bin/relion_refine_mpi(+0x4190d0)[0x55c8c15480d0]
[turing023:2658793] [ 2] /opt/relion/bin/relion_refine_mpi(+0x420ca5)[0x55c8c154fca5]
[turing023:2658793] [ 3] /opt/relion/bin/relion_refine_mpi(_ZN15MlOptimiserCuda32doThreadExpectationSomeParticlesEi+0xf6)[0x55c8c1551326]
[turing023:2658793] [ 4] /opt/relion/bin/relion_refine_mpi(_Z36globalThreadExpectationSomeParticlesPvi+0x43)[0x55c8c14d5ed3]
[turing023:2658793] [ 5] /opt/relion/bin/relion_refine_mpi(+0x3a6f5c)[0x55c8c14d5f5c]
[turing023:2658793] [ 6] /lib/x86_64-linux-gnu/libgomp.so.1(GOMP_parallel+0x46)[0x7efe735e2a16]
[turing023:2658793] [ 7] /opt/relion/bin/relion_refine_mpi(_ZN11MlOptimiser24expectationSomeParticlesEll+0xd45)[0x55c8c14b5525]
[turing023:2658793] [ 8] /opt/relion/bin/relion_refine_mpi(_ZN14MlOptimiserMpi11expectationEv+0x1eaa)[0x55c8c1278a2a]
[turing023:2658793] [ 9] /opt/relion/bin/relion_refine_mpi(_ZN14MlOptimiserMpi7iterateEv+0x1d3)[0x55c8c128bd03]
[turing023:2658793] [10] /opt/relion/bin/relion_refine_mpi(main+0x85)[0x55c8c1234565]
[turing023:2658793] [11] /lib/x86_64-linux-gnu/libc.so.6(+0x29d90)[0x7efe733afd90]
[turing023:2658793] [12] /lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0x80)[0x7efe733afe40]
[turing023:2658793] [13] /opt/relion/bin/relion_refine_mpi(_start+0x25)[0x55c8c1238005]
[turing023:2658793] *** End of error message ***
--------------------------------------------------------------------------
Primary job  terminated normally, but 1 process returned
a non-zero exit code. Per user-direction, the job has been aborted.
--------------------------------------------------------------------------
--------------------------------------------------------------------------
mpirun noticed that process rank 3 with PID 2658663 on node turing023 exited on signal 11 (Segmentation fault).
--------------------------------------------------------------------------

@DrJesseHansen

I am also seeing this problem, it is still open I see.

@rahelwoldeyes

Yes, this is still an issue for me. It is good to know the problem is related to 2D extraction in Warp Linux (#1181).

@DrJesseHansen

Interesting, so your particles are also coming from Warp Linux? Is there any workaround? I tried older versions of RELION, but I can only go back so far because only the latest versions work on our latest Debian system.

p.s. I recognize your name from GRC, we chatted briefly! Cool to see familiar names popping up after that :)

@KrisJanssen
Author

@rahelwoldeyes and @DrJesseHansen: not sure about the Warp Linux angle: I can trigger the issue with generic data processed in a generic Ubuntu Docker container…

@rahelwoldeyes

@DrJesseHansen My particles are from Warp Linux. I haven't found a workaround yet. I'm unsure if downgrading to an older RELION-5 version is advisable, given the bug fixes and improvements in newer versions. @KrisJanssen Thanks for pointing out that this isn't a Warp Linux-specific problem.

> p.s. I recognize your name from GRC, we chatted briefly! Cool to see familiar names popping up after that :)

@DrJesseHansen I remember! It's nice to see you here. The GRS/GRC is a great way to connect with people.

@DrJesseHansen

I figured out how to partially get around this issue. It turns out there were some bad particles in the dataset. I'm guessing this is a Warp issue, since I extracted the exact same particles in the RELION pipeline and had no problem. Anyway, I split my particles into 10 subsets and refined each: 3 failed with this same error, 7 ran fine. Splitting into smaller subsets helps reduce the "waste" when throwing out the bad particles. I still have no idea what is wrong with the problematic particles that causes the crash.
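
In case it is useful to others, the splitting itself only takes a few lines of Python. This is just a sketch using the third-party starfile package (not necessarily how I did it); the file names and number of subsets are placeholders, and it assumes a RELION 3.1+ style particle file with an optics and a particles block:

    import starfile

    def split_star(in_star, n_subsets=10, prefix="subset"):
        # read both blocks of the particle star file
        data = starfile.read(in_star, always_dict=True)
        optics, particles = data["optics"], data["particles"]
        per_subset = -(-len(particles) // n_subsets)  # ceiling division
        for i in range(n_subsets):
            part = particles.iloc[i * per_subset:(i + 1) * per_subset].reset_index(drop=True)
            if len(part) == 0:
                continue
            # each subset keeps the full optics table so it remains a valid particle file
            starfile.write({"optics": optics, "particles": part}, f"{prefix}_{i + 1:02d}.star")

    split_star("particles.star", n_subsets=10)

Refining each subset separately then narrows down which chunk contains the bad particles.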

Now when I run the jobs I am getting a different error, haha. This one is about "corrupted size vs previous size" (issue #794).

P.S. I am running RELION 5 commit 6331fe.

@KrisJanssen
Author

@scheres: might you know what could have happened between 90d239e and fa8dce3 to cause this?

@scheres
Contributor

scheres commented Sep 2, 2024

Hmm, could someone try the following: line 1534 of src/ml_optimiser.cpp in the new version is this:

if (do_write_data && (mymodel.data_dim == 3 || mydata.is_tomo) )

And might need to be changed to this:

if (do_write_data && !optimisationSet.isEmpty() && (mymodel.data_dim == 3 || mydata.is_tomo) )

Please recompile the code and test again. Not sure how this is related to particles from Warp, but let's give it a go...

@rahelwoldeyes

@scheres Thanks for the suggestion, but it didn't fix the problem for me. I get the same error message. How did it go for you @KrisJanssen @DrJesseHansen?

@scheres
Contributor

scheres commented Sep 3, 2024

Could you then carefully compare the star files you get from warp with those you get when extracting particles in relion?

@DrJesseHansen

hi,

Comparing the star files is interesting. The optics tables are identical except that Warp has the amplitude contrast at 0.07 and RELION at 0.1. The data tables, though, have some differences (see image below):

  • The main difference seems to be in the way the angles are handled. In Warp, for some reason, the angles are in the THOUSANDS. Also, the angles are defined in Warp by rlnAngleRot/Tilt/Psi rather than rlnTomoSubtomogramRot/Tilt/Psi.

  • The Warp file has rlnCoordinateXYZ; RELION of course has rlnCenteredCoordinateXYZAngst. I have a script to convert the former to the latter (a rough sketch of the conversion is below), and the coordinates do match up (except for Z, because in Warp Linux the physical handedness got flipped, so I inverted the Z coordinates).
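
For what it's worth, the conversion is roughly the following. This is a sketch, not my exact script; the tomogram dimensions and pixel size are placeholders you have to supply for your own data, and the Z flip reflects the handedness issue mentioned above, so check a few particles by eye before trusting it:

    def centered_coords_angst(x_px, y_px, z_px, dims_px, angpix, flip_z=True):
        # dims_px = (nx, ny, nz) of the tomogram in pixels, angpix = pixel size in Angstrom
        cx, cy, cz = (d / 2.0 for d in dims_px)
        if flip_z:
            # Warp Linux volumes had flipped handedness for me, hence the inverted Z
            z_px = dims_px[2] - z_px
        return ((x_px - cx) * angpix, (y_px - cy) * angpix, (z_px - cz) * angpix)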

I should also mention that if I generate 3D subvolumes in Warp and then use relion_reconstruct, I actually do get a reasonable-looking reconstruction; however, when I do the same with 2D images and relion_tomo_reconstruct_particle (see command below), it outputs merged.mrc as an empty volume.

relion_tomo_reconstruct_particle --i reextracted_bin8_2D_optimisation_set.star --theme classic --o Reconstruct/job001/ --b 40 --bin 8 --j 1 --j_out 1 --j_in 1 --sym C1

[image: comparison of the Warp and RELION star file data tables]

Jesse

@rkjensen

rkjensen commented Sep 11, 2024

@DrJesseHansen

There was an issue with Euler angle conversion in WarpTools. It has been fixed with 2.0.0dev26, and I at least now get the same angles with RELION 5 and WarpTools (warpem/warp#227).

@rkjensen

I still get the error though, with the right Euler angles. Does anyone have ideas about a smart way to check which tomogram the offending particle is from? (I have >350 tomograms, and I don't want to run too many tests.)

@rkjensen

@rahelwoldeyes @DrJesseHansen @KrisJanssen
I found the issue, at least for me: when you have particles in the blurred corners of your tomogram, _rlnTomoVisibleFrames ends up as [0, 0, ..., 0], which I don't think RELION can handle.

I ran a script removing all particles that are found on only 0 or 1 tilts, and afterwards it worked for me. See also: warpem/warp#243
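
For reference, the script does roughly the following. This is a sketch using the third-party starfile package rather than my exact code; the file names are placeholders, and it assumes the particles block stores _rlnTomoVisibleFrames as a bracketed list of 0/1 flags:

    import re
    import starfile

    def n_visible_tilts(entry):
        # entries look like "[1,1,0,1]"; count the tilts the particle is visible in
        return sum(int(v) for v in re.findall(r"\d+", str(entry)))

    data = starfile.read("particles.star", always_dict=True)
    particles = data["particles"]  # starfile drops the leading underscore from labels
    bad = particles["rlnTomoVisibleFrames"].apply(n_visible_tilts) <= 1
    if "rlnTomoName" in particles:
        # shows which tomograms the offending particles come from
        print(particles.loc[bad, "rlnTomoName"].value_counts())
    print(f"removing {bad.sum()} of {len(particles)} particles visible in 0 or 1 tilts")
    data["particles"] = particles.loc[~bad].reset_index(drop=True)
    starfile.write(data, "particles_cleaned.star")

The per-tomogram counts also help answer my earlier question about which tomograms the offending particles sit in.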

@DrJesseHansen

DrJesseHansen commented Sep 11, 2024

WOW, this worked! Thank you! Amazing... Can I ask, how did you figure this out?
Also, what exactly does "_rlnTomoVisibleFrames" refer to?

@rkjensen

I think _rlnTomoVisibleFrames indicates in which of your tilts the particle is seen.

How I figured it out was a bit of a coincidence... I was just looking through the star file and saw that I had a particle (luckily in the second tomogram from the top) that had all zeroes in the _rlnTomoVisibleFrames list, whereas most particles had all ones, and thought this might be the issue. It worked for my first star file, but not for the second. So for that one I also removed the particles seen in only one tilt, and that seems to do the trick.

@rahelwoldeyes

Wow, thank you @rkjensen and everyone! It works!!

@DrJesseHansen

hi all, just want to update here.

This does seem to allow the job to run, but the result is still suboptimal and I can't figure out why.

When I extract 3D volumes rather than 2D, I get the following outcome: I can do a relion_reconstruct and it gives a very reasonable-looking volume, suggesting the data are fine. When I try to refine these 3D subvolumes, I get an error I have not seen in 5 years (#582).

If instead I extract 2D images, I get the following outcome: relion_reconstruct_particle gives an empty volume, which is already not a good sign. When I do a 3D refine, my map gets progressively worse; the reference turns fainter with each iteration, and after iteration 5 it's an empty volume.

any thoughts?

best

Jesse

@DrJesseHansen

update: I figured it out.

When I run the 2D particles through a Refine3D job, it always gives the weird result where the density eventually disappears, but if I run a 3D classification with a single class, it works. Not sure why this happens, but it did the trick. Now I can use the output particles and volume from the 3D classification to continue with subsequent steps.
