Problem importing bmtk.analyzer.compartment #327

Open
moravveji opened this issue Sep 20, 2023 · 2 comments

@moravveji

I have pip-installed BMTK version 1.0.8 on our HPC cluster, which runs Rocky Linux 8 on Intel Ice Lake CPUs.
When I start an interactive job with 16 tasks, importing the bmtk.analyzer.compartment module fails:

$ nproc
16
$ module use /apps/leuven/rocky8/icelake/2022b/modules/all
$ module load BMTK/1.0.8-foss-2022b
$ python
Python 3.10.8 (main, Jul 13 2023, 22:10:28) [GCC 12.2.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import bmtk
>>> import bmtk.analyzer.compartment
[m28c27n1:3237025] OPAL ERROR: Unreachable in file ext3x_client.c at line 112
--------------------------------------------------------------------------
The application appears to have been direct launched using "srun",
but OMPI was not built with SLURM's PMI support and therefore cannot
execute. There are several options for building PMI support under
SLURM, depending upon the SLURM version you are using:

  version 16.05 or later: you can use SLURM's PMIx support. This
  requires that you configure and build SLURM --with-pmix.

  Versions earlier than 16.05: you must use either SLURM's PMI-1 or
  PMI-2 support. SLURM builds PMI-1 by default, or you can manually
  install PMI-2. You must then build Open MPI using --with-pmi pointing
  to the SLURM PMI library location.

Please configure as appropriate and try again.
--------------------------------------------------------------------------
*** An error occurred in MPI_Init_thread
*** on a NULL communicator
*** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
***    and potentially your MPI job)
[m28c27n1:3237025] Local abort before MPI_INIT completed completed successfully, but am not able to aggregate error messages, and not able to guarantee that all other processes were killed!

I built BMTK/1.0.8-foss-2022b (and all its dependencies) against the OpenMPI/4.1.4-GCC-12.2.0 module. However, this particular OpenMPI module is not built with Slurm (PMI) support, which is why parallel applications launched with srun emit the OPAL error message above.
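
(A quick way to double-check this, assuming ompi_info and srun are on the PATH; the exact output depends on how OpenMPI and Slurm were configured on your site:)

$ ompi_info | grep -i pmi   # lists the PMI/PMIx components this OpenMPI build provides
$ srun --mpi=list           # lists the MPI/PMI plugin types this Slurm installation supports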

Is there an environment variable that controls how the tasks are launched, so that I can use mpirun directly instead of srun?

@kaeldai
Collaborator

kaeldai commented Sep 21, 2023

Hi @moravveji, BMTK itself does not directly call srun or mpirun. It uses the standard mpi4py library, which relies on your locally installed version of OpenMPI. We've run large bmtk simulations using both Moab/Torque and Slurm, although how to actually launch them differs from cluster to cluster.
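
As far as I can tell, the reason the bare import crashes is that bmtk.analyzer.compartment pulls in mpi4py, and importing mpi4py.MPI initializes MPI right away, so the PMI lookup happens before any BMTK code runs. A minimal sketch that should reproduce the same failure without BMTK (assuming mpi4py is importable in your environment):

# check_mpi.py -- hypothetical test script, no BMTK involved
from mpi4py import MPI            # MPI initialization is triggered here, at import time

# If initialization succeeded, each process reports its rank.
comm = MPI.COMM_WORLD
print("rank %d of %d" % (comm.Get_rank(), comm.Get_size()))

Running this under srun on your setup should fail with the same OPAL/PMI error, while launching it with mpirun should be fine.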

One thing to try is to put your code into a Python script and run it directly from the prompt using mpirun (or mpiexec), e.g.

$ mpirun -np 16 python my_bmtk_script.py

Unfortunately, whatever you do will no longer be interactive, and I don't think you can start up a shell using mpirun (or at least I've never seen it done). If you're using Moab I think you can use the qsub -I option to get an interactive shell, but I haven't tried it myself.

Another option is to use or compile a different version of OpenMPI. If you have access to Anaconda, it might be worth creating a test environment and installing OpenMPI/MPICH2 there. I believe the installer tries to find the appropriate workload-manager options on the system, and if there is a Slurm manager on your HPC it will install with PMI support, although in my experience this doesn't always work, especially if Slurm is installed in a non-standard way.
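
Something along these lines might be a starting point (the environment name is arbitrary and the package names are the usual conda-forge ones; channels and versions on your site may differ):

$ conda create -n bmtk-test -c conda-forge python=3.10 openmpi mpi4py   # or mpich instead of openmpi
$ conda activate bmtk-test
$ pip install bmtk
$ mpirun -np 2 python -c "from mpi4py import MPI; print(MPI.COMM_WORLD.Get_rank())"

If the last command prints two ranks, mpi4py and the MPI runtime are wired up correctly and you can try the bmtk import next.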

@moravveji
Author

Thanks @kaeldai for your comments.
I can already share a few observations based on our recent trial-and-error tests:

  • When installing bmtk via conda, the MPICH2 implementation of MPI is downloaded by default, and it does not pick up the local scheduler (Slurm).
  • The mpi4py from the Intel channel does pick up Slurm correctly. However, the dependency requirements of the other tools in the bmtk environment could not be fully satisfied, because not all of the necessary packages are consistently available from the Intel (ana)conda channel; so that was a no-go for us.
  • Instead, I imported the bmtk.analyzer.compartment package from a batch job (i.e. submitted with sbatch). This time the OpenMPI runtime spawns the processes properly and the error above no longer appears. The reason is that our Slurm build does support PMI-2, but our OpenMPI was not configured to use PMI; as a result, interactive jobs/tasks launched via srun fail with the error message above, while batch jobs that launch with mpirun work (a minimal batch script is sketched after this list).
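
For illustration, a batch script along these lines shows the working setup (the module paths are the ones from above; the script name and resource options are placeholders):

#!/bin/bash -l
#SBATCH --ntasks=16
#SBATCH --time=00:30:00

module use /apps/leuven/rocky8/icelake/2022b/modules/all
module load BMTK/1.0.8-foss-2022b

# Launch with OpenMPI's own mpirun rather than srun, since this
# OpenMPI build was not configured with PMI support.
mpirun -np 16 python my_bmtk_script.py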

So the take-home message is to avoid using bmtk in an interactive srun session (when OpenMPI is not compiled with PMI-2/PMIx support).
