mpiexec runs much slower than OpenMPI in Github Actions #6037

Closed
wkliao opened this issue May 31, 2022 · 28 comments

@wkliao
Contributor

wkliao commented May 31, 2022

I run two GitHub Actions workflows for PnetCDF: one uses MPICH and the other OpenMPI. Both run many small 4-process jobs (command: make ptest), and all MPI jobs run on the same local host node. The MPICH version is always much slower than OpenMPI. The times of the latest runs show:

I have also tried building MPICH 4.0.2 from source as part of the workflow, but got similar timing results.

Is there a way to speed up mpiexec?

@hzhou
Contributor

hzhou commented May 31, 2022

I have two tips:

  • export HWLOC_XMLFILE=path/to/hwloc.xml
    MPICH probes the hardware from every process, which can be very slow at init time. Supplying a pre-generated hwloc config XML file should bypass it. https://www.open-mpi.org/projects/hwloc/doc/v2.3.0/a00353.php

  • If GPU awareness is not required, disable it, e.g. with --without-cuda.
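A minimal sketch of how both tips might be applied in the workflow; the install prefix, hwloc.xml location, and test binary are assumptions, not taken from this issue:

  # Build MPICH without GPU awareness (assumed prefix path):
  ./configure --prefix=$HOME/mpich-install --without-cuda
  make -j 2 && make install
  # At run time, point hwloc at a pre-generated topology file (created e.g. with lstopo, see below):
  export HWLOC_XMLFILE=$HOME/hwloc.xml
  mpiexec -n 4 ./some_test    # hypothetical test binary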

@wkliao
Contributor Author

wkliao commented May 31, 2022

Could you please help me with an example for creating the xml file?
I wonder how come OpenMPI does not have such a problem.

@hzhou
Contributor

hzhou commented May 31, 2022

Could you please help me with an example for creating the xml file?
I believe if you build and install hwloc inside the GitHub Actions runner, you can run

lstopo --of xml hwloc.xml

to create the xml file. For reference: https://linux.die.net/man/1/lstopo

I wonder how come OpenMPI does not have such a problem

I think OpenMPI only discovers hardware in mpiexec and sends the hardware topology to the processes via PMIx. Ken implemented a similar workaround in #5929. Do you run the main branch of MPICH or an older release?
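A rough sketch of a workflow step that follows this suggestion, assuming the Ubuntu hwloc package (which ships lstopo) and the standard GITHUB_WORKSPACE/GITHUB_ENV variables:

  sudo apt-get update && sudo apt-get install -y hwloc
  lstopo --of xml "$GITHUB_WORKSPACE/hwloc.xml"                       # dump the runner's topology once
  echo "HWLOC_XMLFILE=$GITHUB_WORKSPACE/hwloc.xml" >> "$GITHUB_ENV"   # make it visible to later steps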

@wkliao
Contributor Author

wkliao commented May 31, 2022

The HWLOC_XMLFILE approach does not improve things. I am still getting a similarly long time.
I am building MPICH 4.0.2.

@hzhou
Contributor

hzhou commented May 31, 2022

Could you try both ch3 and ch4 devices and see if there is any difference in timing?
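For context, the two builds being compared would look roughly like this (the install prefixes are assumptions):

  ./configure --prefix=$HOME/mpich-ch3 --with-device=ch3    # legacy ch3 device
  ./configure --prefix=$HOME/mpich-ch4 --with-device=ch4    # ch4 device (ch4:ofi with embedded libfabric, per the output below)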

@wkliao
Contributor Author

wkliao commented May 31, 2022

I used --with-device=ch4 and make hangs at Making install in modules/yaksa.

*****************************************************
***
*** device      : ch4:ofi (embedded libfabric)
*** shm feature : auto
*** gpu support : disabled
***
  MPICH is configured with device ch4:ofi, which should work
  for TCP networks and any high-bandwidth interconnect
  supported by libfabric. MPICH can also be configured with
  "--with-device=ch4:ucx", which should work for TCP networks
  and any high-bandwidth interconnect supported by the UCX
  library. In addition, the legacy device ch3 (--with-device=ch3)
  is also available.
*****************************************************
Configuration completed.
Making install in src/mpl
ar: `u' modifier ignored since `D' is the default (see `U')
Making install in /home/runner/work/PnetCDF/PnetCDF/MPICH/mpich-4.0.2/modules/hwloc
Making install in include
Making install in hwloc
ar: `u' modifier ignored since `D' is the default (see `U')
Making install in /home/runner/work/PnetCDF/PnetCDF/MPICH/mpich-4.0.2/modules/json-c
Making install in .
ar: `u' modifier ignored since `D' is the default (see `U')
Making install in tests
Making install in modules/yaksa

@hzhou
Contributor

hzhou commented May 31, 2022

Did you somehow hide the lines such as CC xxx.lo? make in yaksa is going to take a while; are you sure it is hanging?

@wkliao
Contributor Author

wkliao commented Jun 1, 2022

I re-ran the workflow to build MPICH with ch4, which took 31 minutes.
PnetCDF's make ptest is still slow, taking 1 hour 9 minutes.

@hzhou
Contributor

hzhou commented Jun 1, 2022

I just tried building PnetCDF on my local Linux machine, and make ptest only took 38 seconds. @wkliao Could you check the attached make log and see if I am missing anything?
t.log

@wkliao
Contributor Author

wkliao commented Jun 1, 2022

It also runs very fast on my local Red Hat machine.
The issue is that it runs very slowly in the GitHub Actions workflow.

@hzhou
Contributor

hzhou commented Jun 1, 2022

Do you have a link to a recent GitHub Actions test log?

@wkliao
Contributor Author

wkliao commented Jun 1, 2022

See below. You can also check their yml files for the configure options I used.

@hzhou
Contributor

hzhou commented Jun 1, 2022

Somehow every test seems to run twice:

 make[2]: Entering directory '/home/runner/work/PnetCDF/PnetCDF/test/C'
===========================================================
    test/C: Parallel testing on 4 MPI processes
===========================================================
*** TESTING C   pres_temp_4D_wr for writing classic file           ------ pass
*** TESTING C   pres_temp_4D_rd for reading classic file           ------ pass
*** TESTING C   pres_temp_4D_wr for writing classic file           ------ pass
*** TESTING C   pres_temp_4D_rd for reading classic file           ------ pass
make[2]: Leaving directory '/home/runner/work/PnetCDF/PnetCDF/test/C'

@wkliao
Contributor Author

wkliao commented Jun 1, 2022

This is because the configure option --enable-burst_buffering is used. It runs each test twice: once using the burst-buffering feature and once without it.

What puzzles me is that the same configure settings are used for both OpenMPI and MPICH, yet MPICH is much slower than OpenMPI.
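For reference, a sketch of the PnetCDF configuration and test invocation being discussed, assuming PnetCDF's configure accepts the MPICC variable and using an assumed compiler-wrapper path:

  ./configure --enable-burst_buffering MPICC=$HOME/mpich-install/bin/mpicc
  make -j 2
  make ptest    # runs each parallel test twice: with and without burst buffering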

@hzhou
Contributor

hzhou commented Jun 1, 2022

Even OpenMPI's 15 minutes is a big puzzle if the same tests run on a local computer in less than a minute. That's 15x, a much bigger puzzle. Any ideas?

@wkliao
Contributor Author

wkliao commented Jun 1, 2022

I believe it is because GitHub Actions runs in a virtual environment.

@hzhou
Contributor

hzhou commented Jun 1, 2022

I believe it is because GitHub Actions runs in a virtual environment.

I believe virtualization nowadays is pretty good, i.e. I would not expect it to slow things down by more than 2x. Looking at the compilation time, it seems reasonably fast.

@wkliao
Contributor Author

wkliao commented Jun 1, 2022

If that is the case, then I have no idea why GitHub Actions is slower.
The issue remains: OpenMPI's mpiexec runs faster than MPICH's.

@hzhou
Contributor

hzhou commented Jun 1, 2022

Is there a way to show the timestamp of each test?

@wkliao
Contributor Author

wkliao commented Jun 1, 2022

Yes. In the top-right corner of the log view, click the gear icon and enable show timestamps and full screen.

@raffenet
Contributor

Just catching up here. GitHub Actions runners only have 2 virtual cores. The oversubscription might be really hurting. Could you test using the --with-device=ch3:sock configuration and see how it performs?

Also, I guess I should try to clarify: why do we think that mpiexec is the slow part? Is there some additional breakdown of time spent showing that it's mpiexec that is slow? Or are we just comparing MPICH vs. OpenMPI and their associated launchers?
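A sketch of the suggested build (the install prefix is an assumption); the point of ch3:sock is that it does not busy-poll while waiting for messages, so an oversubscribed 2-core runner is not starved:

  ./configure --prefix=$HOME/mpich-sock --with-device=ch3:sock
  make -j 2 && make install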

@wkliao
Contributor Author

wkliao commented Jun 29, 2022

The trick of --with-device=ch3:sock seems to work.
The time has been significantly reduced from 1 hour 30 minutes to 19 minutes.

@raffenet
Contributor

raffenet commented Jun 30, 2022

The trick of --with-device=ch3:sock seems to work. The time has been significantly reduced from 1 hour 30 minutes to 19 minutes.

Thanks for confirming. This is another piece of evidence supporting adding a configuration in ch4 that can run without busy-polling, whether that is ch4:sock or something else.

@wkliao
Contributor Author

wkliao commented Jun 30, 2022

I will be happy to do some profiling. Let me know.
Otherwise, I can close this issue. Thanks!

@scottwittenburg

This is another piece of evidence supporting adding a configuration in ch4 that can run without busy-polling.

Is there now another combination with ch4 that runs without busy-polling? After we saw the suggestion here, we tried an MPICH build with ch3:sock:tcp, and it cut our testing time in CI by more than half.

cc: @vicentebolea

@hzhou
Contributor

hzhou commented Nov 8, 2023

Is there now another combination with ch4 that runs without busy-polling?

No. It is still sitting on our TODO list.

@scottwittenburg

Is there now another combination with ch4 that runs without busy-polling?

No. It is still sitting on our TODO list.

Ok, thanks, just checking.
