
BUG: spatial: long hangs using Qhull in SciPy 1.13.0, in Colour CI #20613

Open
KelSolaar opened this issue Apr 29, 2024 · 21 comments
Labels
defect (A clear bug or issue that prevents SciPy from being installed or used as expected), scipy.spatial

Comments

@KelSolaar

KelSolaar commented Apr 29, 2024

Describe your issue.

Hello,

We have seen our Windows GitHub Actions build times jump from 15m to 4h15m after what seems to be an update from SciPy 1.12.0 to SciPy 1.13.0:

15m: https://github.com/colour-science/colour/actions/runs/8489605040/job/23259884769
4h15m: https://github.com/colour-science/colour/actions/runs/8845206377/job/24288653119

All the affected functions touch Qhull via our colour.volume.is_within_mesh_volume definition:

2024-04-26T12:28:33.7956840Z ============================= slowest 5 durations =============================
2024-04-26T12:28:33.7957585Z 9020.65s call     colour/volume/tests/test_spectrum.py::TestIsWithinVisibleSpectrum::test_is_within_visible_spectrum
2024-04-26T12:28:33.7958454Z 8888.35s call     colour/volume/spectrum.py::colour.volume.spectrum.is_within_visible_spectrum
2024-04-26T12:28:33.7959562Z 4397.23s call     colour/volume/tests/test_rgb.py::TestRGB_colourspaceVisibleSpectrumCoverageMonteCarlo::test_RGB_colourspace_visible_spectrum_coverage_MonteCarlo
2024-04-26T12:28:33.7960670Z 3785.15s call     colour/volume/rgb.py::colour.volume.rgb.RGB_colourspace_visible_spectrum_coverage_MonteCarlo
2024-04-26T12:28:33.7961663Z 713.57s call     colour/volume/tests/test_pointer_gamut.py::TestIsWithinPointerGamut::test_is_within_pointer_gamut
# Needs NumPy and SciPy; EPSILON, ArrayLike and NDArrayFloat come from Colour's
# constants and type-hint modules.
import numpy as np
from scipy.spatial import Delaunay


def is_within_mesh_volume(
    points: ArrayLike, mesh: ArrayLike, tolerance: float = 100 * EPSILON
) -> NDArrayFloat:
    """
    Return whether given points are within given mesh volume using Delaunay
    triangulation.

    Parameters
    ----------
    points
        Points to check if they are within ``mesh`` volume.
    mesh
        Points of the volume used to generate the Delaunay triangulation.
    tolerance
        Tolerance allowed in the inside-triangle check.

    Returns
    -------
    :class:`numpy.ndarray`
        Whether given points are within given mesh volume.

    Examples
    --------
    >>> mesh = np.array(
    ...     [
    ...         [-1.0, -1.0, 1.0],
    ...         [1.0, -1.0, 1.0],
    ...         [1.0, -1.0, -1.0],
    ...         [-1.0, -1.0, -1.0],
    ...         [0.0, 1.0, 0.0],
    ...     ]
    ... )
    >>> is_within_mesh_volume(np.array([0.0005, 0.0031, 0.0010]), mesh)
    array(True, dtype=bool)
    >>> a = np.array([[0.0005, 0.0031, 0.0010], [0.3205, 0.4131, 0.5100]])
    >>> is_within_mesh_volume(a, mesh)
    array([ True, False], dtype=bool)
    """

    triangulation = Delaunay(mesh)

    simplex = triangulation.find_simplex(points, tol=tolerance)
    simplex = np.where(simplex >= 0, True, False)

    return simplex

I saw a reference to a Qhull-related change on the mailing list here: https://mail.python.org/archives/list/[email protected]/thread/V25I4CIN556ZKKBIA2Q2OLVFGVYVYDEB/ but not in the release notes on GitHub.

Any pointers would be appreciated; we will probably ban this version.

Cheers,

Thomas

Reproducing Code Example

Hard to reproduce; it would require running Colour's tests above on Windows using 1.13.0.
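
As a starting point, here is a hypothetical standalone sketch of the hot path above (the Delaunay construction plus find_simplex query) that could be timed under 1.12.0 and 1.13.0; the mesh and point counts are made up:

import time

import numpy as np
from scipy.spatial import Delaunay

rng = np.random.default_rng(42)
mesh = rng.random((1000, 3))       # stand-in for the spectral locus vertices
points = rng.random((100_000, 3))  # stand-in for the Monte Carlo sample points

start = time.perf_counter()
triangulation = Delaunay(mesh)
inside = triangulation.find_simplex(points, tol=100 * np.finfo(np.float64).eps) >= 0
print(f"{inside.sum()} points inside, {time.perf_counter() - start:.2f} s")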

Error message

N/A

SciPy/NumPy/Python version and system information

1.13.0 vs 1.12.0 on Windows
@KelSolaar KelSolaar added the defect label Apr 29, 2024
@lucascolley
Member

hmm, nothing obvious sticks out to me in the git history. Maybe gh-20337 @ev-br ?

@lucascolley lucascolley added the scipy.spatial and Build issues (Issues with building from source, including different choices of architecture, compilers and OS) labels Apr 29, 2024
@ev-br
Member

ev-br commented Apr 30, 2024

Maybe gh-20337?

No way it can explain the runtime change from 15min to 4 hrs.

Something else must have changed. I'd first verify the "new" run time is stable, then try to replicate locally. Then bisect.

@KelSolaar
Author

KelSolaar commented Apr 30, 2024 via email

@KelSolaar
Author

KelSolaar commented Apr 30, 2024

Confirming it is coming from 1.13.0: I banned the version and did a fresh build in ~15 min using 1.12.0: https://github.com/colour-science/colour/actions/runs/8890511725/job/24410682802

@ev-br
Member

ev-br commented Apr 30, 2024

  1. Do you see it on other platforms, or can you confirm it's Windows-specific?

  2. Can you reproduce this locally?

@KelSolaar
Author

Hey @ev-br,

It is Windows-only, and I cannot reproduce it locally as I'm developing on macOS.

@ev-br
Member

ev-br commented Apr 30, 2024

Just to confirm: with scipy 1.13 the Windows CI run takes 4hrs and MacOS and Linux runs take 15 mins, correct?

Also, with scipy 1.12, all of them are 15 mins?


It's likely much faster to debug in a Windows VM, if only to see whether the issue is specific to the GH runners or to Windows in general.

@lucascolley lucascolley changed the title BUG: Windows Github Actions build times jumping from 15m to 4h15m after update from Scipy 1.12.0 to Scipy 1.13.0 BUG: very long Windows GHA build in SciPy 1.13.0, using Qhull Apr 30, 2024
@lucascolley lucascolley changed the title BUG: very long Windows GHA build in SciPy 1.13.0, using Qhull BUG: spatial: very long Windows GHA build in SciPy 1.13.0, using Qhull Apr 30, 2024
@KelSolaar
Author

Correct yes!

@ilayn
Member

ilayn commented Apr 30, 2024

Is this a wheel install, or are you building it yourself in the CI?

@KelSolaar
Author

It is a wheel install; SciPy is one of our dependencies.

@lucascolley
Member

Right. To clarify, the installation is fast, but the test suite takes 4 hours.

@lucascolley lucascolley removed the Build issues label Apr 30, 2024
@lucascolley lucascolley changed the title BUG: spatial: very long Windows GHA build in SciPy 1.13.0, using Qhull BUG: spatial: long hangs using Qhull in SciPy 1.13.0, in Colour CI Apr 30, 2024
@tylerjereddy
Contributor

There was a substantial bump in the vendored OpenBLAS version between 1.12.0 and 1.13.0, of course. I suppose that's also worth considering, especially since the 0.3.26.dev version specifically addressed a concurrency issue on Windows. Having a way to bisect with a simple reproducer, as Evgeni alluded to, would of course be great.

@rgommers
Member

Agreed that the OpenBLAS version is the prime candidate. We have nightlies that were just upgraded from 0.3.26.dev to 0.3.27, which fixes a known regression that can cause this kind of performance problem. Worth trying with those: https://anaconda.org/scientific-python-nightly-wheels/scipy

@KelSolaar
Author

Hello,

I built a repro here: https://github.com/KelSolaar/qhull-regression

Doing so, I narrowed the issue down to what I think is concurrent use of Qhull during distributed unit tests with pytest-xdist. If I disable concurrent execution, the tests are fast again, but as soon as is_within_mesh_volume is used by two workers on Windows, things start to slow down dramatically.
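
For illustration, a minimal sketch of the kind of test that triggers it (hypothetical, not the actual code in the repro repository); run with pytest -n 2 on Windows:

# test_qhull_concurrent.py -- hypothetical minimal repro, run with `pytest -n 2`.
# Each pytest-xdist worker builds a Delaunay triangulation and queries it on a
# large batch of random points, mirroring colour.volume.is_within_mesh_volume.
import numpy as np
import pytest
from scipy.spatial import Delaunay


@pytest.mark.parametrize("seed", [0, 1])
def test_find_simplex_under_xdist(seed):
    rng = np.random.default_rng(seed)
    mesh = rng.random((500, 3))
    points = rng.random((200_000, 3))

    inside = Delaunay(mesh).find_simplex(points) >= 0

    assert inside.shape == (200_000,)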

@rgommers
Member

rgommers commented Jun 23, 2024

Ah, that is really helpful. That should then have been fixed by gh-20619 (EDIT: or a newer OpenBLAS version in the 1.14.0 wheels). Could you try with 1.14.0rc2 to see if the problem goes away?

@KelSolaar
Author

Are there any wheels on test.pypi or the like?

@rgommers
Member

They're on PyPI, like all beta/RC releases usually are. You can install with pip install scipy --pre.

@KelSolaar
Author

@nickodell
Member

nickodell commented Jul 5, 2024

Hello @KelSolaar,

Would you mind trying to set the environment variable OMP_NUM_THREADS=1 in your Windows CI?

Let me explain why I think this might help.

I tried running your example on my Windows computer, and I noticed something odd: I had limited pytest to 2 workers with pytest -n 2, but Task Manager said it was using 100% of the available CPU. This can sometimes indicate that a program is using thread parallelism in addition to the process-level parallelism provided by xdist.
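
(As an aside, one way to inspect which native thread pools are loaded and how many threads they use is threadpoolctl's threadpool_info(); a minimal sketch, not part of the measurement below:)

# Minimal sketch: list the BLAS/OpenMP thread pools active in the process and
# their current thread counts (requires `pip install threadpoolctl`).
import numpy as np  # importing NumPy/SciPy loads their bundled OpenBLAS
from threadpoolctl import threadpool_info

for pool in threadpool_info():
    print(pool["user_api"], pool["internal_api"], pool["num_threads"])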

To check this, I tried setting OMP_NUM_THREADS to 1, and it ran much faster:

(qhull) PS C:\Users\Nick\programming\qhull-regression> pytest -n 2
================================================= test session starts =================================================
platform win32 -- Python 3.12.4, pytest-8.2.2, pluggy-1.5.0
rootdir: C:\Users\Nick\programming\qhull-regression
configfile: pyproject.toml
plugins: xdist-3.6.1
2 workers [2 items]
..                                                                                                               [100%]
================================================= 2 passed in 13.69s ==================================================
(qhull) PS C:\Users\Nick\programming\qhull-regression> $env:OMP_NUM_THREADS=1
(qhull) PS C:\Users\Nick\programming\qhull-regression> pytest -n 2
================================================= test session starts =================================================
platform win32 -- Python 3.12.4, pytest-8.2.2, pluggy-1.5.0
rootdir: C:\Users\Nick\programming\qhull-regression
configfile: pyproject.toml
plugins: xdist-3.6.1
2 workers [2 items]
..                                                                                                               [100%]
================================================== 2 passed in 4.98s ==================================================
(qhull) PS C:\Users\Nick\programming\qhull-regression>

If setting OMP_NUM_THREADS to 1 globally is not an option, an alternative is threadpoolctl, a library that provides context managers for turning thread-level parallelism on and off around particular blocks of code. This could be used to selectively disable parallelism within particular tests, or even within parts of your library where parallelism is known not to help.

Example of one of your tests adapted to use threadpoolctl:

from qhull_regression.common import *
from threadpoolctl import threadpool_limits


def test_RGB_colourspace_volume_coverage_MonteCarlo():
    with threadpool_limits(limits=1, user_api='blas'):
        np.testing.assert_allclose(
            RGB_colourspace_volume_coverage_MonteCarlo(int(10e3)),
            47.25326992,
        )

By changing both of your tests to limit thread-level parallelism, I get similar performance to the OMP_NUM_THREADS option.

Could you try one of those, and report whether that helps?

@KelSolaar
Author

Hey @nickodell,

Thanks a ton! Confirming that setting the environment variable worked:

Status: Success
Total duration: 1m 34s (https://github.com/KelSolaar/qhull-regression/actions/runs/9814531816/usage)

I think that in our use case, it makes sense to disable OMP parallelism globally for the unit tests. I'm not sure where this information should go, or whether it is already in the documentation, but if not, it might be worth adding somewhere, as this is a gold nugget!
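
For reference, a minimal sketch of how that could also be done from conftest.py with threadpoolctl instead of an environment variable (hypothetical, not something we have merged):

# conftest.py -- hypothetical sketch: cap all native thread pools (OpenBLAS,
# OpenMP) at one thread for the whole test session of each pytest-xdist worker.
import pytest
from threadpoolctl import threadpool_limits


@pytest.fixture(autouse=True, scope="session")
def _single_threaded_native_libs():
    with threadpool_limits(limits=1):
        yield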

@nickodell
Member

I think that in our use case, it makes sense to disable OMP parallelism globally for the unit tests. I'm not sure where this information should go, or whether it is already in the documentation, but if not, it might be worth adding somewhere, as this is a gold nugget!

I'm not sure if it's documented anywhere. I know about it because I was bitten by this on a previous project. The closest docs page is probably Debugging linear algebra related issues. There's also the upstream documentation, although that's a little light on why you'd want to restrict the thread count.
