TST introducing the random_seed fixture (scikit-learn#22749)
Co-authored-by: Julien Jerphanion <[email protected]>
Co-authored-by: Thomas J. Fan <[email protected]>
Co-authored-by: Jérémie du Boisberranger <[email protected]>
4 people committed Mar 14, 2022
1 parent 6904ae3 commit d3429ca
Showing 7 changed files with 162 additions and 2 deletions.
10 changes: 10 additions & 0 deletions azure-pipelines.yml
@@ -147,6 +147,7 @@ jobs:
BLAS: 'mkl'
COVERAGE: 'true'
SHOW_SHORT_SUMMARY: 'true'
SKLEARN_TESTS_GLOBAL_RANDOM_SEED: '42' # default global random seed

# Check compilation with Ubuntu bionic 18.04 LTS and scipy from conda-forge
- template: build_tools/azure/posix.yml
@@ -168,6 +169,7 @@ jobs:
BLAS: 'openblas'
COVERAGE: 'false'
BUILD_WITH_ICC: 'false'
SKLEARN_TESTS_GLOBAL_RANDOM_SEED: '0' # non-default seed

- template: build_tools/azure/posix.yml
parameters:
@@ -190,6 +192,7 @@ jobs:
PANDAS_VERSION: 'none'
THREADPOOLCTL_VERSION: 'min'
COVERAGE: 'false'
SKLEARN_TESTS_GLOBAL_RANDOM_SEED: '1' # non-default seed
# Linux + Python 3.8 build with OpenBLAS
py38_conda_defaults_openblas:
DISTRIB: 'conda'
@@ -201,6 +204,7 @@ jobs:
MATPLOTLIB_VERSION: 'min'
THREADPOOLCTL_VERSION: '2.2.0'
SKLEARN_ENABLE_DEBUG_CYTHON_DIRECTIVES: '1'
SKLEARN_TESTS_GLOBAL_RANDOM_SEED: '2' # non-default seed
# Linux environment to test the latest available dependencies.
# It runs tests requiring lightgbm, pandas and PyAMG.
pylatest_pip_openblas_pandas:
@@ -210,6 +214,7 @@ jobs:
CHECK_PYTEST_SOFT_DEPENDENCY: 'true'
TEST_DOCSTRINGS: 'true'
CHECK_WARNINGS: 'true'
SKLEARN_TESTS_GLOBAL_RANDOM_SEED: '3' # non-default seed

- template: build_tools/azure/posix-docker.yml
parameters:
@@ -231,6 +236,7 @@ jobs:
PYTEST_XDIST_VERSION: 'none'
PYTEST_VERSION: 'min'
THREADPOOLCTL_VERSION: '2.2.0'
SKLEARN_TESTS_GLOBAL_RANDOM_SEED: '4' # non-default seed

- template: build_tools/azure/posix.yml
parameters:
@@ -249,12 +255,14 @@ jobs:
BLAS: 'mkl'
CONDA_CHANNEL: 'conda-forge'
CPU_COUNT: '3'
SKLEARN_TESTS_GLOBAL_RANDOM_SEED: '5' # non-default seed
pylatest_conda_mkl_no_openmp:
DISTRIB: 'conda'
BLAS: 'mkl'
SKLEARN_TEST_NO_OPENMP: 'true'
SKLEARN_SKIP_OPENMP_TEST: 'true'
CPU_COUNT: '3'
SKLEARN_TESTS_GLOBAL_RANDOM_SEED: '6' # non-default seed

- template: build_tools/azure/windows.yml
parameters:
@@ -280,6 +288,8 @@ jobs:
# Temporary fix for setuptools to use distutils from standard lib
# https://github.com/numpy/numpy/issues/17216
SETUPTOOLS_USE_DISTUTILS: 'stdlib'
SKLEARN_TESTS_GLOBAL_RANDOM_SEED: '7' # non-default seed
py38_pip_openblas_32bit:
PYTHON_VERSION: '3.8'
PYTHON_ARCH: '32'
SKLEARN_TESTS_GLOBAL_RANDOM_SEED: '8' # non-default seed
7 changes: 7 additions & 0 deletions build_tools/azure/test_script.sh
@@ -15,6 +15,13 @@ if [[ "$BUILD_WITH_ICC" == "true" ]]; then
    source /opt/intel/oneapi/setvars.sh
fi

if [[ "$BUILD_REASON" == "Schedule" ]]; then
    # Enable global random seed randomization to discover seed-sensitive tests
    # only on nightly builds.
    # https://scikit-learn.org/stable/computing/parallelism.html#environment-variables
    export SKLEARN_TESTS_GLOBAL_RANDOM_SEED="any"
fi

mkdir -p $TEST_DIR
cp setup.cfg $TEST_DIR
cd $TEST_DIR
54 changes: 54 additions & 0 deletions doc/computing/parallelism.rst
@@ -194,6 +194,60 @@ These environment variables should be set before importing scikit-learn.
Sets the seed of the global random generator when running the tests,
for reproducibility.

Note that scikit-learn tests are expected to run deterministically with
explicit seeding of their own independent RNG instances instead of relying
on the numpy or Python standard library RNG singletons to make sure that
test results are independent of the test execution order. However some
tests might forget to use explicit seeding and this variable is a way to
control the initial state of the aforementioned singletons.
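The difference between explicit seeding and relying on a global singleton can be illustrated with a small sketch. It uses only the standard library `random` module as a stand-in for the numpy singleton mentioned above, and the helper names `draw_with_local_rng` and `draw_with_global_rng` are hypothetical, chosen for illustration only.

```python
import random


def draw_with_local_rng(seed):
    # Explicit seeding: a private RNG instance, independent of global state.
    rng = random.Random(seed)
    return [rng.random() for _ in range(3)]


def draw_with_global_rng():
    # Implicit seeding: draws from the module-level singleton.
    return [random.random() for _ in range(3)]


# First run: the "test" executes right after the global seed is set.
random.seed(0)
first_local = draw_with_local_rng(7)
first_global = draw_with_global_rng()

# Second run: another "test" consumes global state before ours executes,
# which mimics a change in test execution order.
random.seed(0)
random.random()
second_local = draw_with_local_rng(7)
second_global = draw_with_global_rng()

print(first_local == second_local)    # the local RNG is order-independent
print(first_global == second_global)  # the global RNG is not
```

The locally seeded draws are identical in both runs, while the draws from the global singleton shift as soon as other code consumes random numbers first.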

:SKLEARN_TESTS_GLOBAL_RANDOM_SEED:

Controls the seeding of the random number generator used in tests that
rely on the ``global_random_seed`` fixture.

All tests that use this fixture accept the contract that they should
deterministically pass for any seed value from 0 to 99 included.

If the SKLEARN_TESTS_GLOBAL_RANDOM_SEED environment variable is set to
"any" (which should be the case on nightly builds on the CI), the fixture
will choose an arbitrary seed in the above range (based on the BUILD_NUMBER
or the current day) and all fixtured tests will run for that specific seed.
The goal is to ensure that, over time, our CI will run all tests with
different seeds while keeping the test duration of a single run of the full
test suite limited. This will check that the assertions of tests
written to use this fixture are not dependent on a specific seed value.

The range of admissible seed values is limited to [0, 99] because it is
often not possible to write a test that can work for any possible seed and
we want to avoid having tests that randomly fail on the CI.

Valid values for SKLEARN_TESTS_GLOBAL_RANDOM_SEED:

- SKLEARN_TESTS_GLOBAL_RANDOM_SEED="42": run tests with a fixed seed of 42
- SKLEARN_TESTS_GLOBAL_RANDOM_SEED="40-42": run the tests with all seeds
between 40 and 42 included
- SKLEARN_TESTS_GLOBAL_RANDOM_SEED="any": run the tests with an arbitrary
seed selected between 0 and 99 included
- SKLEARN_TESTS_GLOBAL_RANDOM_SEED="all": run the tests with all seeds
between 0 and 99 included

If the variable is not set, then 42 is used as the global seed in a
deterministic manner. This ensures that, by default, the scikit-learn test
suite is as deterministic as possible to avoid disrupting our friendly
third-party package maintainers. Similarly, this variable should not be set
in the CI config of pull-requests to make sure that our friendly
contributors are not the first people to encounter a seed-sensitivity
regression in a test unrelated to the changes of their own PR. Only the
scikit-learn maintainers who watch the results of the nightly builds are
expected to be annoyed by this.
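As a rough illustration of the accepted values described above, the following standalone sketch maps a raw value of the variable to the list of seeds it selects. The function name `parse_global_random_seed` and its `rng` argument are hypothetical and not part of scikit-learn; the plugin in this commit performs the equivalent logic inside `pytest_configure`.

```python
from random import Random

SEED_RANGE = range(100)  # admissible seeds: 0 to 99 included


def parse_global_random_seed(value, rng=None):
    """Return the seeds selected by a SKLEARN_TESTS_GLOBAL_RANDOM_SEED value.

    `value` is the raw string, or None when the variable is unset.
    `rng` is a hypothetical hook that makes the "any" branch controllable.
    """
    if value is None:
        return [42]  # documented default when the variable is unset
    if value == "any":
        rng = rng if rng is not None else Random()
        return [rng.choice(list(SEED_RANGE))]
    if value == "all":
        return list(SEED_RANGE)
    if "-" in value:
        # Inclusive range such as "40-42".
        start, stop = value.split("-")
        seeds = list(range(int(start), int(stop) + 1))
    else:
        seeds = [int(value)]
    if not seeds or min(seeds) < 0 or max(seeds) > 99:
        raise ValueError(
            "SKLEARN_TESTS_GLOBAL_RANDOM_SEED must be in the range [0, 99] "
            f"(or 'any' or 'all'), got: {value}"
        )
    return seeds


print(parse_global_random_seed("40-42"))
print(parse_global_random_seed(None))
```

Rejecting empty ranges such as "50-40" with an explicit check is a choice made for this sketch; it is not taken from the documentation.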

When writing a new test function that uses this fixture, please use the
following command to make sure that it passes deterministically for all
admissible seeds on your local machine:

SKLEARN_TESTS_GLOBAL_RANDOM_SEED="all" pytest -v -k test_your_test_name

:SKLEARN_SKIP_NETWORK_TESTS:

When this environment variable is set to a non zero value, the tests
4 changes: 4 additions & 0 deletions setup.cfg
@@ -10,6 +10,10 @@ addopts =
--doctest-modules
--disable-pytest-warnings
--color=yes
# Activate the plugin explicitly to ensure that the seed is reported
# correctly on the CI when running `pytest --pyargs sklearn` from the
# source folder.
-p sklearn.tests.random_seed
-rN

filterwarnings =
4 changes: 2 additions & 2 deletions sklearn/cluster/tests/test_k_means.py
@@ -148,9 +148,9 @@ def test_relocate_empty_clusters(array_constr):
     "array_constr", [np.array, sp.csr_matrix], ids=["dense", "sparse"]
 )
 @pytest.mark.parametrize("tol", [1e-2, 1e-8, 1e-100, 0])
-def test_kmeans_elkan_results(distribution, array_constr, tol):
+def test_kmeans_elkan_results(distribution, array_constr, tol, global_random_seed):
     # Check that results are identical between lloyd and elkan algorithms
-    rnd = np.random.RandomState(0)
+    rnd = np.random.RandomState(global_random_seed)
     if distribution == "normal":
         X = rnd.normal(size=(5000, 10))
     else:
4 changes: 4 additions & 0 deletions sklearn/conftest.py
@@ -21,6 +21,10 @@
from sklearn.datasets import fetch_rcv1


# This plugin is necessary to define the random seed fixture
pytest_plugins = ("sklearn.tests.random_seed",)


if parse_version(pytest.__version__) < parse_version(PYTEST_MIN_VERSION):
    raise ImportError(
        "Your version of pytest is too old, you should have "
81 changes: 81 additions & 0 deletions sklearn/tests/random_seed.py
@@ -0,0 +1,81 @@
"""global_random_seed fixture
The goal of this fixture is to prevent tests that use it from being sensitive
to a specific seed value while still being deterministic by default.
See the documentation for the SKLEARN_TESTS_GLOBAL_RANDOM_SEED
variable for instructions on how to use this fixture.
https://scikit-learn.org/dev/computing/parallelism.html#environment-variables
"""
import pytest
from os import environ
from random import Random


# Passes the main worker's random seeds to workers
class XDistHooks:
    def pytest_configure_node(self, node) -> None:
        random_seeds = node.config.getoption("random_seeds")
        node.workerinput["random_seeds"] = random_seeds


def pytest_configure(config):
    if config.pluginmanager.hasplugin("xdist"):
        config.pluginmanager.register(XDistHooks())

    RANDOM_SEED_RANGE = list(range(100))  # All seeds in [0, 99] should be valid.
    random_seed_var = environ.get("SKLEARN_TESTS_GLOBAL_RANDOM_SEED")
    # pytest-xdist exposes the data sent by the main process on workers as
    # `config.workerinput`.
    if hasattr(config, "workerinput"):
        # Set worker random seed from seed generated from main process
        random_seeds = config.workerinput["random_seeds"]
    elif random_seed_var is None:
        # This is the way.
        random_seeds = [42]
    elif random_seed_var == "any":
        # Pick one seed at random in the range of admissible random seeds.
        random_seeds = [Random().choice(RANDOM_SEED_RANGE)]
    elif random_seed_var == "all":
        random_seeds = RANDOM_SEED_RANGE
    else:
        if "-" in random_seed_var:
            start, stop = random_seed_var.split("-")
            random_seeds = list(range(int(start), int(stop) + 1))
        else:
            random_seeds = [int(random_seed_var)]

        if min(random_seeds) < 0 or max(random_seeds) > 99:
            raise ValueError(
                "The value(s) of the environment variable "
                "SKLEARN_TESTS_GLOBAL_RANDOM_SEED must be in the range [0, 99] "
                f"(or 'any' or 'all'), got: {random_seed_var}"
            )
    config.option.random_seeds = random_seeds

    class GlobalRandomSeedPlugin:
        @pytest.fixture(params=random_seeds)
        def global_random_seed(self, request):
            """Fixture to ask for a random yet controllable random seed.
            All tests that use this fixture accept the contract that they should
            deterministically pass for any seed value from 0 to 99 included.
            See the documentation for the SKLEARN_TESTS_GLOBAL_RANDOM_SEED
            variable for instructions on how to use this fixture.
            https://scikit-learn.org/dev/computing/parallelism.html#environment-variables
            """
            yield request.param

    config.pluginmanager.register(GlobalRandomSeedPlugin())


def pytest_report_header(config):
    random_seed_var = environ.get("SKLEARN_TESTS_GLOBAL_RANDOM_SEED")
    if random_seed_var == "any":
        return [
            "To reproduce this test run, set the following environment variable:",
            f'    SKLEARN_TESTS_GLOBAL_RANDOM_SEED="{config.option.random_seeds[0]}"',
            "See: https://scikit-learn.org/dev/computing/parallelism.html"
            "#environment-variables",
        ]