
Fix multinode tests script #1631

Merged: 16 commits into pytorch:master on Feb 14, 2021
Conversation

@fco-dv (Contributor) commented Feb 11, 2021

Fixes #1627

Description: Try to fix run_multinode_tests_in_docker.sh

  • Add script parameters: nnodes | nproc_per_node | gpu
  • Enable gpu tests
  • Put the docker rm steps at the end
  • Test results on multiple CPU and GPU configurations: GPU tests are almost OK but should be investigated, since the behavior is not identical when some tests are previously deactivated (see the different configurations below). Thanks @sdesrozis for your help!
Run on a node with 2 GPUs

 

Script test configurations:

nnodes  |  nproc_per_node  |  gpu (enabled)

 

2  |  1   |  0

 

=============================== warnings summary ===============================

tests/ignite/metrics/test_precision.py::test_multinode_distrib_cpu

tests/ignite/metrics/test_recall.py::test_multinode_distrib_cpu

  /workspace/ignite/metrics/precision.py:29: RuntimeWarning: Precision/Recall metrics do not work in distributed setting when average=False and is_multilabel=True. Results are not reduced across computing devices. Computed result corresponds to the local rank's (single process) result.

    RuntimeWarning,

 

-- Docs: https://docs.pytest.org/en/stable/warnings.html

 

================= 24 passed, 26 skipped, 2 warnings in 32.53s ==================

 

2  |  4   |  0   (default)

 

=============================== warnings summary ===============================

tests/ignite/metrics/test_precision.py::test_multinode_distrib_cpu

tests/ignite/metrics/test_precision.py::test_multinode_distrib_cpu

tests/ignite/metrics/test_precision.py::test_multinode_distrib_cpu

tests/ignite/metrics/test_precision.py::test_multinode_distrib_cpu

tests/ignite/metrics/test_recall.py::test_multinode_distrib_cpu

tests/ignite/metrics/test_recall.py::test_multinode_distrib_cpu

tests/ignite/metrics/test_recall.py::test_multinode_distrib_cpu

tests/ignite/metrics/test_recall.py::test_multinode_distrib_cpu

  /workspace/ignite/metrics/precision.py:29: RuntimeWarning: Precision/Recall metrics do not work in distributed setting when average=False and is_multilabel=True. Results are not reduced across computing devices. Computed result corresponds to the local rank's (single process) result.

    RuntimeWarning,

 

-- Docs: https://docs.pytest.org/en/stable/warnings.html

=========================== short test summary info ============================

FAILED tests/ignite/engine/test_deterministic.py::test_multinode_distrib_cpu

FAILED tests/ignite/engine/test_deterministic.py::test_multinode_distrib_cpu

FAILED tests/ignite/engine/test_deterministic.py::test_multinode_distrib_cpu

FAILED tests/ignite/engine/test_deterministic.py::test_multinode_distrib_cpu

FAILED tests/ignite/metrics/test_loss.py::test_multinode_distrib_cpu - Assert...

FAILED tests/ignite/metrics/test_loss.py::test_multinode_distrib_cpu - Assert...

FAILED tests/ignite/metrics/test_loss.py::test_multinode_distrib_cpu - Assert...

FAILED tests/ignite/metrics/test_loss.py::test_multinode_distrib_cpu - Assert...

 

======= 8 failed, 88 passed, 98 skipped, 8 warnings in 441.54s (0:07:21) =======

 

 

4  |  4  |  0

 

=============================== warnings summary ===============================

tests/ignite/metrics/test_precision.py::test_multinode_distrib_cpu

tests/ignite/metrics/test_precision.py::test_multinode_distrib_cpu

tests/ignite/metrics/test_precision.py::test_multinode_distrib_cpu

tests/ignite/metrics/test_precision.py::test_multinode_distrib_cpu

tests/ignite/metrics/test_recall.py::test_multinode_distrib_cpu

tests/ignite/metrics/test_recall.py::test_multinode_distrib_cpu

tests/ignite/metrics/test_recall.py::test_multinode_distrib_cpu

tests/ignite/metrics/test_recall.py::test_multinode_distrib_cpu

  /workspace/ignite/metrics/precision.py:29: RuntimeWarning: Precision/Recall metrics do not work in distributed setting when average=False and is_multilabel=True. Results are not reduced across computing devices. Computed result corresponds to the local rank's (single process) result.

    RuntimeWarning,

 

-- Docs: https://docs.pytest.org/en/stable/warnings.html

=========================== short test summary info ============================

FAILED tests/ignite/engine/test_deterministic.py::test_multinode_distrib_cpu

FAILED tests/ignite/engine/test_deterministic.py::test_multinode_distrib_cpu

FAILED tests/ignite/engine/test_deterministic.py::test_multinode_distrib_cpu

FAILED tests/ignite/engine/test_deterministic.py::test_multinode_distrib_cpu

 

======= 4 failed, 92 passed, 98 skipped, 8 warnings in 654.51s (0:10:54) =======

 

 

GPU configuration (1 node <-> 1 GPU):

 

2  |  1   | 1

 

After the first test passed (tests/ignite/contrib/engines/test_common.py::test_multinode_distrib_gpu), all the remaining tests failed:

 

================== 1 passed, 26 skipped, 23 errors in 45.69s ===================

 

        if _default_pg is not None:

>           raise RuntimeError("trying to initialize the default process group "

                               "twice!")

E           RuntimeError: trying to initialize the default process group twice!

 

/opt/conda/lib/python3.7/site-packages/torch/distributed/distributed_c10d.py:387: RuntimeError

 

 

When ignoring these tests with -k 'not test_common', the remaining tests are OK:

 

=============================== warnings summary ===============================

tests/ignite/metrics/test_precision.py::test_multinode_distrib_gpu

tests/ignite/metrics/test_recall.py::test_multinode_distrib_gpu

  /workspace/ignite/metrics/precision.py:29: RuntimeWarning: Precision/Recall metrics do not work in distributed setting when average=False and is_multilabel=True. Results are not reduced across computing devices. Computed result corresponds to the local rank's (single process) result.

    RuntimeWarning,

 

-- Docs: https://docs.pytest.org/en/stable/warnings.html

 

================= 23 passed, 25 skipped, 2 warnings in 38.83s ==================

Check list:

  • New tests are added (if a new feature is added)
  • New doc strings: description and/or example code are in RST format
  • Documentation is updated (if required)

@vfdev-5 (Collaborator) left a comment


Thanks for the PR @fco-dv ! Looks good to me 👍
Let me also try it on my infra.

Do you think we should integrate it into Circle CI?

@vfdev-5 (Collaborator) commented Feb 11, 2021

As for the failing tests, let's decorate tests/ignite/engine/test_deterministic.py::test_multinode_distrib_cpu with xfail, since we allow it to fail.

What are the problems with FAILED tests/ignite/metrics/test_loss.py::test_multinode_distrib_cpu ?
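Marking the flaky test as expected-to-fail could look like the following sketch (assuming pytest; the test body and reason string are illustrative, not the actual fix):

```python
import pytest

# Hypothetical sketch: mark the flaky multinode CPU test as expected to fail,
# so a failure does not break the suite while a pass is still reported (XPASS).
@pytest.mark.xfail(reason="known flaky in multinode CPU runs", strict=False)
def test_multinode_distrib_cpu():
    # ... the actual distributed assertions would go here ...
    raise AssertionError("simulated nondeterministic failure")
```

With strict=False (the default), an unexpected pass is reported as XPASS rather than failing the run.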

@fco-dv (Contributor, author) commented Feb 11, 2021

For FAILED tests/ignite/metrics/test_loss.py::test_multinode_distrib_cpu, it's a float precision error; here are the details:

_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
tests/ignite/metrics/test_loss.py:109: in _test_distrib_compute_on_criterion
    _test("cpu")
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 

>       assert_almost_equal(res, true_loss_value.item())
E       AssertionError: 
E       Arrays are not almost equal to 7 decimals
E        ACTUAL: 1.1080787181854248
E        DESIRED: 1.108078956604004

tests/ignite/metrics/test_loss.py:107: AssertionError

@vfdev-5 (Collaborator) commented Feb 11, 2021

> For FAILED tests/ignite/metrics/test_loss.py::test_multinode_distrib_cpu, it's a float precision error (details quoted above).

OK, it is a precision issue, let's add a tol option as done for XLA.
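A tolerance-based comparison along these lines would make the check robust. This is a sketch using the values from the failure above and the stdlib; the exact helper used in the actual fix may differ:

```python
import math

# Values from the failing assertion above: they differ only at the 7th decimal.
actual = 1.1080787181854248
desired = 1.108078956604004

# A 7-decimal absolute comparison (what assert_almost_equal does) fails:
strict_ok = abs(actual - desired) < 0.5e-7        # False

# With a relative tolerance of 1e-6, as proposed, the comparison passes:
loose_ok = math.isclose(actual, desired, rel_tol=1e-6)  # True
```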

@sdesrozis (Contributor) commented:

@fco-dv have you tackled the issue concerning test_common that we have been facing recently?

@fco-dv (Contributor, author) commented Feb 11, 2021

@sdesrozis not yet; my guess is that the process group is not destroyed at the end of the test_common training step, but I'm not quite sure how to address that...

@vfdev-5 (Collaborator) commented Feb 11, 2021

@fco-dv which issue is this about?

@fco-dv (Contributor, author) commented Feb 11, 2021

@vfdev-5 when running with GPU enabled, test_common PASSED but the others FAILED due to RuntimeError: trying to initialize the default process group twice!; when skipping this test, the others PASSED.

@vfdev-5 (Collaborator) commented Feb 11, 2021

> @vfdev-5 when running with gpu enabled, test_common PASSED but others FAILED due to RuntimeError: trying to initialize the default process group twice! , but when skipping this test, others PASSED.

Maybe we can try to do the same as here: https://github.com/pytorch/ignite/blob/master/tests/ignite/conftest.py#L104 ?

@fco-dv (Contributor, author) commented Feb 12, 2021

Managed to fix the GPU tests with:
fco-dv@697dfae#diff-939e5dec7f6dffa9b3cd5721a69781e2cbc81b052c5676d8a1a0667915f808f5
There is certainly a more elegant way to do this...
Test configuration:
2 | 1 | 1

=============================== warnings summary ===============================
tests/ignite/metrics/test_precision.py::test_multinode_distrib_gpu
tests/ignite/metrics/test_recall.py::test_multinode_distrib_gpu
  /workspace/ignite/metrics/precision.py:29: RuntimeWarning: Precision/Recall metrics do not work in distributed setting when average=False and is_multilabel=True. Results are not reduced across computing devices. Computed result corresponds to the local rank's (single process) result.
    RuntimeWarning,

-- Docs: https://docs.pytest.org/en/stable/warnings.html
================= 24 passed, 26 skipped, 2 warnings in 37.98s ==================

@fco-dv (Contributor, author) commented Feb 13, 2021

Seems ok now for:

Default conf: 2 | 4 | 0

=============================== warnings summary ===============================
tests/ignite/metrics/test_precision.py::test_multinode_distrib_cpu
tests/ignite/metrics/test_precision.py::test_multinode_distrib_cpu
tests/ignite/metrics/test_precision.py::test_multinode_distrib_cpu
tests/ignite/metrics/test_precision.py::test_multinode_distrib_cpu
tests/ignite/metrics/test_recall.py::test_multinode_distrib_cpu
tests/ignite/metrics/test_recall.py::test_multinode_distrib_cpu
tests/ignite/metrics/test_recall.py::test_multinode_distrib_cpu
tests/ignite/metrics/test_recall.py::test_multinode_distrib_cpu
  /workspace/ignite/metrics/precision.py:29: RuntimeWarning: Precision/Recall metrics do not work in distributed setting when average=False and is_multilabel=True. Results are not reduced across computing devices. Computed result corresponds to the local rank's (single process) result.
    RuntimeWarning,

-- Docs: https://docs.pytest.org/en/stable/warnings.html
====== 92 passed, 98 skipped, 4 xfailed, 8 warnings in 507.07s (0:08:27) =======

and with gpu: 2 | 1 | 1

=============================== warnings summary ===============================
tests/ignite/metrics/test_precision.py::test_multinode_distrib_gpu
tests/ignite/metrics/test_recall.py::test_multinode_distrib_gpu
  /workspace/ignite/metrics/precision.py:29: RuntimeWarning: Precision/Recall metrics do not work in distributed setting when average=False and is_multilabel=True. Results are not reduced across computing devices. Computed result corresponds to the local rank's (single process) result.
    RuntimeWarning,

-- Docs: https://docs.pytest.org/en/stable/warnings.html
============ 24 passed, 26 skipped, 2 warnings in 77.77s (0:01:17) =============

@vfdev-5 for the CI integration, would you like me to create another PR or continue on this one? Thanks!

@vfdev-5 (Collaborator) commented Feb 13, 2021

@fco-dv Thanks! Let's merge it as is; for Circle CI, I'll enable it on PRs and we'll integrate it in another PR.

(Review comment on tests/ignite/metrics/test_loss.py: outdated, resolved)
@vfdev-5 (Collaborator) left a comment


Thanks for the PR @fco-dv !

@vfdev-5 merged commit 1588081 into pytorch:master on Feb 14, 2021
@fco-dv deleted the fix_multinode_tests_script branch on February 14, 2021, 20:36
fco-dv added a commit to fco-dv/ignite that referenced this pull request Feb 15, 2021
* fix run_multinode_tests_in_docker.sh : run tests with docker python version

* add missing modules

* build an image with test env and add 'nnodes' 'nproc_per_node' 'gpu' as parameters

* pytorch#1615 : change nproc_per_node default to 4

* pytorch#1615 : fix for gpu enabled tests / container rm step at the end of the script

* add xfail decorator for tests/ignite/engine/test_deterministic.py::test_multinode_distrib_cpu

* fix script gpu_options

* add default tol=1e-6 for _test_distrib_compute_on_criterion

* fix for "RuntimeError: trying to initialize the default process group twice!"

* tolerance for test_multinode_distrib_cpu case only

* fix assert None error

* autopep8 fix

Co-authored-by: vfdev <[email protected]>
Co-authored-by: Sylvain Desroziers <[email protected]>
Co-authored-by: fco-dv <[email protected]>
vfdev-5 added a commit that referenced this pull request Feb 21, 2021
* Recall/Precision metrics for ddp : average == false and multilabel == true

* For v0.4.3 - Add more versionadded, versionchanged tags - Change v0.5… (#1612)

* For v0.4.3 - Add more versionadded, versionchanged tags - Change v0.5.0 to v0.4.3

* Update ignite/contrib/metrics/regression/canberra_metric.py

Co-authored-by: vfdev <[email protected]>

* Update ignite/contrib/metrics/regression/manhattan_distance.py

Co-authored-by: vfdev <[email protected]>

* Update ignite/contrib/metrics/regression/r2_score.py

Co-authored-by: vfdev <[email protected]>

* Update ignite/handlers/checkpoint.py

Co-authored-by: vfdev <[email protected]>

* address PR comments

Co-authored-by: vfdev <[email protected]>

* added TimeLimit handler with its test and doc (#1611)

* added TimeLimit handler with its test and doc

* fixed documentation

* fixed docstring and formatting

* flake8 fix trailing whitespace :)

* modified class logger , default value and tests

* changed rounding to nearest integer

* tests refactored , docs modified

* fixed default value , removed global logger

* fixing formatting

* Added versionadded

* added test for engine termination

Co-authored-by: vfdev <[email protected]>

* Update handlers to use setup_logger (#1617)

* Fixes #1614
- Updated handlers EarlyStopping and TerminateOnNan
- Replaced `logging.getLogger` with `setup_logger` in the mentioned handlers

* Updated `TimeLimit` handler.
Replaced use of `logger.getLogger` with `setup_logger` from `ignite.utils`

Co-authored-by: Pradyumna Rahul K <[email protected]>
Co-authored-by: Sylvain Desroziers <[email protected]>

* Managing Deprecation using decorators (#1585)

* Starter code for managing deprecation

* Make functions deprecated using the `@deprecated` decorator
* Add arguments to the @deprecated decorator to customize it for each function

* Improve `@deprecated` decorator and add tests

* Replaced the `raise` keyword with added `warnings`
* Added tests several possibilities of the decorator usage

* Removing the test deprecation to check tests

* Add static typing, fix mypy errors

* Make `@deprecated` to raise Exceptions or Warning

* The `@deprecated` decorator will now always emit warning unless explicitly asked to raise an Exception

* Fix mypy errors

* Fix mypy errors (hopefully)

* Fix the test `test_deprecated_setup_any_logging`

* Change the test to work with the `@deprecated` decorator

* Change to snake_case, handle mypy ignores

* Improve Type Annotations

* Update common.py

* For v0.4.3 - Add more versionadded, versionchanged tags - Change v0.5… (#1612)

* For v0.4.3 - Add more versionadded, versionchanged tags - Change v0.5.0 to v0.4.3

* Update ignite/contrib/metrics/regression/canberra_metric.py

Co-authored-by: vfdev <[email protected]>

* Update ignite/contrib/metrics/regression/manhattan_distance.py

Co-authored-by: vfdev <[email protected]>

* Update ignite/contrib/metrics/regression/r2_score.py

Co-authored-by: vfdev <[email protected]>

* Update ignite/handlers/checkpoint.py

Co-authored-by: vfdev <[email protected]>

* address PR comments

Co-authored-by: vfdev <[email protected]>

* `version` -> version

Co-authored-by: vfdev <[email protected]>
Co-authored-by: François COKELAER <[email protected]>
Co-authored-by: Sylvain Desroziers <[email protected]>

* Create documentation.md

* Distributed tests on Windows should be skipped until fixed. (#1620)

* modified CONTRIBUTING.md

* bash instead of sh

* Added Checkpoint.get_default_score_fn (#1621)

* Added Checkpoint.get_default_score_fn to simplify best_model_handler creation

* Added score_sign argument

* Updated docs

* Update about.rst

* Update pre-commit hooks and CONTRIBUTING.md (#1622)

* Change pre-commit config and CONTRIBUTING.md

- Update hook versions
- Remove seed-isort-config
- Add black profile to isort

* Fix files based on new pre-commit config

* Add meaningful exclusions to prettier

- Also update actions workflow files to match local pre-commit

* added requirements.txt and updated readme.md (#1624)

* added requirements.txt and updated readme.md

* Update examples/contrib/cifar10/README.md

Co-authored-by: vfdev <[email protected]>

* Update examples/contrib/cifar10/requirements.txt

Co-authored-by: vfdev <[email protected]>

Co-authored-by: vfdev <[email protected]>

* Replace relative paths with raw.githubusercontent (#1629)

* Updated cifar10 example (#1632)

* Updates for cifar10 example

* Updates for cifar10 example

* More updates

* Updated code

* Fixed code-formatting

* Fixed failling CI and typos for cifar10 examples (#1633)

* Updates for cifar10 example

* Updates for cifar10 example

* More updates

* Updated code

* Fixed code-formatting

* Fixed typo and failing CI

* Fixed hvd spawn fail and better synced qat code

* Removed temporary hack to install pth 1.7.1 (#1638)

- updated default pth image for gpu tests
- updated TORCH_CUDA_ARCH_LIST
- fixed /merge -> /head in trigger ci pipeline

* [docker] Pillow -> Pillow-SIMD (#1509) (#1639)

* [docker] Pillow -> Pillow-SIMD (#1509)

* [docker] Pillow -> Pillow-SIMD

* replace pillow with pillow-simd in base docker files

* chore(docker): apt-get autoremove after pillow-simd installation

* apt-get install at once, autoremove g++

* install g++ in pillow installation layer

Co-authored-by: Sylvain Desroziers <[email protected]>

* Fix g++ install issue

Co-authored-by: Jeff Yang <[email protected]>
Co-authored-by: Sylvain Desroziers <[email protected]>

* Fix multinode tests script (#1631)

* fix run_multinode_tests_in_docker.sh : run tests with docker python version

* add missing modules

* build an image with test env and add 'nnodes' 'nproc_per_node' 'gpu' as parameters

* #1615 : change nproc_per_node default to 4

* #1615 : fix for gpu enabled tests / container rm step at the end of the script

* add xfail decorator for tests/ignite/engine/test_deterministic.py::test_multinode_distrib_cpu

* fix script gpu_options

* add default tol=1e-6 for _test_distrib_compute_on_criterion

* fix for "RuntimeError: trying to initialize the default process group twice!"

* tolerance for test_multinode_distrib_cpu case only

* fix assert None error

* autopep8 fix

Co-authored-by: vfdev <[email protected]>
Co-authored-by: Sylvain Desroziers <[email protected]>
Co-authored-by: fco-dv <[email protected]>

* remove warning for average=False and is_multilabel=True

* update docstring and {precision, recall} tests according to test_multilabel_input_NCHW

Co-authored-by: vfdev <[email protected]>
Co-authored-by: Ahmed Omar <[email protected]>
Co-authored-by: Pradyumna Rahul <[email protected]>
Co-authored-by: Pradyumna Rahul K <[email protected]>
Co-authored-by: Sylvain Desroziers <[email protected]>
Co-authored-by: Devanshu Shah <[email protected]>
Co-authored-by: Debojyoti Chakraborty <[email protected]>
Co-authored-by: Jeff Yang <[email protected]>
Co-authored-by: fco-dv <[email protected]>

Successfully merging this pull request may close these issues.

run_multinode_tests_in_docker.sh: FileNotFoundError: [Errno 2] No such file or directory: 'python3.6'