
Fix multinode tests script #1631

Merged: 16 commits into pytorch:master on Feb 14, 2021
Conversation

@fco-dv (Contributor) commented Feb 11, 2021

Fixes #1627

Description: Try to fix run_multinode_tests_in_docker.sh

  • Add script parameters: nnodes | nproc_per_node | gpu
  • Enable gpu tests
  • Put the docker rm steps at the end
  • Test results on multiple CPU and GPU configurations: GPU tests are almost OK but should be investigated, since the behavior is not identical when some tests are previously deactivated (see the different configurations below). Thanks @sdesrozis for your help!
Run on a node with 2 GPUs

 

Script test configurations:

nnodes  |  nproc_per_node  |  gpu (enabled)

 

2  |  1   |  0

 

=============================== warnings summary ===============================

tests/ignite/metrics/test_precision.py::test_multinode_distrib_cpu

tests/ignite/metrics/test_recall.py::test_multinode_distrib_cpu

  /workspace/ignite/metrics/precision.py:29: RuntimeWarning: Precision/Recall metrics do not work in distributed setting when average=False and is_multilabel=True. Results are not reduced across computing devices. Computed result corresponds to the local rank's (single process) result.

    RuntimeWarning,

 

-- Docs: https://docs.pytest.org/en/stable/warnings.html

 

================= 24 passed, 26 skipped, 2 warnings in 32.53s ==================

 

2  |  4   |  0   (default)

 

=============================== warnings summary ===============================

tests/ignite/metrics/test_precision.py::test_multinode_distrib_cpu

tests/ignite/metrics/test_precision.py::test_multinode_distrib_cpu

tests/ignite/metrics/test_precision.py::test_multinode_distrib_cpu

tests/ignite/metrics/test_precision.py::test_multinode_distrib_cpu

tests/ignite/metrics/test_recall.py::test_multinode_distrib_cpu

tests/ignite/metrics/test_recall.py::test_multinode_distrib_cpu

tests/ignite/metrics/test_recall.py::test_multinode_distrib_cpu

tests/ignite/metrics/test_recall.py::test_multinode_distrib_cpu

  /workspace/ignite/metrics/precision.py:29: RuntimeWarning: Precision/Recall metrics do not work in distributed setting when average=False and is_multilabel=True. Results are not reduced across computing devices. Computed result corresponds to the local rank's (single process) result.

    RuntimeWarning,

 

-- Docs: https://docs.pytest.org/en/stable/warnings.html

=========================== short test summary info ============================

FAILED tests/ignite/engine/test_deterministic.py::test_multinode_distrib_cpu

FAILED tests/ignite/engine/test_deterministic.py::test_multinode_distrib_cpu

FAILED tests/ignite/engine/test_deterministic.py::test_multinode_distrib_cpu

FAILED tests/ignite/engine/test_deterministic.py::test_multinode_distrib_cpu

FAILED tests/ignite/metrics/test_loss.py::test_multinode_distrib_cpu - Assert...

FAILED tests/ignite/metrics/test_loss.py::test_multinode_distrib_cpu - Assert...

FAILED tests/ignite/metrics/test_loss.py::test_multinode_distrib_cpu - Assert...

FAILED tests/ignite/metrics/test_loss.py::test_multinode_distrib_cpu - Assert...

 

======= 8 failed, 88 passed, 98 skipped, 8 warnings in 441.54s (0:07:21) =======

 

 

4  |  4  |  0

 

=============================== warnings summary ===============================

tests/ignite/metrics/test_precision.py::test_multinode_distrib_cpu

tests/ignite/metrics/test_precision.py::test_multinode_distrib_cpu

tests/ignite/metrics/test_precision.py::test_multinode_distrib_cpu

tests/ignite/metrics/test_precision.py::test_multinode_distrib_cpu

tests/ignite/metrics/test_recall.py::test_multinode_distrib_cpu

tests/ignite/metrics/test_recall.py::test_multinode_distrib_cpu

tests/ignite/metrics/test_recall.py::test_multinode_distrib_cpu

tests/ignite/metrics/test_recall.py::test_multinode_distrib_cpu

  /workspace/ignite/metrics/precision.py:29: RuntimeWarning: Precision/Recall metrics do not work in distributed setting when average=False and is_multilabel=True. Results are not reduced across computing devices. Computed result corresponds to the local rank's (single process) result.

    RuntimeWarning,

 

-- Docs: https://docs.pytest.org/en/stable/warnings.html

=========================== short test summary info ============================

FAILED tests/ignite/engine/test_deterministic.py::test_multinode_distrib_cpu

FAILED tests/ignite/engine/test_deterministic.py::test_multinode_distrib_cpu

FAILED tests/ignite/engine/test_deterministic.py::test_multinode_distrib_cpu

FAILED tests/ignite/engine/test_deterministic.py::test_multinode_distrib_cpu

 

======= 4 failed, 92 passed, 98 skipped, 8 warnings in 654.51s (0:10:54) =======

 

 

GPU configuration (1 node <-> 1 GPU):

 

2  |  1   | 1

 

After the first test passed (tests/ignite/contrib/engines/test_common.py::test_multinode_distrib_gpu), all the remaining tests failed:

 

================== 1 passed, 26 skipped, 23 errors in 45.69s ===================

 

        if _default_pg is not None:

>           raise RuntimeError("trying to initialize the default process group "

                               "twice!")

E           RuntimeError: trying to initialize the default process group twice!

 

/opt/conda/lib/python3.7/site-packages/torch/distributed/distributed_c10d.py:387: RuntimeError

 

 

When ignoring these tests with -k 'not test_common', the remaining tests are OK:

 

=============================== warnings summary ===============================

tests/ignite/metrics/test_precision.py::test_multinode_distrib_gpu

tests/ignite/metrics/test_recall.py::test_multinode_distrib_gpu

  /workspace/ignite/metrics/precision.py:29: RuntimeWarning: Precision/Recall metrics do not work in distributed setting when average=False and is_multilabel=True. Results are not reduced across computing devices. Computed result corresponds to the local rank's (single process) result.

    RuntimeWarning,

 

-- Docs: https://docs.pytest.org/en/stable/warnings.html

 

================= 23 passed, 25 skipped, 2 warnings in 38.83s ==================

Check list:

  • New tests are added (if a new feature is added)
  • New doc strings: description and/or example code are in RST format
  • Documentation is updated (if required)

@vfdev-5 (Collaborator) left a comment


Thanks for the PR @fco-dv ! Looks good to me 👍
Let me also try it on my infra.

Do you think we should integrate it into Circle CI?

@vfdev-5 (Collaborator) commented Feb 11, 2021

As for the failing tests, let's decorate tests/ignite/engine/test_deterministic.py::test_multinode_distrib_cpu with xfail, since we allow it to fail.

What are the problems with FAILED tests/ignite/metrics/test_loss.py::test_multinode_distrib_cpu ?
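Marking the flaky test as expected-to-fail could look like the following sketch (assuming pytest; the test body and reason string are illustrative, not the actual fix):

```python
import pytest

# Hypothetical sketch: mark the flaky multinode CPU test as expected to fail,
# so a failure does not break the suite while a pass is still reported (XPASS).
@pytest.mark.xfail(reason="known flaky in multinode CPU runs", strict=False)
def test_multinode_distrib_cpu():
    # ... the actual distributed assertions would go here ...
    raise AssertionError("simulated nondeterministic failure")
```

With strict=False (the default), an unexpected pass is reported as XPASS rather than failing the run.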

@fco-dv (Contributor, author) commented Feb 11, 2021

For FAILED tests/ignite/metrics/test_loss.py::test_multinode_distrib_cpu, it's a float precision error; here are the details:

_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
tests/ignite/metrics/test_loss.py:109: in _test_distrib_compute_on_criterion
    _test("cpu")
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 

>       assert_almost_equal(res, true_loss_value.item())
E       AssertionError: 
E       Arrays are not almost equal to 7 decimals
E        ACTUAL: 1.1080787181854248
E        DESIRED: 1.108078956604004

tests/ignite/metrics/test_loss.py:107: AssertionError

@vfdev-5 (Collaborator) commented Feb 11, 2021

> For FAILED tests/ignite/metrics/test_loss.py::test_multinode_distrib_cpu, it's a float precision error (details quoted above).

OK, it is a precision issue, let's add a tol option as done for XLA.
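A tolerance-based comparison along these lines would make the check robust. This is a sketch using the values from the failure above and the stdlib; the exact helper used in the actual fix may differ:

```python
import math

# Values from the failing assertion above: they differ only at the 7th decimal.
actual = 1.1080787181854248
desired = 1.108078956604004

# A 7-decimal absolute comparison (what assert_almost_equal does) fails:
strict_ok = abs(actual - desired) < 0.5e-7        # False

# With a relative tolerance of 1e-6, as proposed, the comparison passes:
loose_ok = math.isclose(actual, desired, rel_tol=1e-6)  # True
```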

@sdesrozis (Contributor) commented:

@fco-dv have you tackled the issue concerning test_common that we have been facing recently?

@fco-dv (Contributor, author) commented Feb 11, 2021

@sdesrozis not yet; my guess is that the process group is not destroyed at the end of the test_common training step, but I'm not quite sure how to address that...

@vfdev-5 (Collaborator) commented Feb 11, 2021

@fco-dv which issue is this about?

@fco-dv (Contributor, author) commented Feb 11, 2021

@vfdev-5 when running with GPU enabled, test_common PASSED but the others FAILED due to RuntimeError: trying to initialize the default process group twice!; when skipping this test, the others PASSED.

@vfdev-5 (Collaborator) commented Feb 11, 2021

> @vfdev-5 when running with gpu enabled, test_common PASSED but others FAILED due to RuntimeError: trying to initialize the default process group twice! , but when skipping this test, others PASSED.

Maybe we can try to do the same as here: https://github.com/pytorch/ignite/blob/master/tests/ignite/conftest.py#L104 ?

@fco-dv (Contributor, author) commented Feb 12, 2021

Managed to fix the GPU tests with:
fco-dv@697dfae#diff-939e5dec7f6dffa9b3cd5721a69781e2cbc81b052c5676d8a1a0667915f808f5
There is certainly a more elegant way to do this...
Test configuration:
2 | 1 | 1

=============================== warnings summary ===============================
tests/ignite/metrics/test_precision.py::test_multinode_distrib_gpu
tests/ignite/metrics/test_recall.py::test_multinode_distrib_gpu
  /workspace/ignite/metrics/precision.py:29: RuntimeWarning: Precision/Recall metrics do not work in distributed setting when average=False and is_multilabel=True. Results are not reduced across computing devices. Computed result corresponds to the local rank's (single process) result.
    RuntimeWarning,

-- Docs: https://docs.pytest.org/en/stable/warnings.html
================= 24 passed, 26 skipped, 2 warnings in 37.98s ==================

@fco-dv (Contributor, author) commented Feb 13, 2021

Seems ok now for:

Default conf: 2 | 4 | 0

=============================== warnings summary ===============================
tests/ignite/metrics/test_precision.py::test_multinode_distrib_cpu
tests/ignite/metrics/test_precision.py::test_multinode_distrib_cpu
tests/ignite/metrics/test_precision.py::test_multinode_distrib_cpu
tests/ignite/metrics/test_precision.py::test_multinode_distrib_cpu
tests/ignite/metrics/test_recall.py::test_multinode_distrib_cpu
tests/ignite/metrics/test_recall.py::test_multinode_distrib_cpu
tests/ignite/metrics/test_recall.py::test_multinode_distrib_cpu
tests/ignite/metrics/test_recall.py::test_multinode_distrib_cpu
  /workspace/ignite/metrics/precision.py:29: RuntimeWarning: Precision/Recall metrics do not work in distributed setting when average=False and is_multilabel=True. Results are not reduced across computing devices. Computed result corresponds to the local rank's (single process) result.
    RuntimeWarning,

-- Docs: https://docs.pytest.org/en/stable/warnings.html
====== 92 passed, 98 skipped, 4 xfailed, 8 warnings in 507.07s (0:08:27) =======

and with gpu: 2 | 1 | 1

=============================== warnings summary ===============================
tests/ignite/metrics/test_precision.py::test_multinode_distrib_gpu
tests/ignite/metrics/test_recall.py::test_multinode_distrib_gpu
  /workspace/ignite/metrics/precision.py:29: RuntimeWarning: Precision/Recall metrics do not work in distributed setting when average=False and is_multilabel=True. Results are not reduced across computing devices. Computed result corresponds to the local rank's (single process) result.
    RuntimeWarning,

-- Docs: https://docs.pytest.org/en/stable/warnings.html
============ 24 passed, 26 skipped, 2 warnings in 77.77s (0:01:17) =============

@vfdev-5 for the CI integration, would you like me to create another PR or continue on this one? Thanks!

@vfdev-5 (Collaborator) commented Feb 13, 2021

@fco-dv Thanks! Let's merge it as is; for Circle CI, I'll enable it on PRs and we'll integrate it in another PR.

(Review comment on tests/ignite/metrics/test_loss.py: outdated, resolved)
@vfdev-5 (Collaborator) left a comment


Thanks for the PR @fco-dv !

@vfdev-5 merged commit 1588081 into pytorch:master on Feb 14, 2021
@fco-dv deleted the fix_multinode_tests_script branch on February 14, 2021, 20:36
fco-dv added a commit to fco-dv/ignite that referenced this pull request Feb 15, 2021
* fix run_multinode_tests_in_docker.sh : run tests with docker python version

* add missing modules

* build an image with test env and add 'nnodes' 'nproc_per_node' 'gpu' as parameters

* pytorch#1615 : change nproc_per_node default to 4

* pytorch#1615 : fix for gpu enabled tests / container rm step at the end of the script

* add xfail decorator for tests/ignite/engine/test_deterministic.py::test_multinode_distrib_cpu

* fix script gpu_options

* add default tol=1e-6 for _test_distrib_compute_on_criterion

* fix for "RuntimeError: trying to initialize the default process group twice!"

* tolerance for test_multinode_distrib_cpu case only

* fix assert None error

* autopep8 fix

Co-authored-by: vfdev <[email protected]>
Co-authored-by: Sylvain Desroziers <[email protected]>
Co-authored-by: fco-dv <[email protected]>
vfdev-5 added a commit that referenced this pull request Feb 21, 2021
* Recall/Precision metrics for ddp : average == false and multilabel == true

* For v0.4.3 - Add more versionadded, versionchanged tags - Change v0.5… (#1612)

* For v0.4.3 - Add more versionadded, versionchanged tags - Change v0.5.0 to v0.4.3

* Update ignite/contrib/metrics/regression/canberra_metric.py

Co-authored-by: vfdev <[email protected]>

* Update ignite/contrib/metrics/regression/manhattan_distance.py

Co-authored-by: vfdev <[email protected]>

* Update ignite/contrib/metrics/regression/r2_score.py

Co-authored-by: vfdev <[email protected]>

* Update ignite/handlers/checkpoint.py

Co-authored-by: vfdev <[email protected]>

* address PR comments

Co-authored-by: vfdev <[email protected]>

* added TimeLimit handler with its test and doc (#1611)

* added TimeLimit handler with its test and doc

* fixed documentation

* fixed docstring and formatting

* flake8 fix trailing whitespace :)

* modified class logger , default value and tests

* changed rounding to nearest integer

* tests refactored , docs modified

* fixed default value , removed global logger

* fixing formatting

* Added versionadded

* added test for engine termination

Co-authored-by: vfdev <[email protected]>

* Update handlers to use setup_logger (#1617)

* Fixes #1614
- Updated handlers EarlyStopping and TerminateOnNan
- Replaced `logging.getLogger` with `setup_logger` in the mentioned handlers

* Updated `TimeLimit` handler.
Replaced use of `logger.getLogger` with `setup_logger` from `ignite.utils`

Co-authored-by: Pradyumna Rahul K <[email protected]>
Co-authored-by: Sylvain Desroziers <[email protected]>

* Managing Deprecation using decorators (#1585)

* Starter code for managing deprecation

* Make functions deprecated using the `@deprecated` decorator
* Add arguments to the @deprecated decorator to customize it for each function

* Improve `@deprecated` decorator and add tests

* Replaced the `raise` keyword with added `warnings`
* Added tests several possibilities of the decorator usage

* Removing the test deprecation to check tests

* Add static typing, fix mypy errors

* Make `@deprecated` to raise Exceptions or Warning

* The `@deprecated` decorator will now always emit warning unless explicitly asked to raise an Exception

* Fix mypy errors

* Fix mypy errors (hopefully)

* Fix the test `test_deprecated_setup_any_logging`

* Change the test to work with the `@deprecated` decorator

* Change to snake_case, handle mypy ignores

* Improve Type Annotations

* Update common.py

* For v0.4.3 - Add more versionadded, versionchanged tags - Change v0.5… (#1612)

* For v0.4.3 - Add more versionadded, versionchanged tags - Change v0.5.0 to v0.4.3

* Update ignite/contrib/metrics/regression/canberra_metric.py

Co-authored-by: vfdev <[email protected]>

* Update ignite/contrib/metrics/regression/manhattan_distance.py

Co-authored-by: vfdev <[email protected]>

* Update ignite/contrib/metrics/regression/r2_score.py

Co-authored-by: vfdev <[email protected]>

* Update ignite/handlers/checkpoint.py

Co-authored-by: vfdev <[email protected]>

* address PR comments

Co-authored-by: vfdev <[email protected]>

* `version` -> version

Co-authored-by: vfdev <[email protected]>
Co-authored-by: François COKELAER <[email protected]>
Co-authored-by: Sylvain Desroziers <[email protected]>

* Create documentation.md

* Distributed tests on Windows should be skipped until fixed. (#1620)

* modified CONTRIBUTING.md

* bash instead of sh

* Added Checkpoint.get_default_score_fn (#1621)

* Added Checkpoint.get_default_score_fn to simplify best_model_handler creation

* Added score_sign argument

* Updated docs

* Update about.rst

* Update pre-commit hooks and CONTRIBUTING.md (#1622)

* Change pre-commit config and CONTRIBUTING.md

- Update hook versions
- Remove seed-isort-config
- Add black profile to isort

* Fix files based on new pre-commit config

* Add meaningful exclusions to prettier

- Also update actions workflow files to match local pre-commit

* added requirements.txt and updated readme.md (#1624)

* added requirements.txt and updated readme.md

* Update examples/contrib/cifar10/README.md

Co-authored-by: vfdev <[email protected]>

* Update examples/contrib/cifar10/requirements.txt

Co-authored-by: vfdev <[email protected]>

Co-authored-by: vfdev <[email protected]>

* Replace relative paths with raw.githubusercontent (#1629)

* Updated cifar10 example (#1632)

* Updates for cifar10 example

* Updates for cifar10 example

* More updates

* Updated code

* Fixed code-formatting

* Fixed failling CI and typos for cifar10 examples (#1633)

* Updates for cifar10 example

* Updates for cifar10 example

* More updates

* Updated code

* Fixed code-formatting

* Fixed typo and failing CI

* Fixed hvd spawn fail and better synced qat code

* Removed temporary hack to install pth 1.7.1 (#1638)

- updated default pth image for gpu tests
- updated TORCH_CUDA_ARCH_LIST
- fixed /merge -> /head in trigger ci pipeline

* [docker] Pillow -> Pillow-SIMD (#1509) (#1639)

* [docker] Pillow -> Pillow-SIMD (#1509)

* [docker] Pillow -> Pillow-SIMD

* replace pillow with pillow-simd in base docker files

* chore(docker): apt-get autoremove after pillow-simd installation

* apt-get install at once, autoremove g++

* install g++ in pillow installation layer

Co-authored-by: Sylvain Desroziers <[email protected]>

* Fix g++ install issue

Co-authored-by: Jeff Yang <[email protected]>
Co-authored-by: Sylvain Desroziers <[email protected]>

* Fix multinode tests script (#1631)

* fix run_multinode_tests_in_docker.sh : run tests with docker python version

* add missing modules

* build an image with test env and add 'nnodes' 'nproc_per_node' 'gpu' as parameters

* #1615 : change nproc_per_node default to 4

* #1615 : fix for gpu enabled tests / container rm step at the end of the script

* add xfail decorator for tests/ignite/engine/test_deterministic.py::test_multinode_distrib_cpu

* fix script gpu_options

* add default tol=1e-6 for _test_distrib_compute_on_criterion

* fix for "RuntimeError: trying to initialize the default process group twice!"

* tolerance for test_multinode_distrib_cpu case only

* fix assert None error

* autopep8 fix

Co-authored-by: vfdev <[email protected]>
Co-authored-by: Sylvain Desroziers <[email protected]>
Co-authored-by: fco-dv <[email protected]>

* remove warning for average=False and is_multilabel=True

* update docstring and {precision, recall} tests according to test_multilabel_input_NCHW

Co-authored-by: vfdev <[email protected]>
Co-authored-by: Ahmed Omar <[email protected]>
Co-authored-by: Pradyumna Rahul <[email protected]>
Co-authored-by: Pradyumna Rahul K <[email protected]>
Co-authored-by: Sylvain Desroziers <[email protected]>
Co-authored-by: Devanshu Shah <[email protected]>
Co-authored-by: Debojyoti Chakraborty <[email protected]>
Co-authored-by: Jeff Yang <[email protected]>
Co-authored-by: fco-dv <[email protected]>

Successfully merging this pull request may close these issues.

run_multinode_tests_in_docker.sh: FileNotFoundError: [Errno 2] No such file or directory: 'python3.6'