
Added distributed tests with Horovod and XLA for early_stopping #2165

Merged
12 commits merged into pytorch:master from Ishan-Kumar2:hvd_tests on Aug 19, 2021

Conversation

Ishan-Kumar2
Contributor

@Ishan-Kumar2 commented Aug 16, 2021

Addresses #2101

Description:
Added distributed tests with Horovod and XLA for early_stopping.

Check list:

  • New tests are added (if a new feature is added)
  • New doc strings: description and/or example code are in RST format
  • Documentation is updated (if required)

@sdesrozis
Contributor

@Ishan-Kumar2 Thank you very much for this PR!!

However, it seems that it does not work perfectly yet. Feel free to ask for help if needed.

@Ishan-Kumar2
Contributor Author

Hi @sdesrozis, the hvd tests are working now (here).
The xla tests are still failing because _test_distrib_integration_engine_early_stopping uses an Accuracy metric that is moved to the device. Should I change the device to cpu when it is xla for this test?

@sdesrozis
Contributor

sdesrozis commented Aug 17, 2021

@Ishan-Kumar2 It seems there are still some code formatting errors: https://github.com/pytorch/ignite/pull/2165/checks?check_run_id=3346963218#step:8:46

Anyway, you are right, some metrics don't work on TPU for the moment. You have to skip the tests that use metrics under xla. To do so, please see:

if device.type != "xla":
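
For illustration, such a guard could look roughly like this (a minimal sketch assuming ignite.distributed provides the device; the helper name and the engine here are hypothetical, not the PR's actual test code):

import ignite.distributed as idist
from ignite.engine import Engine
from ignite.metrics import Accuracy

def _check_with_optional_metric(device):
    # hypothetical helper: attach the metric only where it is supported
    evaluator = Engine(lambda engine, batch: batch)
    if device.type != "xla":
        # Accuracy does not work on TPU for the moment, so skip it there
        Accuracy(device=device).attach(evaluator, "accuracy")

_check_with_optional_metric(idist.device())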

@Ishan-Kumar2
Contributor Author

@sdesrozis I have fixed it; both the hvd and xla tests are passing now.
Let me know if they are correct.
I noticed that in the metrics tests the inputs are sliced by the rank of the device, so that each device tests on a different set of values (like here). Is the same approach applicable here?
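
For reference, the slicing I mean looks roughly like this (a sketch assuming ignite.distributed for the rank and world size; the tensor and sizes are made up):

import torch
import ignite.distributed as idist

n_per_rank = 50
# the full data is identical on every process
y = torch.randint(0, 2, size=(n_per_rank * idist.get_world_size(),)).long()
rank = idist.get_rank()
# each process then tests on its own slice of the values
y_rank = y[rank * n_per_rank : (rank + 1) * n_per_rank]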

@vfdev-5
Collaborator

vfdev-5 commented Aug 17, 2021

I noticed that in the metrics tests the inputs are sliced by the rank of the device, so that each device tests on a different set of values (like here)

@Ishan-Kumar2 Yes, we do that for metrics to explicitly check that they are computed correctly across devices. In the case of early stopping, all processes should have the same data and should technically all be required to stop if the metric is not improving... So no need to make the data rank-dependent, I'd say.
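
As a minimal sketch of the idea (assuming ignite's EarlyStopping handler; the scores and patience here are made up), when every rank sees the same non-improving score, all ranks terminate at the same epoch:

from ignite.engine import Engine, Events
from ignite.handlers import EarlyStopping

trainer = Engine(lambda engine, batch: None)
scores = iter([0.5, 0.4, 0.3, 0.2])  # same sequence on every process

handler = EarlyStopping(patience=2, score_function=lambda engine: next(scores), trainer=trainer)
trainer.add_event_handler(Events.EPOCH_COMPLETED, handler)
trainer.run([0], max_epochs=4)
assert trainer.state.epoch == 3  # stopped early, on the same epoch for all ranks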

@Ishan-Kumar2
Contributor Author

Ishan-Kumar2 commented Aug 18, 2021

In the case of early stopping, all processes should have the same data and should technically all be required to stop if the metric is not improving

Yup, that makes sense.

As for the CI checks, I am not sure why some are failing (test_deterministic); they seem unrelated to this PR.

@Ishan-Kumar2
Contributor Author

Ishan-Kumar2 commented Aug 18, 2021

@vfdev-5
On a slightly different note, I noticed that many tests follow the pattern of defining a helper function (_test) inside the test plus a loop that calls it for different test cases, for example here.
Do you think it would be better to remove the loop and instead pass the various test values as pytest parameters (pytest.mark.parametrize)?

For test_binary_and_multilabel_inputs from the link above, this would look like:

@pytest.mark.parametrize(
    "y_pred, y, batch_size",
    [
        (torch.randint(0, 2, size=(50,)).long(), torch.randint(0, 2, size=(50,)).long(), 1),
        ...
    ],
)
def test_binary_and_multilabel_inputs(y_pred, y, batch_size):
    ...
    assert isinstance(res, float)
    assert average_precision_score(np_y, np_y_pred) == pytest.approx(res)

@sdesrozis
Contributor

I think a good way is to mimic what is done in the other metrics' tests 😉

@sdesrozis
Contributor

As for the CI checks, I am not sure why some are failing (test_deterministic); they seem unrelated to this PR.

@Ishan-Kumar2 It seems we have some trouble with the PyTorch nightly builds... We will investigate.

@vfdev-5
Collaborator

@vfdev-5 left a comment


LGTM, thanks @Ishan-Kumar2
Let's merge this PR as the failure is unrelated.

@vfdev-5 merged commit 5d4f869 into pytorch:master on Aug 19, 2021
@Ishan-Kumar2 deleted the hvd_tests branch on August 20, 2021