
Added distributed tests with Horovod and XLA for early_stopping #2165

Merged
12 commits merged into pytorch:master from Ishan-Kumar2:hvd_tests on Aug 19, 2021

Conversation

Ishan-Kumar2
Contributor

@Ishan-Kumar2 commented Aug 16, 2021

Addresses #2101

Description:
Added distributed tests with Horovod and XLA for early_stopping.

Check list:

  • New tests are added (if a new feature is added)
  • New doc strings: description and/or example code are in RST format
  • Documentation is updated (if required)

@sdesrozis
Contributor

@Ishan-Kumar2 Thank you very much for this PR!!

However, it seems that it does not work perfectly yet. Feel free to ask for help if needed.

@Ishan-Kumar2
Contributor Author

Hi @sdesrozis, the hvd tests are working now (here).
The xla tests are still failing because _test_distrib_integration_engine_early_stopping uses an Accuracy metric that is moved to the device. Should I change the device to cpu when it is xla for this test?

@sdesrozis
Contributor

sdesrozis commented Aug 17, 2021

@Ishan-Kumar2 It seems there are still some code formatting errors: https://github.com/pytorch/ignite/pull/2165/checks?check_run_id=3346963218#step:8:46

Anyway, you are right, some metrics don't work on TPU for the moment. You have to skip the tests that use metrics under xla. To do so, please see:

if device.type != "xla":
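
For illustration, such a guard could look roughly like this (a minimal sketch assuming ignite.distributed provides the device; the helper name and the engine here are hypothetical, not the PR's actual test code):

import ignite.distributed as idist
from ignite.engine import Engine
from ignite.metrics import Accuracy

def _check_with_optional_metric(device):
    # hypothetical helper: attach the metric only where it is supported
    evaluator = Engine(lambda engine, batch: batch)
    if device.type != "xla":
        # Accuracy does not work on TPU for the moment, so skip it there
        Accuracy(device=device).attach(evaluator, "accuracy")

_check_with_optional_metric(idist.device())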

@Ishan-Kumar2
Contributor Author

@sdesrozis I have fixed it; both the hvd and xla tests are passing now.
Let me know if they are correct.
I noticed that in the metrics tests the inputs are sliced by the rank of the device, so that each device tests on a different set of values (like here). Is the same approach applicable here?
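
For reference, the slicing I mean looks roughly like this (a sketch assuming ignite.distributed for the rank and world size; the tensor and sizes are made up):

import torch
import ignite.distributed as idist

n_per_rank = 50
# the full data is identical on every process
y = torch.randint(0, 2, size=(n_per_rank * idist.get_world_size(),)).long()
rank = idist.get_rank()
# each process then tests on its own slice of the values
y_rank = y[rank * n_per_rank : (rank + 1) * n_per_rank]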

@vfdev-5
Collaborator

vfdev-5 commented Aug 17, 2021

I noticed that in the metrics tests the inputs are sliced by the rank of the device, so that each device tests on a different set of values (like here)

@Ishan-Kumar2 Yes, we do that for metrics to explicitly check that they are computed correctly across devices. In the case of early stopping, all processes should have the same data and should technically all be required to stop if the metric is not improving... So no need to make the data rank-dependent, I'd say.
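
As a minimal sketch of the idea (assuming ignite's EarlyStopping handler; the scores and patience here are made up), when every rank sees the same non-improving score, all ranks terminate at the same epoch:

from ignite.engine import Engine, Events
from ignite.handlers import EarlyStopping

trainer = Engine(lambda engine, batch: None)
scores = iter([0.5, 0.4, 0.3, 0.2])  # same sequence on every process

handler = EarlyStopping(patience=2, score_function=lambda engine: next(scores), trainer=trainer)
trainer.add_event_handler(Events.EPOCH_COMPLETED, handler)
trainer.run([0], max_epochs=4)
assert trainer.state.epoch == 3  # stopped early, on the same epoch for all ranks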

@Ishan-Kumar2
Contributor Author

Ishan-Kumar2 commented Aug 18, 2021

In the case of early stopping, all processes should have the same data and should technically all be required to stop if the metric is not improving

Yup, that makes sense.

As for the CI checks, I am not sure why some are failing (test_deterministic); they seem unrelated to this PR.

@Ishan-Kumar2
Contributor Author

Ishan-Kumar2 commented Aug 18, 2021

@vfdev-5
On a slightly different note, I noticed that many tests follow the pattern of defining a helper function (_test) inside the test plus a loop that calls it for different test cases, for example here.
Do you think it would be better to remove the loop and instead pass the various test values as pytest parameters (pytest.mark.parametrize)?

For test_binary_and_multilabel_inputs from the link above, this would look like:

@pytest.mark.parametrize(
    "y_pred, y, batch_size",
    [
        (torch.randint(0, 2, size=(50,)).long(), torch.randint(0, 2, size=(50,)).long(), 1),
        ...
    ],
)
def test_binary_and_multilabel_inputs(y_pred, y, batch_size):
    ...
    assert isinstance(res, float)
    assert average_precision_score(np_y, np_y_pred) == pytest.approx(res)

@sdesrozis
Contributor

I think a good way is to mimic what is done in the other metrics' tests 😉

@sdesrozis
Contributor

As for the CI checks, I am not sure why some are failing (test_deterministic); they seem unrelated to this PR.

@Ishan-Kumar2 It seems we have some trouble with the PyTorch nightly builds... We will investigate.

@vfdev-5
Collaborator

@vfdev-5 left a comment


LGTM, thanks @Ishan-Kumar2
Let's merge this PR as the failure is unrelated.

@vfdev-5 merged commit 5d4f869 into pytorch:master on Aug 19, 2021
@Ishan-Kumar2 deleted the hvd_tests branch on August 20, 2021