Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[Train] Fixed HorovodBackend to automatically detect network interfaces #19533

Merged
merged 8 commits into from
Oct 21, 2021

Conversation

tgaddair
Copy link
Contributor

@tgaddair tgaddair commented Oct 20, 2021

Why are these changes needed?

Fixes API incompatibilities in HorovodBackend to properly detect NICs when not provided by the user. Without these changes, Horovod in Ray Train will not work unless the user knows in advance which NICs they wish to use for training.

Related issue number

Fixes #19516.

Checks

  • I've run scripts/format.sh to lint the changes in this PR.
  • I've included any doc changes needed for https://docs.ray.io/en/master/.
  • I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/
  • Testing Strategy
    • Unit tests
    • Release tests
    • This PR is not tested :(

@tgaddair tgaddair marked this pull request as ready for review October 20, 2021 00:16
@amogkam amogkam removed their assignment Oct 20, 2021
Copy link
Contributor

@matthewdeng matthewdeng left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for making this fix!

Would you be able to share a little more details around the expected/observed behavior for the nics detection? It looks like this might have been caused by #18728 but we didn't catch it in the tests, so I'd love to be able to add a simple test to make sure we're able to catch any potential regressions in the future!

python/ray/train/backends/horovod.py Outdated Show resolved Hide resolved
Comment on lines +90 to +105
@dataclass
class HorovodWorkerWrapper:
w: Worker

@property
def execute(self):
w = self.w

class ExecuteHandle:
def remote(self, func, *args, **kwargs):
_ = None
return w.actor._BaseWorkerMixin__execute.remote(
func, _, *args, **kwargs)

return ExecuteHandle()

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm, wondering if we could do something simpler like:

@dataclass
class HorovodWorkerWrapper:
    w: Worker

    @property
    execute = self.w.actor._BaseWorkerMixin__execute

Copy link
Contributor Author

@tgaddair tgaddair Oct 21, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Unfortunately, not quite because Horovod's current API assumes that an additional parameter will be passed in to the remote function containing the class instance. This is why we have the _ = None line. I plan to make some changes to Horovod and potentially move more of this code into Ray Train to avoid this hack, but for now this is what we need to do to make it work.

@tgaddair
Copy link
Contributor Author

Hey @matthewdeng, thanks for the review. The issue here was the the HorovodConfig in Ray Train is being used in lieu of the Settings object in horovod.ray, however, prior to this change it did not have the necessary properties to meet the expected interface. When the nics param is None and the user is running on multiple nodes, Horovod will attempt to obtain the network interfaces from each of the workers, but without some of these properties, it will fail.

the secret property in particular was causing several runtime failures. Without this, the network interface detection logic will not be able to trust the messages coming from the workers.

Hope that makes sense, happy to provide more context as needed.

Copy link
Contributor

@matthewdeng matthewdeng left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@tgaddair gotcha, sounds like this was missed on our end because we weren't testing the multi-node case. @amogkam can you help merge this after the docs lint issue is fixed?

@amogkam
Copy link
Contributor

amogkam commented Oct 21, 2021

Failing tests are unrelated, going to merge.

@amogkam amogkam merged commit c6e2161 into ray-project:master Oct 21, 2021
@tgaddair tgaddair deleted the hvd-train branch October 21, 2021 16:39
matthewdeng added a commit that referenced this pull request Oct 21, 2021
…es (#19533)

* Moved Horovod into package

* Move in Ludwig fix

* Undo git mv

* Cleanup

* Cleanup

* flake8

* Update python/ray/train/backends/horovod.py

Co-authored-by: matthewdeng <[email protected]>

* Whitespace

Co-authored-by: matthewdeng <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

[Bug][Train] Network interface discovery broken for HorovodBackend
3 participants