
[Train] Fix Deepspeed device ranks check in Lightning 2.0.5 #37387

Merged

Conversation

@woshiyyya (Member) commented Jul 13, 2023

Why are these changes needed?

The LightningTrainer DeepSpeed CI test failed after the Lightning upgrade from 2.0.4 to 2.0.5, which introduced a check on device ranks in DeepSpeedStrategy (link). This PR addresses the incompatibility and fixes the test.
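For reference, the new check (described more precisely in the review thread below) rejects a parallel_devices list whose device indices are not exactly 0..n-1. A minimal sketch of that kind of validation, using illustrative names rather than Lightning's actual internals:

```python
# Hedged sketch: approximates the device-rank check added around Lightning
# 2.0.5 for DeepSpeedStrategy. The function name and error message are
# illustrative, not copied from Lightning.
from typing import List

import torch


def _validate_device_indices(parallel_devices: List[torch.device]) -> None:
    selected = [device.index for device in parallel_devices]
    expected = list(range(len(parallel_devices)))
    if selected != expected:
        raise RuntimeError(
            f"DeepSpeed expects device indices {expected}, but got {selected}. "
            "Select specific GPUs via CUDA_VISIBLE_DEVICES instead."
        )
```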

Related issue number

Fix #37374

Checks

  • I've signed off every commit (by using the -s flag, i.e., git commit -s) in this PR.
  • I've run scripts/format.sh to lint the changes in this PR.
  • I've included any doc changes needed for https://docs.ray.io/en/master/.
    • I've added any new APIs to the API Reference. For example, if I added a
      method in Tune, I've added it in doc/source/tune/api/ under the
      corresponding .rst file.
  • I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/
  • Testing Strategy
    • Unit tests
    • Release tests
    • This PR is not tested :(

Signed-off-by: woshiyyya <[email protected]>
@woshiyyya woshiyyya marked this pull request as ready for review July 13, 2023 15:38
@woshiyyya woshiyyya added the tests-ok (The tagger certifies test failures are unrelated and assumes personal liability), release-blocker, and P0 (Issue that blocks the release) labels Jul 13, 2023
```diff
 # Device ranks have already been specified in RayEnvironment
-# Clear parallel_devices to skip deepspeed local rank checks
-self.parallel_devices = []
+self.parallel_devices = list(range(torch.cuda.device_count()))
```
Contributor commented:

Would something like this make sense? I'm not sure the device indices always start from 0.

Suggested change:

```diff
-self.parallel_devices = list(range(torch.cuda.device_count()))
+devices = train.torch.get_device()
+if not isinstance(devices, list):
+    devices = [devices]
+self.parallel_devices = [d.index for d in devices if d.index is not None]
```

@woshiyyya (Member Author) replied Jul 14, 2023

The parallel_devices here needs to be set to all of the CUDA device ids on the current node, but train.torch.get_device() only returns the device for the current worker.

For example, if we have two workers, each with 1 GPU:

  • worker_0: CUDA_VISIBLE_DEVICES=2,3, train.torch.get_device()=torch.device(cuda:0)
  • worker_1: CUDA_VISIBLE_DEVICES=2,3, train.torch.get_device()=torch.device(cuda:1)

parallel_devices here should be [torch.device(cuda:0), torch.device(cuda:1)]

Lightning added a special check for DeepSpeed, which requires the indices of parallel_devices to equal list(range(len(parallel_devices))).
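To make this concrete, here is a small self-contained sketch (my own illustration, not code from this PR; the two-GPU numbers come from the example above) of why a per-node list of local indices always satisfies that check:

```python
# Illustration only: mirrors the two-worker example above.
# With CUDA_VISIBLE_DEVICES=2,3 each worker process sees two GPUs,
# re-indexed locally as cuda:0 and cuda:1, so torch.cuda.device_count() == 2.
local_gpu_count = 2  # stand-in for torch.cuda.device_count()

# What this PR sets: all local CUDA indices on the node.
parallel_devices = list(range(local_gpu_count))  # [0, 1]

# The check described above: indices must equal list(range(len(parallel_devices))).
assert parallel_devices == list(range(len(parallel_devices)))
```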

@matthewdeng matthewdeng merged commit c4b21a9 into ray-project:master Jul 14, 2023
2 checks passed
NripeshN pushed a commit to NripeshN/ray that referenced this pull request Aug 15, 2023
harborn pushed a commit to harborn/ray that referenced this pull request Aug 17, 2023
arvind-chandra pushed a commit to lmco/ray that referenced this pull request Aug 31, 2023
vymao pushed a commit to vymao/ray that referenced this pull request Oct 11, 2023
Successfully merging this pull request may close these issues.

[Train] test_lightning_deepspeed is failing