
[Train] Fix Deepspeed device ranks check in Lightning 2.0.5 #37387

Merged

Conversation

@woshiyyya (Member) commented Jul 13, 2023

Why are these changes needed?

The LightningTrainer DeepSpeed CI test failed after the Lightning upgrade from 2.0.4 to 2.0.5, which introduced a check on device ranks in DeepSpeedStrategy (link). This PR addresses the incompatibility and fixes the test.
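For reference, the new check (described more precisely in the review thread below) rejects a parallel_devices list whose device indices are not exactly 0..n-1. A minimal sketch of that kind of validation, using illustrative names rather than Lightning's actual internals:

```python
# Hedged sketch: approximates the device-rank check added around Lightning
# 2.0.5 for DeepSpeedStrategy. The function name and error message are
# illustrative, not copied from Lightning.
from typing import List

import torch


def _validate_device_indices(parallel_devices: List[torch.device]) -> None:
    selected = [device.index for device in parallel_devices]
    expected = list(range(len(parallel_devices)))
    if selected != expected:
        raise RuntimeError(
            f"DeepSpeed expects device indices {expected}, but got {selected}. "
            "Select specific GPUs via CUDA_VISIBLE_DEVICES instead."
        )
```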

Related issue number

Fix #37374

Checks

  • I've signed off every commit (by using the -s flag, i.e., git commit -s) in this PR.
  • I've run scripts/format.sh to lint the changes in this PR.
  • I've included any doc changes needed for https://docs.ray.io/en/master/.
    • I've added any new APIs to the API Reference. For example, if I added a
      method in Tune, I've added it in doc/source/tune/api/ under the
      corresponding .rst file.
  • I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/
  • Testing Strategy
    • Unit tests
    • Release tests
    • This PR is not tested :(

Signed-off-by: woshiyyya <[email protected]>
@woshiyyya woshiyyya marked this pull request as ready for review July 13, 2023 15:38
@woshiyyya woshiyyya added the tests-ok (The tagger certifies test failures are unrelated and assumes personal liability), release-blocker, and P0 (Issue that blocks the release) labels Jul 13, 2023
```diff
 # Device ranks have already been specified in RayEnvironment
-# Clear parallel_devices to skip deepspeed local rank checks
-self.parallel_devices = []
+self.parallel_devices = list(range(torch.cuda.device_count()))
```
Contributor commented:

Would something like this make sense? I'm not sure the device indices always start from 0.

Suggested change:

```diff
-self.parallel_devices = list(range(torch.cuda.device_count()))
+devices = train.torch.get_device()
+if not isinstance(devices, list):
+    devices = [devices]
+self.parallel_devices = [d.index for d in devices if d.index is not None]
```

@woshiyyya (Member Author) replied Jul 14, 2023

The parallel_devices here needs to be set to all of the CUDA device ids on the current node, but train.torch.get_device() only returns the device for the current worker.

For example, if we have two workers, each with 1 GPU:

  • worker_0: CUDA_VISIBLE_DEVICES=2,3, train.torch.get_device()=torch.device(cuda:0)
  • worker_1: CUDA_VISIBLE_DEVICES=2,3, train.torch.get_device()=torch.device(cuda:1)

parallel_devices here should be [torch.device(cuda:0), torch.device(cuda:1)]

Lightning added a special check for DeepSpeed, which requires the indices of parallel_devices to equal list(range(len(parallel_devices))).
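To make this concrete, here is a small self-contained sketch (my own illustration, not code from this PR; the two-GPU numbers come from the example above) of why a per-node list of local indices always satisfies that check:

```python
# Illustration only: mirrors the two-worker example above.
# With CUDA_VISIBLE_DEVICES=2,3 each worker process sees two GPUs,
# re-indexed locally as cuda:0 and cuda:1, so torch.cuda.device_count() == 2.
local_gpu_count = 2  # stand-in for torch.cuda.device_count()

# What this PR sets: all local CUDA indices on the node.
parallel_devices = list(range(local_gpu_count))  # [0, 1]

# The check described above: indices must equal list(range(len(parallel_devices))).
assert parallel_devices == list(range(len(parallel_devices)))
```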

@matthewdeng matthewdeng merged commit c4b21a9 into ray-project:master Jul 14, 2023
2 checks passed
NripeshN pushed a commit to NripeshN/ray that referenced this pull request Aug 15, 2023
harborn pushed a commit to harborn/ray that referenced this pull request Aug 17, 2023
arvind-chandra pushed a commit to lmco/ray that referenced this pull request Aug 31, 2023
vymao pushed a commit to vymao/ray that referenced this pull request Oct 11, 2023
Successfully merging this pull request may close these issues.

[Train] test_lightning_deepspeed is failing