New error using the new update. #359

Closed · jaoeded opened this issue Sep 13, 2021 · 18 comments · Fixed by #367

Comments

@jaoeded

jaoeded commented Sep 13, 2021

[2021-09-13 11:39:11,114] [INFO] [logging.py:68:log_dist] [Rank 0] DeepSpeed info: version=0.5.1, git-hash=unknown, git-branch=unknown
[2021-09-13 11:39:11,216] [INFO] [logging.py:68:log_dist] [Rank 0] initializing deepspeed groups
[2021-09-13 11:39:11,216] [INFO] [logging.py:68:log_dist] [Rank 0] initializing deepspeed model parallel group with size 1
[2021-09-13 11:39:11,216] [INFO] [logging.py:68:log_dist] [Rank 0] initializing deepspeed expert parallel group with size 1
[2021-09-13 11:39:11,217] [INFO] [logging.py:68:log_dist] [Rank 0] creating expert data parallel process group with ranks: [0]
[2021-09-13 11:39:11,217] [INFO] [logging.py:68:log_dist] [Rank 0] creating expert parallel process group with ranks: [0]
[2021-09-13 11:39:11,240] [INFO] [engine.py:198:init] DeepSpeed Flops Profiler Enabled: False
Traceback (most recent call last):
  File "train_dalle.py", line 497, in <module>
    config_params=deepspeed_config,
  File "/home/valterjordan/DALLE-pytorch/dalle_pytorch/distributed_backends/distributed_backend.py", line 152, in distribute
    **kwargs,
  File "/home/valterjordan/DALLE-pytorch/dalle_pytorch/distributed_backends/deepspeed_backend.py", line 162, in _distribute
    **kwargs,
  File "/home/valterjordan/miniconda3/envs/dalle_env/lib/python3.7/site-packages/deepspeed/__init__.py", line 141, in initialize
    config_params=config_params)
  File "/home/valterjordan/miniconda3/envs/dalle_env/lib/python3.7/site-packages/deepspeed/runtime/engine.py", line 204, in __init__
    self.training_dataloader = self.deepspeed_io(training_data)
  File "/home/valterjordan/miniconda3/envs/dalle_env/lib/python3.7/site-packages/deepspeed/runtime/engine.py", line 1188, in deepspeed_io
    data_parallel_rank=data_parallel_rank)
  File "/home/valterjordan/miniconda3/envs/dalle_env/lib/python3.7/site-packages/deepspeed/runtime/dataloader.py", line 52, in __init__
    rank=data_parallel_rank)
  File "/home/valterjordan/miniconda3/envs/dalle_env/lib/python3.7/site-packages/torch/utils/data/distributed.py", line 87, in __init__
    self.num_samples = math.ceil(len(self.dataset) / self.num_replicas)  # type: ignore
TypeError: object of type 'Processor' has no len()

@lucidrains
Owner

should be fixed in the latest commit!

@jaoeded
Author

jaoeded commented Sep 13, 2021

The issue still persists but disappears when downgrading webdataset, which is really strange.

@afiaka87
Contributor

Still getting the issue as well. @jaoeded what version of webdataset did you use to fix it?

@afiaka87
Contributor

afiaka87 commented Sep 25, 2021

I guess we can just monkey patch the __len__ dunder method...
#366
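
A minimal sketch of the monkey-patch idea (not the exact change in #366; the shard pattern and DATASET_SIZE are made-up placeholders). Since len() looks up __len__ on the type, the patch has to go on the dataset's class rather than the instance:

```python
import webdataset as wds

DATASET_SIZE = 1_000_000  # hypothetical, user-supplied sample count
dataset = wds.WebDataset("shards-{000000..000099}.tar")  # placeholder shards

# Patch the class so code like DistributedSampler can call len(dataset).
type(dataset).__len__ = lambda self: DATASET_SIZE

assert len(dataset) == DATASET_SIZE
```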

@afiaka87
Contributor

microsoft/DeepSpeed#1371

@rom1504
Contributor

rom1504 commented Sep 25, 2021

just a note:
deepspeed supports IterableDataset https://github.com/microsoft/DeepSpeed/blob/86dd6a6484a4c3aa8a04fc7e7e6c67652b09dad5/deepspeed/runtime/engine.py#L1141
webdataset exposes an IterableDataset https://github.com/webdataset/webdataset

Iterable datasets do not have a __len__ method, only __iter__.

So I'm wondering whether the way webdataset and DeepSpeed are used together here is incorrect in some way.
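
To make the mismatch concrete, here is a minimal, self-contained illustration (not project code; TinyIterable is a made-up stand-in for WebDataset's pipeline) of why DistributedSampler fails on any IterableDataset:

```python
from torch.utils.data import IterableDataset
from torch.utils.data.distributed import DistributedSampler

class TinyIterable(IterableDataset):
    """A stand-in for WebDataset's pipeline: it defines only __iter__."""
    def __iter__(self):
        yield from range(10)

ds = TinyIterable()
try:
    # DistributedSampler computes num_samples from len(dataset) in __init__,
    # which is exactly the call that fails in the traceback above.
    DistributedSampler(ds, num_replicas=1, rank=0)
except TypeError as err:
    print(err)  # -> object of type 'TinyIterable' has no len()
```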

@rom1504
Contributor

rom1504 commented Sep 25, 2021

For example, this call https://github.com/lucidrains/DALLE-pytorch/blob/main/train_dalle.py#L392 to torch.utils.data.distributed.DistributedSampler seems suspicious and unrelated to DeepSpeed.

@afiaka87
Contributor

@rom1504 All I know is that maintaining a DeepSpeed-compatible codebase has been an utter nightmare since day one. Interop with DeepSpeed breaks something fairly frequently, so my motivation to fix these things "properly" is pretty much non-existent.

@afiaka87
Contributor

afiaka87 commented Sep 25, 2021

I agree that it likely has something to do with the data sampler, but I didn't want to just remove it, since I believe it's there explicitly to handle the multi-GPU scenario with DeepSpeed. @janEbert, I would love some background on this if you have the time.

@rom1504
Contributor

rom1504 commented Sep 25, 2021

I guess @robvanvolt might be interested in having a look at this code, since it's related to his DeepSpeed issue.

@janEbert
Contributor

janEbert commented Sep 25, 2021

Hey, the DistributedSampler is indeed unrelated to DeepSpeed; it's actually for Horovod. I should have documented this, sorry about that.

I have an intuition about the issue here (something about the dataset returned by DeepSpeed colliding with WebDatasets), but I need to see whether I can fix it tomorrow (I'm also not very up-to-date with the code base). Otherwise, after Wednesday is the earliest I can get to it.

Sorry for the brevity and thanks for the ping!

@janEbert
Contributor

I had a quick look at this; I'm not yet familiar with WebDatasets, so maybe you can answer this more easily.
Why is it important to use a wds.WebLoader? Can't we pass the wds.WebDataset to distr_backend.distribute and let DeepSpeed handle data loading with its distr_dl (removing this if-branch accordingly)?

Sorry, I know you probably answered this already during all the testing and implementing.
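
For readers unfamiliar with the library, this is roughly what a WebDataset + WebLoader pipeline looks like (a hedged sketch; the shard pattern, keys, and batch size are invented, not taken from train_dalle.py):

```python
import webdataset as wds

# WebDataset is an IterableDataset built from tar shards (placeholder pattern).
dataset = (
    wds.WebDataset("shards-{000000..000099}.tar")
    .decode("rgb")            # decode image bytes to arrays
    .to_tuple("jpg", "txt")   # yield (image, caption) pairs
)

# WebLoader wraps torch's DataLoader but keeps the iterable-style API,
# so the loader, like the dataset, has no __len__ either.
loader = wds.WebLoader(dataset, batch_size=8, num_workers=4)
```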

@janEbert
Contributor

janEbert commented Sep 26, 2021

The problem is that PyTorch's sampling strategy does not work with IterableDatasets; see the open issue here: pytorch/pytorch#28743. DeepSpeed tries to apply a DistributedSampler when it is passed a dataset during initialization, and WebDatasets are IterableDatasets.
So to fix this, the only change we need is to pass None here when ENABLE_WEBDATASET is True, forgoing the (anyway redundant) DeepSpeed sampling wrapper that causes the error; see the sketch below.

I tried this on the supercomputer but cannot get a single iteration from the WebLoader, so could someone else please test it as well.
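
A rough sketch of that change against deepspeed.initialize (an illustrative fragment: args, model, dataset, deepspeed_config, and ENABLE_WEBDATASET are assumed to exist as in the training script, and the surrounding code is not reproduced):

```python
import deepspeed

model_engine, optimizer, distr_dl, scheduler = deepspeed.initialize(
    args=args,
    model=model,
    model_parameters=model.parameters(),
    # When the dataset is a WebDataset (an IterableDataset), pass None so that
    # deepspeed_io() never wraps it in a DistributedSampler; batching is done
    # by the separately constructed WebLoader instead.
    training_data=None if ENABLE_WEBDATASET else dataset,
    config_params=deepspeed_config,
)
```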

janEbert added a commit to janEbert/DALLE-pytorch that referenced this issue Sep 26, 2021
Causes errors due to PyTorch's `torch.utils.data.distributed.DistributedSampler`
not being applicable to `torch.utils.data.IterableDataset`s (which WebDatasets
implement).

Fix lucidrains#359.
janEbert added a commit to janEbert/DALLE-pytorch that referenced this issue Sep 26, 2021
Wrapping causes errors due to PyTorch's
`torch.utils.data.distributed.DistributedSampler` not being applicable to
`torch.utils.data.IterableDataset`s (which WebDatasets implement).

Fix lucidrains#359.
@jaoeded
Author

jaoeded commented Sep 26, 2021

Is this issue safe to close now?

@jaoeded
Author

jaoeded commented Sep 26, 2021

@janEbert @afiaka87 @rom1504 I'm closing this issue; janEbert's PR seems to have fixed it. Reopen if needed.

@jaoeded jaoeded closed this as completed Sep 26, 2021
@jaoeded
Author

jaoeded commented Sep 26, 2021

One last note: feel free to try the PR yourselves. It worked for me, but if it does not work for you, feel free to reopen.

lucidrains pushed a commit that referenced this issue Sep 26, 2021
Wrapping causes errors due to PyTorch's
`torch.utils.data.distributed.DistributedSampler` not being applicable to
`torch.utils.data.IterableDataset`s (which WebDatasets implement).

Fix #359.