New error using the new update. #359
Should be fixed in the latest commit!
The issue still persists but disappears after downgrading webdataset. Really strange.
Still getting the issue as well. @jaoeded what version of webdataset did you use to fix it?
I guess we can just monkey patch the missing `__len__`.
Just a note: iterable datasets do not have a `__len__`. So I'm wondering whether the problem could be coming from the use of webdataset and deepspeed here being incorrect in some way?
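For illustration, a minimal sketch of that monkey patch, assuming the total sample count is known out of band (the shard pattern and `NUM_SAMPLES` are placeholders, not from this repo):

```python
import webdataset as wds

NUM_SAMPLES = 1_000_000  # assumed: dataset size known from elsewhere

dataset = wds.WebDataset("shards-{0000..0099}.tar")

# len() looks up __len__ on the type, not the instance, so patch the class
# of whatever pipeline object webdataset returns. Note this affects every
# instance of that class.
type(dataset).__len__ = lambda self: NUM_SAMPLES

assert len(dataset) == NUM_SAMPLES
```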
For example, this call to torch.utils.data.distributed.DistributedSampler seems suspicious and unrelated to deepspeed: https://github.com/lucidrains/DALLE-pytorch/blob/main/train_dalle.py#L392
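To make the suspicion concrete, a minimal reproduction sketch (not code from the repo) of what happens when an iterable dataset is handed to that sampler:

```python
from torch.utils.data import IterableDataset
from torch.utils.data.distributed import DistributedSampler

class Stream(IterableDataset):
    """A stand-in for webdataset's Processor: iterable, but no __len__."""
    def __iter__(self):
        yield from range(10)

# DistributedSampler calls len(dataset) in its constructor, so this raises
# TypeError: object of type 'Stream' has no len()
sampler = DistributedSampler(Stream(), num_replicas=2, rank=0)
```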
@rom1504 All I know is that maintaining a DeepSpeed-compatible codebase has been an utter nightmare since day one. Interop with DeepSpeed breaks something fairly frequently, so my motivation to fix these things "properly" is pretty much non-existent.
I agree that it likely has something to do with the data sampler, but I didn't want to just remove it, as I believe it is explicitly there to handle the multi-GPU scenario with DeepSpeed. @janEbert would love some background on this if you have the time.
I guess @robvanvolt might be interested in having a look at this code since it's related to his DeepSpeed issue.
Hey, I have an intuition about the issue here (something about the dataset returned by DeepSpeed colliding with WebDatasets) but need to see whether I can fix it tomorrow (I'm also not very up to date with the code base). Otherwise, after Wednesday is the earliest time. Sorry for the brevity and thanks for the ping!
I had a quick look at this; I'm not yet familiar with WebDatasets, so maybe you can answer this more easily. Sorry, I know you probably answered this already during all the testing and implementing.
The problem is that PyTorch's sampling strategy does not work with `IterableDataset`s. I tried this on the supercomputer and cannot get a single iteration from the resulting data loader.
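For reference, a sketch of the sampler-free pattern iterable datasets expect; this assumes a webdataset version that exports `split_by_node` and accepts a `nodesplitter` argument, and the shard pattern is a placeholder:

```python
import webdataset as wds
from torch.utils.data import DataLoader

# Each node keeps only its own shards, and each DataLoader worker keeps
# only its own subset, so no external (Distributed)Sampler is needed.
dataset = wds.WebDataset("shards-{0000..0099}.tar",
                         nodesplitter=wds.split_by_node)
loader = DataLoader(dataset, batch_size=32, sampler=None, num_workers=4)
```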
Wrapping causes errors due to PyTorch's `torch.utils.data.DistributedSampler` not being applicable to `torch.utils.data.IterableDataset`s (which WebDatasets implement). Fix lucidrains#359.
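A sketch of the general direction such a fix takes (the helper is hypothetical, not the actual patch): only build a sampler for map-style datasets and let iterable ones shard themselves.

```python
from torch.utils.data import DataLoader, IterableDataset
from torch.utils.data.distributed import DistributedSampler

def build_loader(dataset, batch_size, num_replicas, rank):
    # Hypothetical helper: an IterableDataset (e.g. a WebDataset) must not
    # be wrapped in a DistributedSampler, because it has no len().
    if isinstance(dataset, IterableDataset):
        sampler = None
    else:
        sampler = DistributedSampler(dataset, num_replicas=num_replicas,
                                     rank=rank)
    return DataLoader(dataset, batch_size=batch_size, sampler=sampler)
```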
Is this issue safe to close now?
One last note: feel free to try the PR yourselves. If it does not work, feel free to reopen; it worked for me.
```
[2021-09-13 11:39:11,114] [INFO] [logging.py:68:log_dist] [Rank 0] DeepSpeed info: version=0.5.1, git-hash=unknown, git-branch=unknown
[2021-09-13 11:39:11,216] [INFO] [logging.py:68:log_dist] [Rank 0] initializing deepspeed groups
[2021-09-13 11:39:11,216] [INFO] [logging.py:68:log_dist] [Rank 0] initializing deepspeed model parallel group with size 1
[2021-09-13 11:39:11,216] [INFO] [logging.py:68:log_dist] [Rank 0] initializing deepspeed expert parallel group with size 1
[2021-09-13 11:39:11,217] [INFO] [logging.py:68:log_dist] [Rank 0] creating expert data parallel process group with ranks: [0]
[2021-09-13 11:39:11,217] [INFO] [logging.py:68:log_dist] [Rank 0] creating expert parallel process group with ranks: [0]
[2021-09-13 11:39:11,240] [INFO] [engine.py:198:__init__] DeepSpeed Flops Profiler Enabled: False
Traceback (most recent call last):
  File "train_dalle.py", line 497, in <module>
    config_params=deepspeed_config,
  File "/home/valterjordan/DALLE-pytorch/dalle_pytorch/distributed_backends/distributed_backend.py", line 152, in distribute
    **kwargs,
  File "/home/valterjordan/DALLE-pytorch/dalle_pytorch/distributed_backends/deepspeed_backend.py", line 162, in _distribute
    **kwargs,
  File "/home/valterjordan/miniconda3/envs/dalle_env/lib/python3.7/site-packages/deepspeed/__init__.py", line 141, in initialize
    config_params=config_params)
  File "/home/valterjordan/miniconda3/envs/dalle_env/lib/python3.7/site-packages/deepspeed/runtime/engine.py", line 204, in __init__
    self.training_dataloader = self.deepspeed_io(training_data)
  File "/home/valterjordan/miniconda3/envs/dalle_env/lib/python3.7/site-packages/deepspeed/runtime/engine.py", line 1188, in deepspeed_io
    data_parallel_rank=data_parallel_rank)
  File "/home/valterjordan/miniconda3/envs/dalle_env/lib/python3.7/site-packages/deepspeed/runtime/dataloader.py", line 52, in __init__
    rank=data_parallel_rank)
  File "/home/valterjordan/miniconda3/envs/dalle_env/lib/python3.7/site-packages/torch/utils/data/distributed.py", line 87, in __init__
    self.num_samples = math.ceil(len(self.dataset) / self.num_replicas)  # type: ignore
TypeError: object of type 'Processor' has no len()
```