
[REQUEST] Remove dataset length requirements #1371

Closed
robvanvolt opened this issue Sep 15, 2021 · 3 comments
Labels
enhancement New feature or request

Comments

@robvanvolt

robvanvolt commented Sep 15, 2021

I'm always frustrated when I use iterable datasets without a length attribute in DeepSpeed, because they are not officially supported. Unfortunately, the length attribute was removed from the WebDataset format (https://github.com/webdataset/webdataset). This is a highly desirable format because it lets you stream data from regular .tar or .tar.gz files without having to download the complete dataset.

I would like to add a dataset-length-agnostic method to DeepSpeed so that iterable datasets like WebDataset work "out of the box", especially in distributed environments.

I previously used a custom wrapper class to assign an arbitrarily high number to the length attribute, but this is more of a "quick and dirty" solution, and it only worked reliably in non-distributed setups.

If you could provide me with some resources or help in implementing a dataset-size-agnostic version of DeepSpeed, I would really appreciate it!

@robvanvolt robvanvolt added the enhancement New feature or request label Sep 15, 2021
@afiaka87

lucidrains/DALLE-pytorch#359

Indeed, this is preventing us from using DeepSpeed with WebDataset. Is there any way around this?

@tjruwase
Contributor

@robvanvolt, thanks for sharing this request. However, I don't fully understand what is required or what the current issue is with DeepSpeed. Can you provide more details or, even better, a toy example demonstrating the current limitation? Thanks.

@janEbert
Contributor

janEbert commented Sep 26, 2021

This would (should) need to be solved from the PyTorch side, I think, as DeepSpeed uses PyTorch's torch.utils.data.DistributedSampler which does not support IterableDatasets (see pytorch/pytorch#28743).

Otherwise, DeepSpeed could go the same way as WebDataset and implement its own sampler for IterableDatasets.
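As a rough illustration of what sharding an iterable stream without a known length could look like, here is a minimal sketch in plain Python. The helper `shard_stream` is hypothetical, and this is neither WebDataset's nor DeepSpeed's actual implementation; it only assumes that each process knows its `rank` and the `world_size`.

```python
from itertools import islice


def shard_stream(stream, rank, world_size):
    """Yield every world_size-th sample, starting at index `rank`."""
    # islice never needs len(stream), so it works on unbounded iterables.
    return islice(stream, rank, None, world_size)


# Simulate 4 ranks reading from the same 10-sample stream.
shards = [list(shard_stream(range(10), r, 4)) for r in range(4)]
```

Note that the ranks can end up with different numbers of samples (here 3, 3, 2, and 2), so a real solution would likely also need to handle ranks that run out of data early.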

@loadams loadams closed this as completed Aug 18, 2023