[REQUEST] Remove dataset length requirements #1371
Comments
Indeed; this is preventing us from using DeepSpeed with WebDataset. Is there any way around this?
@robvanvolt, thanks for sharing this request. However, I don't fully understand what is required or what the current issue is with DeepSpeed. Can you provide more details or, even better, a toy example demonstrating the current limitation? Thanks.
This would (should) need to be solved from the PyTorch side, I think, as DeepSpeed uses PyTorch's Otherwise, DeepSpeed could go the same way as WebDataset and implement its own sampler for
I'm always frustrated when I use iterable datasets without a length attribute in DeepSpeed, because they are not officially supported. Unfortunately, the length attribute was removed from the WebDataset format (https://github.com/webdataset/webdataset). This is a highly desirable format because it lets you stream data from regular .tar or .tar.gz files without having to download the complete dataset.
I would like to add a dataset-length-agnostic method to DeepSpeed, so that iterable datasets like WebDataset work "out of the box", especially in distributed environments.
I previously used a custom wrapper class to assign an arbitrary high number to the length attribute, but this is more of a "quick and dirty" solution to the problem, and it only worked reliably in non-distributed settings.
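For illustration, the wrapper workaround can be sketched roughly as follows. This is a minimal, framework-free sketch and not the exact class I used: `NominalLengthWrapper`, `ToyStream`, and the `nominal_length` parameter are hypothetical names made up for this example. To pass it to a PyTorch `DataLoader`, the wrapper would presumably also need to subclass `torch.utils.data.IterableDataset`, which is left out here so the snippet runs without PyTorch.

```python
class ToyStream:
    """Hypothetical stand-in for a streaming dataset that has no __len__."""

    def __init__(self, n_items):
        self.n_items = n_items

    def __iter__(self):
        # Yield samples one by one, as a streaming dataset would.
        return iter(range(self.n_items))


class NominalLengthWrapper:
    """Wraps a length-less iterable and reports an arbitrary nominal length.

    The reported length is NOT the true number of samples; it only
    satisfies code that insists on calling len() on the dataset.
    """

    def __init__(self, dataset, nominal_length=10**9):
        self.dataset = dataset
        self.nominal_length = nominal_length

    def __iter__(self):
        # Iteration is delegated unchanged to the wrapped dataset.
        return iter(self.dataset)

    def __len__(self):
        # Arbitrary value chosen by the caller (assumption, not a real size).
        return self.nominal_length


wrapped = NominalLengthWrapper(ToyStream(5), nominal_length=1000)
reported = len(wrapped)                 # the nominal length passed in above
actual = sum(1 for _ in wrapped)        # the true number of streamed samples
```

The mismatch between `reported` and `actual` is why this stays a workaround: anything derived from `len()` (steps per epoch, per-rank sample counts) is computed from a number unrelated to the real stream.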
If you could point me to some resources or help me implement a dataset-size-agnostic version of DeepSpeed, I would really appreciate it!