
[Feature Request] ShuffleDataset #45114

Closed
fmassa opened this issue Sep 22, 2020 · 1 comment
Assignees
Labels
module: dataloader Related to torch.utils.data.DataLoader and Sampler triaged This issue has been looked at by a team member, and triaged and prioritized into an appropriate module

Comments

@fmassa
Member

fmassa commented Sep 22, 2020

🚀 Feature

Provide a ShuffleDataset, an IterableDataset that can be composed with a user-defined dataset to perform random shuffling.

The simplest way to implement this would be to follow the shuffling strategy from TensorFlow: keep a buffer of dataset elements and, once the buffer is full, pop a random element from the buffer and replace it with the next element from the underlying dataset.

The user could then add shuffling to their own dataset as follows:

dataset = MyCustomIterableDataset(...)
dataset = ShuffleDataset(dataset, buffer_size=1000)
for data in dataset:
    # elements arrive in shuffled order relative to the underlying dataset
    ...
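The buffer strategy described above can be sketched independently of torch as a plain generator; buffered_shuffle and its signature are illustrative names, not an existing API. Note the shuffle is only approximate: an element can move at most roughly buffer_size positions ahead of where it started.

```python
import random

def buffered_shuffle(iterable, buffer_size, seed=None):
    """Yield items from `iterable` in approximately shuffled order.

    Fills a fixed-size buffer; once full, each incoming item evicts a
    randomly chosen buffered item, which is yielded. The leftover buffer
    is shuffled and drained at the end.
    """
    rng = random.Random(seed)  # picklable, unlike torch.Generator
    buffer = []
    for item in iterable:
        if len(buffer) < buffer_size:
            buffer.append(item)
        else:
            idx = rng.randrange(buffer_size)
            yield buffer[idx]
            buffer[idx] = item
    rng.shuffle(buffer)
    yield from buffer
```

A ShuffleDataset could wrap exactly this loop in its __iter__, delegating to the underlying dataset's iterator.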

We might want to expose a generator argument on the dataset, in order to control how the shuffling is seeded. Care must be taken here because torch.Generator is not picklable (although random.Random seems to be), so it wouldn't play well with our DataLoader implementation, which pickles the dataset to send it to multiple worker processes.
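As a quick check of the picklability point, random.Random round-trips through pickle with its full internal state intact, which is what would let a dataset holding one be shipped to DataLoader worker processes:

```python
import pickle
import random

rng = random.Random(42)
rng.random()  # advance the state once

# Pickling captures the generator's exact internal state.
clone = pickle.loads(pickle.dumps(rng))

# The clone resumes from the same state, so both draw the same next value.
assert clone.random() == rng.random()
```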

Motivation

IterableDataset in PyTorch is a family of Datasets that implement the __iter__ method, in contrast to map-style datasets, which implement __getitem__ and __len__.
All sampling strategies in PyTorch are focused on map-style datasets via the Sampler abstractions. Those abstractions do not work on IterableDatasets because they require knowing the __len__ of the dataset and assume that random access to the dataset is available.
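Stripped of the torch base classes, the two protocols reduce to the following (class names here are illustrative, not PyTorch API): a Sampler can permute indices for the first kind, but has nothing to index into for the second.

```python
class MapStyleDataset:
    """Random access: a Sampler can request any index in [0, len)."""
    def __init__(self, data):
        self.data = data

    def __getitem__(self, index):
        return self.data[index]

    def __len__(self):
        return len(self.data)


class StreamDataset:
    """Iteration only: no length and no random access, so Samplers
    cannot reorder it — shuffling must happen inside the stream."""
    def __init__(self, source):
        self.source = source

    def __iter__(self):
        return iter(self.source)
```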

As of now, users are required to implement dataset shuffling on their own, over and over. We should provide some basic shuffling strategies in PyTorch for this, given how widespread this operation is.

Pitch

Provide a ShuffleDataset class, inheriting from IterableDataset, which gives a basic random shuffling strategy based on a buffer_size.

Alternatives

There are currently no alternatives available natively in PyTorch.

Additional context

This has been requested / discussed multiple times in the past, see #28743 for one example.

cc @ssnl @VitalyFedyunin

@fmassa fmassa added module: dataloader Related to torch.utils.data.DataLoader and Sampler triaged This issue has been looked at by a team member, and triaged and prioritized into an appropriate module labels Sep 22, 2020
@ejguan ejguan linked a pull request Sep 24, 2020 that will close this issue
@loubnabnl

Hi @fmassa, I noticed that IterableDataset in torch 1.9 supports shuffling through IterableDataset().shuffle, but it’s not the case for the latest version of torch. I was curious why it was removed and if there is an alternative.
