
[Feature Request] ShuffleDataset #45114

Closed
fmassa opened this issue Sep 22, 2020 · 1 comment
Assignees
Labels
module: dataloader Related to torch.utils.data.DataLoader and Sampler triaged This issue has been looked at by a team member, and triaged and prioritized into an appropriate module

Comments

@fmassa
Member

fmassa commented Sep 22, 2020

🚀 Feature

Provide a ShuffleDataset, an IterableDataset that can be composed with a user-defined dataset to perform random shuffling.

The simplest way to implement this would be to follow the shuffling strategy from TensorFlow: keep a buffer of dataset elements and, once the buffer is full, pop a random element from the buffer and replace it with the next element from the underlying dataset.

The user could then add shuffling to their own dataset as follows:

dataset = MyCustomIterableDataset(...)
dataset = ShuffleDataset(dataset, buffer_size=1000)
for data in dataset:
    # elements arrive in shuffled order relative to the underlying dataset
    ...
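The buffer strategy described above can be sketched independently of torch as a plain generator; buffered_shuffle and its signature are illustrative names, not an existing API. Note the shuffle is only approximate: an element can move at most roughly buffer_size positions ahead of where it started.

```python
import random

def buffered_shuffle(iterable, buffer_size, seed=None):
    """Yield items from `iterable` in approximately shuffled order.

    Fills a fixed-size buffer; once full, each incoming item evicts a
    randomly chosen buffered item, which is yielded. The leftover buffer
    is shuffled and drained at the end.
    """
    rng = random.Random(seed)  # picklable, unlike torch.Generator
    buffer = []
    for item in iterable:
        if len(buffer) < buffer_size:
            buffer.append(item)
        else:
            idx = rng.randrange(buffer_size)
            yield buffer[idx]
            buffer[idx] = item
    rng.shuffle(buffer)
    yield from buffer
```

A ShuffleDataset could wrap exactly this loop in its __iter__, delegating to the underlying dataset's iterator.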

We might want to expose a generator argument on the dataset, in order to control how the shuffling is seeded. Care must be taken here because torch.Generator is not picklable (although random.Random seems to be), so it wouldn't play well with our DataLoader implementation, which pickles the dataset to send it to multiple worker processes.
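As a quick check of the picklability point, random.Random round-trips through pickle with its full internal state intact, which is what would let a dataset holding one be shipped to DataLoader worker processes:

```python
import pickle
import random

rng = random.Random(42)
rng.random()  # advance the state once

# Pickling captures the generator's exact internal state.
clone = pickle.loads(pickle.dumps(rng))

# The clone resumes from the same state, so both draw the same next value.
assert clone.random() == rng.random()
```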

Motivation

IterableDataset in PyTorch is a family of Datasets that implement the __iter__ method, in contrast to map-style datasets, which implement __getitem__ and __len__.
All sampling strategies in PyTorch are focused on map-style datasets via the Sampler abstractions. Those abstractions do not work on IterableDatasets because they require knowing the __len__ of the dataset and assume that random access to the dataset is available.
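Stripped of the torch base classes, the two protocols reduce to the following (class names here are illustrative, not PyTorch API): a Sampler can permute indices for the first kind, but has nothing to index into for the second.

```python
class MapStyleDataset:
    """Random access: a Sampler can request any index in [0, len)."""
    def __init__(self, data):
        self.data = data

    def __getitem__(self, index):
        return self.data[index]

    def __len__(self):
        return len(self.data)


class StreamDataset:
    """Iteration only: no length and no random access, so Samplers
    cannot reorder it — shuffling must happen inside the stream."""
    def __init__(self, source):
        self.source = source

    def __iter__(self):
        return iter(self.source)
```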

As of now, users are required to implement dataset shuffling on their own, over and over. We should provide some basic shuffling strategies in PyTorch for this, given how widespread this operation is.

Pitch

Provide a ShuffleDataset class, inheriting from IterableDataset, which gives a basic random shuffling strategy based on a buffer_size.

Alternatives

There are currently no alternatives available natively in PyTorch.

Additional context

This has been requested / discussed multiple times in the past, see #28743 for one example.

cc @ssnl @VitalyFedyunin

@fmassa fmassa added module: dataloader Related to torch.utils.data.DataLoader and Sampler triaged This issue has been looked at by a team member, and triaged and prioritized into an appropriate module labels Sep 22, 2020
@ejguan ejguan linked a pull request Sep 24, 2020 that will close this issue
@loubnabnl

Hi @fmassa, I noticed that IterableDataset in torch 1.9 supports shuffling through IterableDataset().shuffle, but it’s not the case for the latest version of torch. I was curious why it was removed and if there is an alternative.
