🚀 Feature
Provide a `ShuffleDataset` `IterableDataset` that can be composed with a user-defined dataset for performing random shuffling.
The simplest way to implement this would be to follow the shuffling strategy from TensorFlow, which keeps a buffer of dataset elements, and once the buffer is full, randomly pop one element of the buffer and replace the popped element with a new element from the dataset.
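The buffer strategy above can be sketched in a few lines of plain Python. This is only an illustration of the idea, not PyTorch API: the class below does not subclass `torch.utils.data.IterableDataset`, and the `rng` parameter is hypothetical (the actual proposal discusses a `generator` argument separately).

```python
import random


class ShuffleDataset:
    """Buffered shuffling over any iterable, TensorFlow-style.

    Keeps a buffer of up to `buffer_size` elements; once the buffer is
    full, a random element is popped and replaced with the next element
    from the underlying dataset.
    """

    def __init__(self, dataset, buffer_size, rng=None):
        self.dataset = dataset
        self.buffer_size = buffer_size
        self.rng = rng if rng is not None else random.Random()

    def __iter__(self):
        buf = []
        for item in self.dataset:
            if len(buf) < self.buffer_size:
                # Fill the buffer before emitting anything.
                buf.append(item)
            else:
                # Pop a random element and replace it with the new one.
                idx = self.rng.randrange(self.buffer_size)
                yield buf[idx]
                buf[idx] = item
        # Drain the remaining buffer in random order.
        self.rng.shuffle(buf)
        yield from buf
```

Note that with `buffer_size=1` this degenerates to the original order, and the shuffle only becomes a full permutation when `buffer_size` is at least the dataset length; the quality of the shuffle is a trade-off against memory.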
The user could then implement shuffling on their own dataset as follows:
```python
dataset = MyCustomIterableDataset(...)
dataset = ShuffleDataset(dataset, buffer_size=1000)
for data in dataset:
    # order of elements in dataset is shuffled compared to initial dataset
    ...
```
We might want to expose a `generator` argument to the dataset, in order to control how the shuffling is seeded. Care must be taken here because `torch.Generator` is not picklable (although `random.Random` seems to be), so it wouldn't play well with our DataLoader implementation, which serializes the dataset via pickle to send it to the worker processes.
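As a quick check of the pickling point above, `random.Random` round-trips through pickle with its state intact, which is what sending a seeded dataset to DataLoader worker processes would require:

```python
import pickle
import random

rng = random.Random(42)
# random.Random carries its full generator state through pickle.
clone = pickle.loads(pickle.dumps(rng))

# Both generators now produce identical streams.
assert [rng.random() for _ in range(3)] == [clone.random() for _ in range(3)]
```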
Motivation
IterableDataset in PyTorch is a family of Datasets that implement the `__iter__` method, in contrast to Map-Style datasets, which implement `__getitem__` and `__len__`.
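For illustration, the two protocols can be sketched with plain Python classes (toy classes for this issue, not `torch.utils.data` types):

```python
class MapStyleDataset:
    # Map-style: random access via __getitem__, plus a known __len__.
    def __init__(self, data):
        self.data = data

    def __getitem__(self, idx):
        return self.data[idx]

    def __len__(self):
        return len(self.data)


class StreamDataset:
    # Iterable-style: only sequential access via __iter__;
    # no known length and no random access, e.g. a stream or generator.
    def __init__(self, source):
        self.source = source

    def __iter__(self):
        return iter(self.source)
```

A Sampler can draw arbitrary indices from `MapStyleDataset`, but `StreamDataset` can only be consumed front to back, which is why buffered shuffling is the natural fit for it.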
All sampling strategies in PyTorch are focused on Map-Style datasets via the `Sampler` abstractions. Those abstractions do not work on IterableDatasets because they require knowing the `__len__` of the dataset and assume that random access to the dataset is available.
As of now, users are required to implement dataset shuffling on their own, over and over. We should provide some basic shuffling strategies in PyTorch for this, given how widespread this operation is.
Pitch
Provide a `ShuffleDataset` class, inheriting from `IterableDataset`, which gives a basic random shuffling strategy based on a `buffer_size`.
Alternatives
There are currently no alternatives available natively in PyTorch.
Additional context
This has been requested / discussed multiple times in the past, see #28743 for one example.

cc @ssnl @VitalyFedyunin
fmassa added the `module: dataloader` and `triaged` labels on Sep 22, 2020
Hi @fmassa, I noticed that `IterableDataset` in torch 1.9 supports shuffling through `IterableDataset().shuffle`, but it's not the case for the latest version of torch. I was curious why it was removed and if there is an alternative.