
Sampler for IterableDataset #28743

Open
daniel-j-h opened this issue Oct 26, 2019 · 10 comments
Labels
module: dataloader (Related to torch.utils.data.DataLoader and Sampler), triaged (This issue has been looked at by a team member, and triaged and prioritized into an appropriate module)

Comments

@daniel-j-h

daniel-j-h commented Oct 26, 2019

🚀 Feature

IterableDataset is too restrictive in not allowing combination with samplers. Sampling from a stream is well understood and can be done on the fly; IterableDataset should support these use cases.

Motivation

The IterableDataset abstraction is great for representing a stream of data we want to iterate over in a forward-only fashion. Right now, though, it is not compatible with samplers. From the docs:

Neither sampler nor batch_sampler is compatible with iterable-style datasets, since such datasets have no notion of a key or an index.

Here are two different use-cases where sampling from an IterableDataset is necessary:

  1. The user knows the total size in advance

For example, I have one IterableDataset per video (yielding clips), and I know the number of frames for each video and the total number of videos in advance. I can sample k random clips with

import itertools
import random

# choose k distinct clip positions out of self.total, then turn them into a boolean mask
pick = set(random.sample(range(self.total), k))
mask = [i in pick for i in range(self.total)]

# walk all per-video datasets once, keeping only the masked positions
it = itertools.chain(*self.videos)
it = itertools.compress(it, mask)

and abstract over this in an IterableDataset to only walk once through all videos.

  2. The user does not know the total size in advance

For example, I have videos with clips but I don't want to / can't get the number of frames per video and therefore don't know the total size in advance. I can still sample k random clips out of an unknown n total clips, e.g. via reservoir sampling (see the sketch below), and only walk once through all videos.
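A minimal sketch of that reservoir-sampling approach, assuming hypothetical names (ReservoirClips, videos, k) rather than any existing torch API:

import itertools
import random

from torch.utils.data import IterableDataset

class ReservoirClips(IterableDataset):
    """Yield k clips drawn uniformly at random from a stream of unknown length."""

    def __init__(self, videos, k):
        self.videos = videos  # per-video iterables of clips; total length unknown
        self.k = k

    def __iter__(self):
        reservoir = []
        for i, clip in enumerate(itertools.chain(*self.videos)):
            if i < self.k:
                reservoir.append(clip)
            else:
                # keep the new clip with probability k / (i + 1)
                j = random.randint(0, i)
                if j < self.k:
                    reservoir[j] = clip
        # the sampled clips are only available after one full pass over the stream
        yield from reservoir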

What are your thoughts on this?

cc @ssnl

@ssnl
Collaborator

ssnl commented Oct 28, 2019

Hi, could you let us know how you'd imagine IterableDataset to work with samplers?

@daniel-j-h
Author

Sure,

  1. when the user knows the IterableDataset's size in advance, a sampler should be able to iterate the dataset and e.g. sub-sample it (similar to itertools.compress)

  2. when the user does not know the IterableDataset's size in advance, a sampler should be able to sub-sample while iterating, e.g. via the reservoir sampling technique (see the sketch below)
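To make this concrete, one possible shape is sketched here (SubsampledIterable and stream_sampler are hypothetical names, not existing torch API): a wrapper that applies a user-supplied stream sampler, i.e. a callable from iterator to iterator.

import random

from torch.utils.data import IterableDataset

class SubsampledIterable(IterableDataset):
    """Apply a stream sampler (iterator -> iterator) to a base IterableDataset."""

    def __init__(self, dataset, stream_sampler):
        self.dataset = dataset
        self.stream_sampler = stream_sampler

    def __iter__(self):
        return iter(self.stream_sampler(iter(self.dataset)))

# toy usage: keep each item independently with probability 0.1
class Range100(IterableDataset):
    def __iter__(self):
        return iter(range(100))

subsampled = SubsampledIterable(Range100(), lambda it: (x for x in it if random.random() < 0.1))

The known-size and unknown-size cases would then differ only in which stream sampler is passed in.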

@ssnl
Collaborator

ssnl commented Oct 28, 2019 via email

@zhangguanheng66 added the module: dataloader and triaged labels Oct 28, 2019
@thiagocrepaldi
Collaborator

#26547 includes a distributed sampler inside ChunkDataset (which inherits from IterableDataset).

We do that because we don't know the size of the dataset and can't fit it entirely in memory, so the sampler selects random chunks of unknown size, shuffles them in memory, and returns them to the DataLoader.

#28841 discusses the changes DistributedSampler would need in order to be used within ChunkDataset, or, as @cpuhrsch suggested, whether we could pass a sampler to the DataLoader constructor to be used by the IterableDataset; the DataLoader would still use its internal infinite sampler until StopIteration is raised.
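For readers unfamiliar with the pattern, shuffling an unbounded stream through a fixed-size in-memory buffer can be sketched roughly like this (this is not the actual ChunkDataset implementation from #26547, just an illustration of the idea):

import random

from torch.utils.data import IterableDataset

class ShuffleBuffer(IterableDataset):
    """Approximately shuffle a stream using a bounded in-memory buffer."""

    def __init__(self, dataset, buffer_size):
        self.dataset = dataset
        self.buffer_size = buffer_size

    def __iter__(self):
        buf = []
        for item in self.dataset:
            if len(buf) < self.buffer_size:
                buf.append(item)
            else:
                # emit a random buffered item and keep the new one in its place
                idx = random.randrange(self.buffer_size)
                yield buf[idx]
                buf[idx] = item
        random.shuffle(buf)
        yield from buf

A larger buffer_size gives a shuffle closer to uniform, at the cost of memory.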

@vincentqb
Contributor

vincentqb commented Nov 11, 2019

  2. The user does not know the total size in advance

For example, I have videos with clips but I don't want to / can't get the number of frames per video and therefore don't know the total size in advance. I can still sample k random clips out of an unknown n total clips, e.g. via reservoir sampling, and only walk once through all videos.

Reservoir sampling essentially lets you pick a uniformly distributed random batch of k samples from an iterable in a single pass, without knowing its length n ahead of time; the batch is only available after running through the whole list of items. If we want to map all items to some batch, we still have to go through the iterable n/k times using this approach. Is that what you are looking for?

Once we have gone through the list once, we of course know its length, and can revert to case 1 afterwards.

@titusnicolae

To continue @vincentqb's discussion: one cannot build a perfectly uniform sampler over a stream without memory comparable in size to the whole stream. For example, if we need to sample a batch of size 64 from the stream, each element has probability 1/2 of coming from the first half of the stream, so the probability that all 64 elements come from the first half is (1/2)^64, which is vanishingly small. Almost surely at least one element has to come from the second half of the stream, so you cannot yield the first batch without reading past the halfway point of the stream.
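For scale, a quick back-of-the-envelope check of that probability:

# chance that every element of a uniformly drawn batch of 64
# comes from the first half of the stream
p = 0.5 ** 64
print(p)  # ~5.4e-20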

@MaxHalford

Hey there. Just wanted to mention that I've released a little package for sampling from IterableDatasets. I'm happy for what I've done to fuel discussion on the topic, and to deprecate my package if its behaviour ever gets merged into PyTorch.

@rabeehkarimimahabadi

Hi,
I have multiple large-scale datasets and I need a distributed sampler to train the models on TPUs with PyTorch XLA. In the non-iterable (map-style) case one can write:

import torch
import torch_xla.core.xla_model as xm
from torch.utils.data import DistributedSampler, RandomSampler

def get_tpu_sampler(dataset: torch.utils.data.dataset.Dataset):
    if xm.xrt_world_size() <= 1:
        return RandomSampler(dataset)
    return DistributedSampler(dataset, num_replicas=xm.xrt_world_size(), rank=xm.get_ordinal())

Could you provide examples of how to implement this with iterable datasets?

thanks.
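Not an answer given in this thread, but a common workaround is to shard the stream inside the dataset itself, so that each replica reads every world_size-th element; rank and world_size could come from xm.get_ordinal() / xm.xrt_world_size() on TPU, or from torch.distributed. A rough sketch with hypothetical names:

import itertools

from torch.utils.data import IterableDataset

class ShardedIterable(IterableDataset):
    """Each replica reads a disjoint slice of the base stream: rank, rank + world_size, ..."""

    def __init__(self, dataset, rank, world_size):
        self.dataset = dataset
        self.rank = rank              # e.g. xm.get_ordinal() or dist.get_rank()
        self.world_size = world_size  # e.g. xm.xrt_world_size() or dist.get_world_size()

    def __iter__(self):
        # start at `rank` and step by `world_size`, so replicas see disjoint elements
        return itertools.islice(iter(self.dataset), self.rank, None, self.world_size)

This only shards the stream; any shuffling would still have to happen inside the base dataset (e.g. via a shuffle buffer as sketched above).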

@MrRobot2211

#28841 discusses the changes DistributedSampler would need in order to be used within ChunkDataset, or, as @cpuhrsch suggested, whether we could pass a sampler to the DataLoader constructor to be used by the IterableDataset; the DataLoader would still use its internal infinite sampler until StopIteration is raised.

Hi, I am having the same issue. Have you managed to do it?

@iliemihai

Distributed samplers with iterable datasets on TPU are a little bit problematic. I am also facing the same problem.
