
Sampler for IterableDataset #28743

Open
daniel-j-h opened this issue Oct 26, 2019 · 10 comments
Labels
module: dataloader (Related to torch.utils.data.DataLoader and Sampler), triaged (This issue has been looked at by a team member, and triaged and prioritized into an appropriate module)

Comments

@daniel-j-h

daniel-j-h commented Oct 26, 2019

🚀 Feature

IterableDataset is too restrictive in not allowing combination with samplers. Sampling from a stream is well understood and can be done on the fly; IterableDataset should support these use cases.

Motivation

The IterableDataset abstraction is great for representing a stream of data we want to iterate over in a forward-only fashion. Right now, though, it is not compatible with samplers. From the docs:

Neither sampler nor batch_sampler is compatible with iterable-style datasets, since such datasets have no notion of a key or an index.

Here are two different use-cases where sampling from an IterableDataset is necessary:

  1. The user knows the total size in advance

For example, I have one IterableDataset per video (yielding clips), and I know the number of frames for each video and the total number of videos in advance. I can sample k random clips with

import itertools
import random

# choose k distinct clip positions out of self.total, then turn them into a boolean mask
pick = set(random.sample(range(self.total), k))
mask = [i in pick for i in range(self.total)]

# walk all per-video datasets once, keeping only the masked positions
it = itertools.chain(*self.videos)
it = itertools.compress(it, mask)

and abstract over this in an IterableDataset to only walk once through all videos.

  2. The user does not know the total size in advance

For example, I have videos with clips but I don't want to / can't get the number of frames per video and therefore don't know the total size in advance. I can still sample k random clips out of an unknown n total clips, e.g. via reservoir sampling (see the sketch below), and only walk once through all videos.
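A minimal sketch of that reservoir-sampling approach, assuming hypothetical names (ReservoirClips, videos, k) rather than any existing torch API:

import itertools
import random

from torch.utils.data import IterableDataset

class ReservoirClips(IterableDataset):
    """Yield k clips drawn uniformly at random from a stream of unknown length."""

    def __init__(self, videos, k):
        self.videos = videos  # per-video iterables of clips; total length unknown
        self.k = k

    def __iter__(self):
        reservoir = []
        for i, clip in enumerate(itertools.chain(*self.videos)):
            if i < self.k:
                reservoir.append(clip)
            else:
                # keep the new clip with probability k / (i + 1)
                j = random.randint(0, i)
                if j < self.k:
                    reservoir[j] = clip
        # the sampled clips are only available after one full pass over the stream
        yield from reservoir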

What are your thoughts on this?

cc @ssnl

@ssnl
Collaborator

ssnl commented Oct 28, 2019

Hi, could you let us know how you'd imagine IterableDataset to work with samplers?

@daniel-j-h
Author

Sure,

  1. when the user knows the IterableDataset's size in advance, a sampler should be able to iterate the dataset and e.g. sub-sample it (similar to itertools.compress)

  2. when the user does not know the IterableDataset's size in advance, a sampler should be able to sub-sample while iterating, e.g. via the reservoir sampling technique (see the sketch below)
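To make this concrete, one possible shape is sketched here (SubsampledIterable and stream_sampler are hypothetical names, not existing torch API): a wrapper that applies a user-supplied stream sampler, i.e. a callable from iterator to iterator.

import random

from torch.utils.data import IterableDataset

class SubsampledIterable(IterableDataset):
    """Apply a stream sampler (iterator -> iterator) to a base IterableDataset."""

    def __init__(self, dataset, stream_sampler):
        self.dataset = dataset
        self.stream_sampler = stream_sampler

    def __iter__(self):
        return iter(self.stream_sampler(iter(self.dataset)))

# toy usage: keep each item independently with probability 0.1
class Range100(IterableDataset):
    def __iter__(self):
        return iter(range(100))

subsampled = SubsampledIterable(Range100(), lambda it: (x for x in it if random.random() < 0.1))

The known-size and unknown-size cases would then differ only in which stream sampler is passed in.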

@ssnl
Collaborator

ssnl commented Oct 28, 2019 via email

@zhangguanheng66 added the module: dataloader and triaged labels Oct 28, 2019
@thiagocrepaldi
Collaborator

#26547 includes a distributed sampler inside ChunkDataset (which inherits from IterableDataset).

We do that because we don't know the size of the dataset and can't fit it entirely in memory, so the sampler selects random chunks of unknown size, shuffles them in memory, and returns them to the DataLoader.

#28841 discusses the changes DistributedSampler would need in order to be used within ChunkDataset, or, as @cpuhrsch suggested, whether we could pass a sampler to the DataLoader constructor to be used by the IterableDataset; the DataLoader would still use its internal infinite sampler until StopIteration is raised.
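For readers unfamiliar with the pattern, shuffling an unbounded stream through a fixed-size in-memory buffer can be sketched roughly like this (this is not the actual ChunkDataset implementation from #26547, just an illustration of the idea):

import random

from torch.utils.data import IterableDataset

class ShuffleBuffer(IterableDataset):
    """Approximately shuffle a stream using a bounded in-memory buffer."""

    def __init__(self, dataset, buffer_size):
        self.dataset = dataset
        self.buffer_size = buffer_size

    def __iter__(self):
        buf = []
        for item in self.dataset:
            if len(buf) < self.buffer_size:
                buf.append(item)
            else:
                # emit a random buffered item and keep the new one in its place
                idx = random.randrange(self.buffer_size)
                yield buf[idx]
                buf[idx] = item
        random.shuffle(buf)
        yield from buf

A larger buffer_size gives a shuffle closer to uniform, at the cost of memory.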

@vincentqb
Contributor

vincentqb commented Nov 11, 2019

  2. The user does not know the total size in advance

For example, I have videos with clips but I don't want to / can't get the number of frames per video and therefore don't know the total size in advance. I can still sample k random clips out of an unknown n total clips, e.g. via reservoir sampling, and only walk once through all videos.

Reservoir sampling essentially lets you pick a uniformly distributed random batch of k samples from an iterable in a single pass, without knowing its length n ahead of time; the batch is only available after running through the whole list of items. If we want to map all items to some batch, we still have to go through the iterable n/k times using this approach. Is that what you are looking for?

Once we have gone through the list once, we of course know its length, and can revert to case 1 afterwards.

@titusnicolae

To continue @vincentqb's discussion: one cannot build a perfectly uniform sampler over a stream without memory comparable in size to the whole stream. For example, if we need to sample a batch of size 64 from the stream, each element has probability 1/2 of coming from the first half of the stream, so the probability that all 64 elements come from the first half is (1/2)^64, which is vanishingly small. Almost surely at least one element has to come from the second half of the stream, so you cannot yield the first batch without reading past the halfway point of the stream.
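For scale, a quick back-of-the-envelope check of that probability:

# chance that every element of a uniformly drawn batch of 64
# comes from the first half of the stream
p = 0.5 ** 64
print(p)  # ~5.4e-20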

@MaxHalford

Hey there. Just wanted to mention that I've released a little package for sampling from IterableDatasets. I'm happy for what I've done to fuel discussion on the topic, and to deprecate my package if its behaviour ever gets merged into PyTorch.

@rabeehkarimimahabadi

Hi,
I have multiple large-scale datasets and I need a distributed sampler to train the models on TPUs with PyTorch XLA. In the non-iterable (map-style) case one can write:

import torch
import torch_xla.core.xla_model as xm
from torch.utils.data import DistributedSampler, RandomSampler

def get_tpu_sampler(dataset: torch.utils.data.dataset.Dataset):
    if xm.xrt_world_size() <= 1:
        return RandomSampler(dataset)
    return DistributedSampler(dataset, num_replicas=xm.xrt_world_size(), rank=xm.get_ordinal())

Could you provide examples of how to implement this with iterable datasets?

thanks.
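Not an answer given in this thread, but a common workaround is to shard the stream inside the dataset itself, so that each replica reads every world_size-th element; rank and world_size could come from xm.get_ordinal() / xm.xrt_world_size() on TPU, or from torch.distributed. A rough sketch with hypothetical names:

import itertools

from torch.utils.data import IterableDataset

class ShardedIterable(IterableDataset):
    """Each replica reads a disjoint slice of the base stream: rank, rank + world_size, ..."""

    def __init__(self, dataset, rank, world_size):
        self.dataset = dataset
        self.rank = rank              # e.g. xm.get_ordinal() or dist.get_rank()
        self.world_size = world_size  # e.g. xm.xrt_world_size() or dist.get_world_size()

    def __iter__(self):
        # start at `rank` and step by `world_size`, so replicas see disjoint elements
        return itertools.islice(iter(self.dataset), self.rank, None, self.world_size)

This only shards the stream; any shuffling would still have to happen inside the base dataset (e.g. via a shuffle buffer as sketched above).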

@MrRobot2211

#28841 discusses the changes DistributedSampler would need in order to be used within ChunkDataset, or, as @cpuhrsch suggested, whether we could pass a sampler to the DataLoader constructor to be used by the IterableDataset; the DataLoader would still use its internal infinite sampler until StopIteration is raised.

Hi, I am having the same issue. Have you managed to do it?

@iliemihai

Distributed samplers with iterable datasets on TPU are a little bit problematic. I am also facing the same problem.
