Sampler for IterableDataset #28743
Hi, could you let us know how you'd imagine IterableDataset to work with samplers?
Sure,
1. when the user knows the IterableDataset's size in advance, a sampler should be able to iterate the dataset and e.g. sub-sample it (similar to itertools.compress)
2. when the user does not know the IterableDataset's size in advance, a sampler should be able to e.g. sub-sample while iterating, as can be achieved with the reservoir sampling technique
These definitely make sense and are valid uses! They are a bit different from the current sampler interface in PyTorch, though, since PyTorch samplers are used for sampling keys before data loading rather than sampling the data after obtaining it. The currently recommended way to implement the sub-sampling procedure you described is to put it directly in your IterableDataset code. Do you think that is a reasonable alternative?
I think we should probably also provide some general IterableDataset manipulation tools, e.g. counterparts to itertools and reservoir sampling. That requires more discussion. Feel free to open an issue about it if you also think they should exist. :)
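A minimal sketch of that recommendation (sub-sampling done directly inside the IterableDataset); the wrapper class and its keep-probability parameter are illustrative, not part of the PyTorch API:

```python
import random
from torch.utils.data import IterableDataset, DataLoader

class SubsampledIterableDataset(IterableDataset):
    """Illustrative wrapper: keep each item of an underlying iterable with
    probability p, decided on the fly while streaming (similar in spirit to
    itertools.compress with a random selector)."""

    def __init__(self, source, p=0.1, seed=0):
        self.source = source  # any iterable of samples
        self.p = p            # probability of keeping each item
        self.seed = seed

    def __iter__(self):
        rng = random.Random(self.seed)
        for item in self.source:
            if rng.random() < self.p:
                yield item

# usage sketch (note: with num_workers > 0 each worker would also need its own shard)
# loader = DataLoader(SubsampledIterableDataset(range(1000), p=0.05), batch_size=8)
```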
#26547 includes a distributed sampler inside the dataset itself. We do that because we don't know the size of the dataset and we can't fit it entirely in memory, so the sampler selects random chunks of unknown size, shuffles them in memory, and returns the samples. #28841 discusses the required changes to support this.
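A minimal sketch of the chunk-based shuffling described above, under the assumption that chunks are just fixed-size buffers of samples (the actual classes proposed in #26547 / #28841 may look different):

```python
import random
from torch.utils.data import IterableDataset

class ChunkShuffleDataset(IterableDataset):
    """Illustrative sketch: stream fixed-size chunks from a source of unknown
    total size, shuffle each chunk in memory, and yield individual samples."""

    def __init__(self, source, chunk_size=1024, seed=0):
        self.source = source          # any (possibly very large) iterable of samples
        self.chunk_size = chunk_size  # how many samples fit in memory at once
        self.rng = random.Random(seed)

    def __iter__(self):
        chunk = []
        for sample in self.source:
            chunk.append(sample)
            if len(chunk) == self.chunk_size:
                self.rng.shuffle(chunk)
                yield from chunk
                chunk = []
        if chunk:  # flush the last, possibly smaller, chunk
            self.rng.shuffle(chunk)
            yield from chunk
```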
Reservoir sampling essentially allows you to pick a uniformly distributed random batch of k elements from a stream of unknown length in a single pass. Once we go through the list once, we of course know its length and can revert to case 1 afterwards.
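For reference, a minimal reservoir-sampling sketch (Algorithm R), not an interface proposed in this issue: a single pass over a stream of unknown length maintains a buffer of k elements that ends up being a uniform random sample.

```python
import random

def reservoir_sample(stream, k, seed=0):
    """Draw k elements uniformly at random from a stream of unknown length,
    using one pass and O(k) memory (Algorithm R)."""
    rng = random.Random(seed)
    reservoir = []
    for i, item in enumerate(stream):
        if i < k:
            reservoir.append(item)
        else:
            j = rng.randint(0, i)  # uniform over 0..i inclusive
            if j < k:
                reservoir[j] = item
    return reservoir

# e.g. pick 64 clips from an IterableDataset of unknown length
# batch = reservoir_sample(iter(my_iterable_dataset), k=64)
```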
To continue @vincentqb's discussion: one cannot build a perfectly uniform sampler over a stream without memory of size comparable to the whole stream. For example, if we need to sample a batch of size 64 from the stream, the probability that all 64 elements come from the first half of the stream is (1/2)^64, which is vanishingly small. So, with overwhelming probability, at least one element of the batch comes from the second half of the stream, meaning you cannot yield the first batch without iterating past half of the stream.
Hey there. Just wanted to mention that I've just released a little package to perform sampling on IterableDatasets.
Hi, given
def get_tpu_sampler(dataset: torch.utils.data.dataset.Dataset): ...
could you provide me with examples of how to implement this with iterable datasets? Thanks.
Hi, I am having the same issue. Have you managed to do it?
Distributed samplers with iterable datasets on TPU are a little bit problematic. I am also facing the same problem.
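One common workaround (a sketch, not an official recipe) is to drop the distributed sampler entirely and shard the stream inside the IterableDataset by rank and world size; here rank and world_size are assumed to be supplied by the launcher, e.g. xm.get_ordinal() and xm.xrt_world_size() on TPU.

```python
import itertools
from torch.utils.data import IterableDataset, get_worker_info

class ShardedIterableDataset(IterableDataset):
    """Sketch: give each replica (and each DataLoader worker) a disjoint,
    strided slice of the stream instead of using a distributed sampler."""

    def __init__(self, source, rank, world_size):
        self.source = source          # iterable of samples
        self.rank = rank              # e.g. xm.get_ordinal() on TPU
        self.world_size = world_size  # e.g. xm.xrt_world_size() on TPU

    def __iter__(self):
        worker = get_worker_info()
        num_workers = worker.num_workers if worker else 1
        worker_id = worker.id if worker else 0
        stride = self.world_size * num_workers
        offset = self.rank * num_workers + worker_id
        return itertools.islice(iter(self.source), offset, None, stride)
```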
🚀 Feature
The IterableDataset is too restrictive in that it does not allow combination with samplers. Sampling from a stream is well understood and can be done on the fly; IterableDataset should support these use cases.
Motivation
The IterableDataset abstraction is great for representing a stream of data we want to iterate over in a forward-only fashion. Right now it is not compatible with samplers, though; the DataLoader docs state that the sampler options are not compatible with iterable-style datasets.
Here are two different use-cases where sampling from an IterableDataset is necessary:
1. The dataset's size is known in advance. For example, I have one IterableDataset per video (yielding clips), and I know the number of frames for each video and the total number of videos in advance. I can sample k random clips and abstract over this in an IterableDataset to only walk once through all videos (see the sketch below).
2. The dataset's size is not known in advance. For example, I have videos with clips, but I don't want to (or can't) get the number of frames per video and therefore don't know the total size in advance. I can still sample k random clips out of an unknown n total clips, e.g. via reservoir sampling, and only walk once through all videos.
What are your thoughts on this?
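A sketch of use case 1, under the assumption that per-video clip counts are known up front (all names here are illustrative): pre-draw k global clip indices, then walk the videos exactly once and yield only the selected clips.

```python
import random
from torch.utils.data import IterableDataset

class KnownSizeClipSampler(IterableDataset):
    """Illustrative use case 1: clip counts per video are known in advance, so
    we can pre-select k global clip indices and stream through all videos once,
    yielding only the chosen clips."""

    def __init__(self, video_datasets, clips_per_video, k, seed=0):
        self.video_datasets = video_datasets  # one iterable of clips per video
        total = sum(clips_per_video)          # known total number of clips
        rng = random.Random(seed)
        self.selected = set(rng.sample(range(total), k))  # k distinct global indices

    def __iter__(self):
        global_idx = 0
        for video in self.video_datasets:
            for clip in video:
                if global_idx in self.selected:
                    yield clip
                global_idx += 1
```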
cc @ssnl