Add IterableDataset #19228
Conversation
@pritamdamania87 has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator.
this is currently awaiting (1) #19421 and (2) filling in the docs
@pritamdamania87 doc is done!
@apaszke Soumith mentioned it'd be nice to have you review parts of this PR. I've looked through some of the new functionality added in this PR (IterableDataset), but it would be good if you can take a look at some of the changes to the existing code structure. Thanks!
@pritamdamania87 do let me know if this breaks anything internal
Looks like renaming pin_memory_batch breaks a bunch of stuff which does
@pritamdamania87 for the first error, I prefer fixing those who import private helpers. For the second, it'd be great if you can show/message me the trace, or at least the non-confidential parts. Thanks!
Summary: This is a modified version of pytorch#14705, since the commit structure of that PR is quite messy.

1. Add `IterableDataset`. We now have two data loader modes, `Iterable` and `Map`:
   - `Iterable` if the `dataset` is an instance of `IterableDataset`
   - `Map` otherwise
2. Add better support for non-batch loading (i.e., `batch_size=None` and `batch_sampler=None`). This is useful for things like bulk loading.
3. Refactor `DataLoaderIter` into two classes, `_SingleProcessDataLoaderIter` and `_MultiProcessingDataLoaderIter`. Rename some methods to be more generic, e.g., `get_batch` -> `get_data`.
4. Add `torch.utils.data.get_worker_info`, which returns worker information in a worker process (e.g., worker id, dataset object copy, etc.) and can be used in `IterableDataset.__iter__` and `worker_init_fn` to do per-worker configuration.
5. Add `ChainDataset`, the analog of `ConcatDataset` for `IterableDataset`.
6. Import `torch.utils.data` in `torch/__init__.py`.
7. Add data loader examples and documentation.
8. Use `get_worker_info` to detect whether we are in a worker process in `default_collate`.

Closes pytorch#17909, pytorch#18096, pytorch#19946, and some of pytorch#13023

Pull Request resolved: pytorch#19228
Reviewed By: bddppq
Differential Revision: D15058152
fbshipit-source-id: 9e081a901a071d7e4502b88054a34b450ab5ddde
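The distinction between the two loader modes can be illustrated with a framework-free sketch. The class names mirror `torch.utils.data`, but the dispatch logic below is a simplified illustration of the idea, not PyTorch's actual implementation:

```python
# Framework-free sketch of the two dataset kinds this PR distinguishes.
# Illustrative only; the real classes live in torch.utils.data.

class MapDataset:
    """Map-style: random access via __getitem__ plus __len__."""
    def __init__(self, items):
        self.items = list(items)

    def __getitem__(self, idx):
        return self.items[idx]

    def __len__(self):
        return len(self.items)


class IterableDataset:
    """Iterable-style: only defines __iter__; no indexing, no length."""
    def __init__(self, gen_fn):
        self.gen_fn = gen_fn

    def __iter__(self):
        return self.gen_fn()


def fetch_all(dataset):
    # The loader picks its mode from the dataset's type: iterable
    # datasets are consumed via __iter__, map datasets via indices
    # produced by a sampler (here, simply range(len(dataset))).
    if isinstance(dataset, IterableDataset):
        return list(iter(dataset))
    return [dataset[i] for i in range(len(dataset))]


assert fetch_all(MapDataset([1, 2, 3])) == [1, 2, 3]
assert fetch_all(IterableDataset(lambda: iter(range(3)))) == [0, 1, 2]
```

This mirrors why `batch_sampler` and `shuffle` don't apply in iterable mode: there are no indices to sample from, so the loader can only pull items in the order the dataset yields them.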
Seems like the PR is finished? Good job!
@bethunebtj - we're targeting 1.2 release at the end of July / early August hopefully. But you can always use nightlies :) |
Is this functionality available in nightly? |
@rfalcon100 yes
Summary: Back in April, malmaud added type annotations for `dataloader.py`. At about the same time, SsnL in #19228 replaced `_DataLoaderIter` with `_BaseDataLoaderIter` and two subclasses, `_SingleProcessDataLoaderIter` and `_MultiProcessingDataLoaderIter`. Probably because these changes happened in parallel at roughly the same time, the type stubs and several other references in the codebase were never updated to match this refactoring. I've gone ahead and done the updates to reflect the refactoring in #19228, which fixes the specific type stub/implementation mismatch pointed out in #26673, although not the broader problem that pytorch doesn't have a test to make sure that the `.pyi` type stub files match the real API defined in `.py` files. Pull Request resolved: #27105 Differential Revision: D17813641 Pulled By: ezyang fbshipit-source-id: ed7ac025c8d6ad3f298dd073347ec83bb4b6600c
This is a modified version of #14705, since the commit structure of that PR is quite messy.

1. Add `IterableDataset`. We now have two data loader modes, `Iterable` and `Map`:
   - `Iterable` if the `dataset` is an instance of `IterableDataset`
   - `Map` otherwise
2. Add better support for non-batch loading (i.e., `batch_size=None` and `batch_sampler=None`). This is useful for things like bulk loading.
3. Refactor `DataLoaderIter` into two classes, `_SingleProcessDataLoaderIter` and `_MultiProcessingDataLoaderIter`. Rename some methods to be more generic, e.g., `get_batch` -> `get_data`.
4. Add `torch.utils.data.get_worker_info`, which returns worker information in a worker process (e.g., worker id, dataset object copy, etc.) and can be used in `IterableDataset.__iter__` and `worker_init_fn` to do per-worker configuration.
5. Add `ChainDataset`, the analog of `ConcatDataset` for `IterableDataset`.
6. Import `torch.utils.data` in `torch/__init__.py`.
7. Add data loader examples and documentation.
8. Use `get_worker_info` to detect whether we are in a worker process in `default_collate`.

Closes #17909, #18096, #19946, and some of #13023
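As a rough illustration of the per-worker configuration that `get_worker_info` enables, here is a framework-free sketch. `WorkerInfo` below is a hypothetical stand-in for the object the real `torch.utils.data.get_worker_info()` returns (which is `None` in the main process); the sharding scheme is one common way to avoid every worker yielding duplicate data:

```python
# Sketch of per-worker sharding in the style of get_worker_info().
# WorkerInfo is a stand-in for illustration, not PyTorch's class.

from dataclasses import dataclass


@dataclass
class WorkerInfo:
    id: int           # this worker's index, 0 .. num_workers - 1
    num_workers: int  # total number of loader workers


def shard(start, end, info):
    """Give each worker a contiguous slice of the range [start, end)."""
    if info is None:
        # Main process (single-process loading): stream everything.
        return range(start, end)
    per_worker = -(-(end - start) // info.num_workers)  # ceiling division
    lo = start + info.id * per_worker
    hi = min(lo + per_worker, end)
    return range(lo, hi)


# Single-process loading yields the full range; with two workers,
# each yields a disjoint half, so nothing is duplicated.
assert list(shard(0, 5, None)) == [0, 1, 2, 3, 4]
assert list(shard(0, 5, WorkerInfo(0, 2))) == [0, 1, 2]
assert list(shard(0, 5, WorkerInfo(1, 2))) == [3, 4]
```

In real code the same split would be done inside `IterableDataset.__iter__` (or in `worker_init_fn`), using the id and worker count reported by `get_worker_info()`.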