
Add IterableDataset #19228

Closed · wants to merge 18 commits from the dl_no_sampler_test branch

Conversation

@ssnl (Collaborator) commented Apr 13, 2019

This is a modified version of #14705, since the commit structure of that PR is quite messy.

  1. Add IterableDataset.

  2. So we have two data loader modes: Iterable and Map.

    1. Iterable if the dataset is an instance of IterableDataset
    2. Map otherwise
  3. Add better support for non-batch loading (i.e., batch_size=None and batch_sampler=None). This is useful for things like bulk loading (see the second sketch below).

  4. Refactor DataLoaderIter into two classes, _SingleProcessDataLoaderIter and _MultiProcessingDataLoaderIter. Rename some methods to be more generic, e.g., get_batch -> get_data.

  5. Add torch.utils.data.get_worker_info, which returns worker information in a worker process (e.g., worker id, dataset object copy, etc.) and can be used in IterableDataset.__iter__ and worker_init_fn to do per-worker configuration (see the first sketch after this list).

  6. Add ChainDataset, which is the analog of ConcatDataset for IterableDataset.

  7. Import torch.utils.data in torch/__init__.py

  8. Add data loader examples and documentation.

  9. Use get_worker_info to detect whether we are in a worker process in default_collate
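
For context, here is a minimal sketch of how items 1, 2, and 5 compose. RangeIterableDataset is a hypothetical dataset written for illustration; the API it uses (IterableDataset, DataLoader, get_worker_info) is the torch.utils.data interface added in this PR, so it assumes a build that includes these changes.

```python
import math
import torch
from torch.utils.data import DataLoader, IterableDataset, get_worker_info


class RangeIterableDataset(IterableDataset):
    """Hypothetical stream over the integers in [start, end)."""

    def __init__(self, start, end):
        super().__init__()
        self.start = start
        self.end = end

    def __iter__(self):
        info = get_worker_info()
        if info is None:
            # Single-process loading: this process iterates the full range.
            iter_start, iter_end = self.start, self.end
        else:
            # Multi-process loading: each worker receives a copy of the dataset,
            # so we use the worker id to carve out a disjoint slice of the range.
            per_worker = int(math.ceil((self.end - self.start) / float(info.num_workers)))
            iter_start = self.start + info.id * per_worker
            iter_end = min(iter_start + per_worker, self.end)
        return iter(range(iter_start, iter_end))


if __name__ == "__main__":
    ds = RangeIterableDataset(start=3, end=7)
    # Because the DataLoader sees an IterableDataset, it runs in the iterable
    # mode: no sampler, each worker simply iterates its own shard.
    print(list(DataLoader(ds, num_workers=2)))  # tensors for 3..6, worker-interleaved
```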

Closes #17909, #18096, #19946, and some of #13023
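
As a companion sketch for items 3 and 6: ChunkStream below is a made-up dataset that yields pre-batched tensors; ChainDataset and batch_size=None are the pieces this PR adds, so again this assumes a build containing them.

```python
import torch
from torch.utils.data import ChainDataset, DataLoader, IterableDataset


class ChunkStream(IterableDataset):
    """Hypothetical stream whose items are already whole batches ("bulk loading")."""

    def __init__(self, num_chunks, chunk_size):
        super().__init__()
        self.num_chunks = num_chunks
        self.chunk_size = chunk_size

    def __iter__(self):
        for _ in range(self.num_chunks):
            # Imagine each chunk being read from disk or a database in one shot.
            yield torch.randn(self.chunk_size, 8)


# ChainDataset lazily concatenates iterable datasets, one after another.
stream = ChainDataset([ChunkStream(2, 32), ChunkStream(3, 32)])

# batch_size=None disables automatic batching: the loader hands each chunk
# through as-is instead of re-collating individual samples into batches.
loader = DataLoader(stream, batch_size=None)
for chunk in loader:
    print(chunk.shape)  # torch.Size([32, 8]), five chunks in total
```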

@ssnl force-pushed the dl_no_sampler_test branch 3 times, most recently from 7c9e030 to 746065b on April 17, 2019 at 00:29.
@facebook-github-bot (Contributor) left a comment:

@pritamdamania87 has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator.

@ssnl (Collaborator, Author) commented Apr 24, 2019

This is currently awaiting (1) #19421 and (2) filling in the docs.

@ssnl force-pushed the dl_no_sampler_test branch 3 times, most recently from 922cfc3 to fc53d79 on April 24, 2019 at 17:23.
Review threads were opened on: docs/source/data.rst, torch/utils/data/_utils/__init__.py, torch/utils/data/dataloader.py, torch/utils/data/_utils/signal_handling.py, torch/utils/data/_utils/worker.py, and torch/utils/data/dataset.py.
@ssnl (Collaborator, Author) commented Apr 29, 2019

@pritamdamania87 doc is done!

@ssnl force-pushed the dl_no_sampler_test branch 2 times, most recently from 42a783d to e50f65f on April 29, 2019 at 15:33.
@pritamdamania87 (Contributor) commented:

@apaszke Soumith mentioned it'd be nice to have you review parts of this PR. I've looked through some of the new functionality added in this PR (IterableDataset), but it would be good if you can take a look at some of the changes to the existing code structure. Thanks!

@facebook-github-bot (Contributor) left a comment:

@pritamdamania87 has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator.

@ssnl (Collaborator, Author) commented Apr 29, 2019

@pritamdamania87 do let me know if this breaks anything internal

@pritamdamania87 (Contributor) commented Apr 30, 2019

> @pritamdamania87 do let me know if this breaks anything internal

Looks like renaming pin_memory_batch breaks a bunch of code that does `from torch.utils.data._utils.pin_memory import pin_memory_batch`. I also see a lot of errors coming from child processes: `TypeError: unhashable type: 'list'`. Not sure what the source of those is, though.

@ssnl (Collaborator, Author) commented Apr 30, 2019

@pritamdamania87 for the first error, I'd prefer fixing the call sites that import private helpers. For the second, it'd be great if you could show or message me the trace, or at least the non-confidential parts. Thanks!

@facebook-github-bot (Contributor) left a comment:

@pritamdamania87 has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator.

pull bot pushed a commit to Pandinosaurus/pytorch that referenced this pull request Jun 21, 2019
Summary: same as the PR description above.
Pull Request resolved: pytorch#19228

Reviewed By: bddppq

Differential Revision: D15058152

fbshipit-source-id: 9e081a901a071d7e4502b88054a34b450ab5ddde
@ssnl deleted the dl_no_sampler_test branch on June 21, 2019 at 03:27.
iotamudelta pushed a commit to ROCm/pytorch that referenced this pull request Jun 21, 2019
(Commit summary and metadata identical to the pull bot commit above.)
@bethunebtj commented:

Seems like the PR is finished? Good job!
Btw, when will this be released? I'm looking forward to this. Again, thanks a lot!

@dzhulgakov (Collaborator) commented:

@bethunebtj - we're targeting the 1.2 release at the end of July / early August, hopefully. But you can always use the nightlies :)

@rfalcon100 commented:

Is this functionality available in nightly?

@soumith (Member) commented Jul 18, 2019

@rfalcon100 yes

facebook-github-bot pushed a commit that referenced this pull request Oct 8, 2019
Summary:
Back in April, malmaud added type annotations for `dataloader.py`. At about the same time, SsnL in #19228 replaced `_DataLoaderIter` with `_BaseDataLoaderIter` and two subclasses, `_SingleProcessDataLoaderIter` and `_MultiProcessingDataLoaderIter`. Probably because these changes happened in parallel, the type stubs and several other references in the codebase were never updated to match the refactoring.

I've gone ahead and updated them to reflect the refactoring in #19228, which fixes the specific type stub/implementation mismatch pointed out in #26673, although not the broader problem that pytorch doesn't have a test to make sure that the `.pyi` type stub files match the real API defined in `.py` files.
Pull Request resolved: #27105

Differential Revision: D17813641

Pulled By: ezyang

fbshipit-source-id: ed7ac025c8d6ad3f298dd073347ec83bb4b6600c
thiagocrepaldi pushed a commit to thiagocrepaldi/pytorch that referenced this pull request Feb 4, 2020
(Commit summary and metadata identical to the facebook-github-bot commit above.)
@kuraga (Contributor) commented Mar 21, 2024

Labels
module: dataloader (Related to torch.utils.data.DataLoader and Sampler)
module: docs (Related to our documentation, both in docs/ and docblocks)
module: internals (Related to internal abstractions in c10 and ATen)
module: multiprocessing (Related to torch.multiprocessing)
open source
Projects
None yet
Development

Successfully merging this pull request may close these issues.

custome collate function cannot get _use_shared_memory