[wip, test CI] Add IterableDataset #14705
Conversation
Can you please elaborate on the final design of iterable datasets? As I've mentioned in some PRs to the C++ Dataset API, I think that our current DataLoader story is completely mismatched with the desire to have mutable datasets (including iterable datasets). In particular, you can't simply replicate the dataset and have all workers produce the same samples, because that would be silly. Similarly, the Sampler is completely useless in this case. I really think we should work on alternative solutions for this API.
def __init__(self, sizes_for_all_workers):
    self.sizes_for_all_workers = sizes_for_all_workers

def __iter__(self):
@apaszke Basically you can configure each dataset replica differently in two different ways using `torch.utils.data.get_worker_info()`, which returns the worker's `id`, `seed`, and `dataset` replica:
- In `__iter__` for an iterable dataset (e.g., this example).
- In `worker_init_fn`.
This requires users to write their dataset code with multiprocessing data loading in mind. But I think this is a reasonable requirement because (1) there is no general way to split an iterator across multiple processes, and (2) if people want to do sharding / bulk loading, how to split work among workers should be part of the design.
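To make the per-worker configuration concrete, here is a minimal sketch (plain Python, no torch dependency) of the sharding arithmetic an `IterableDataset.__iter__` could perform. The `worker_id` and `num_workers` parameters stand in for the fields that `torch.utils.data.get_worker_info()` would provide inside a worker process; `worker_shard` is a hypothetical helper, not part of the API.

```python
def worker_shard(start, end, worker_id, num_workers):
    """Return the half-open [lo, hi) slice of [start, end) that this
    worker should iterate, so workers cover disjoint parts of the stream."""
    per_worker = -(-(end - start) // num_workers)  # ceil division
    lo = start + worker_id * per_worker
    hi = min(lo + per_worker, end)
    return lo, hi

# With 3 workers over range(10), the shards partition the stream:
shards = [worker_shard(0, 10, w, 3) for w in range(3)]
samples = [i for lo, hi in shards for i in range(lo, hi)]
assert samples == list(range(10))  # every sample produced exactly once
```

Inside a real `__iter__`, the dataset would call `get_worker_info()`, fall back to iterating everything when it returns `None` (single-process loading), and otherwise iterate only its own shard.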
Haven't read the new multiprocess iterator, but I had a few questions
@@ -70,47 +79,61 @@ class DataLoader(object):
     __initialized = False

     def __init__(self, dataset, batch_size=1, shuffle=False, sampler=None,
-                 batch_sampler=None, num_workers=0, collate_fn=default_collate,
+                 batch_sampler=None, num_workers=0,
+                 convert_fn=_utils.collate.default_convert,
Why can't the conversion be handled by `collate_fn`?
With `collate_fn`, the input is assumed to be a list of many data samples, and you want to collate each field in the data structure of each element. E.g., `[([1, 2], [3, 4]), ([5, 6], [7, 8])]` should become `(tensor([[1, 2], [5, 6]]), tensor([[3, 4], [7, 8]]))`. It does both elementwise conversion and collation.
However, when the dataset is an iterable, we want to only do conversion, treating the entire input as a single data sample. E.g., `[([1, 2], [3, 4]), ([5, 6], [7, 8])]` should become a single `tensor([[[1, 2], [5, 6]], [[3, 4], [7, 8]]])`.
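To illustrate the distinction without a torch dependency, here is a toy sketch using lists in place of tensors. `toy_collate` and `toy_convert` are illustrative stand-ins I made up for this thread, not the real `default_collate` / `default_convert` implementations (which additionally turn the leaves into tensors).

```python
def toy_collate(batch):
    """Stack corresponding fields across the samples in a batch:
    a list of tuples becomes a tuple of stacked fields."""
    return tuple(list(field) for field in zip(*batch))

def toy_convert(data):
    """Treat the entire input as one sample: no per-field stacking.
    (The real default_convert would also convert leaves to tensors.)"""
    return list(data)

batch = [([1, 2], [3, 4]), ([5, 6], [7, 8])]
# Collation regroups by field across samples...
assert toy_collate(batch) == ([[1, 2], [5, 6]], [[3, 4], [7, 8]])
# ...while conversion leaves the per-sample structure alone.
assert toy_convert(batch) == [([1, 2], [3, 4]), ([5, 6], [7, 8])]
```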
     else:
-        return batch
+        return data
Why the rename?
Because the data loader can be used to load individual samples as well, and this PR provides better support for that use case. `batch` seems to assume that the data is always batched.
torch/utils/data/dataloader.py
Outdated
if self.mode == DataLoaderMode.Map:
    data = self.convert_fn(self.dataset[index])
else:
    # mode == DataLoaderMode.MapWithBatchedRead:
Why do we need both `Map` and `MapWithBatchedRead`, when we only had a single mode before?
Because previously there was no easy way to do unbatched loading from a map-style dataset. One had to use `batch_size=1` and provide a custom `collate_fn` that is just `lambda x: x[0]`. That feels counterintuitive to me. With all the other changes in this PR, this was a simple addition, so I made it.
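To see why the old workaround was needed, here is a plain-Python sketch of the loading loop. `fake_loader` is a hypothetical stand-in for the DataLoader's inner fetch-and-collate step, not the real implementation: with `batch_size=1`, `collate_fn` always receives a one-element list, so `lambda x: x[0]` exists purely to unwrap it.

```python
def fake_loader(dataset, batch_size, collate_fn):
    """Toy stand-in for a DataLoader loop: slice the dataset into
    batches and pass each batch through collate_fn."""
    for i in range(0, len(dataset), batch_size):
        yield collate_fn(dataset[i:i + batch_size])

data = ["a", "b", "c"]
# Old workaround: batch_size=1 plus an unwrapping collate_fn
# recovers individual samples.
assert list(fake_loader(data, 1, lambda x: x[0])) == ["a", "b", "c"]
# Without the custom collate_fn, every "sample" arrives wrapped
# in a singleton list.
assert list(fake_loader(data, 1, lambda x: x)) == [["a"], ["b"], ["c"]]
```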
Another reason is to provide better support for the case where the sampler provides a list of indices, and the dataset does bulk loading.
@apaszke There is no new multiprocessing iterator design. I just refactored it into two classes, each handling single- or multi-process loading. The logic is exactly the same.
@pytorchbot retest this please
Add `ChainDataset` (the analog of `ConcatDataset` but for `IterableDataset`) and tests. Add doc entries. Make `torch.utils.data.*` usable without another import after `import torch`. Make `IterableDataset` return `NotImplemented` in `__len__` so fallbacks of some functions work.
close in favor of #19228 |
Summary: This is a modified version of pytorch/pytorch#14705 since the commit structure of that PR is quite messy.
1. Add `IterableDataset`.
2. So we have two data loader modes: `Iterable` and `Map`.
   - `Iterable` if the `dataset` is an instance of `IterableDataset`.
   - `Map` otherwise.
3. Add better support for non-batch loading (i.e., `batch_size=None` and `batch_sampler=None`). This is useful for doing things like bulk loading.
4. Refactor `DataLoaderIter` into two classes, `_SingleProcessDataLoaderIter` and `_MultiProcessingDataLoaderIter`. Rename some methods to be more generic, e.g., `get_batch` -> `get_data`.
5. Add `torch.utils.data.get_worker_info`, which returns worker information in a worker process (e.g., worker id, dataset object copy, etc.) and can be used in `IterableDataset.__iter__` and `worker_init_fn` to do per-worker configuration.
6. Add `ChainDataset`, which is the analog of `ConcatDataset` for `IterableDataset`.
7. Import `torch.utils.data` in `torch/__init__.py`.
8. Add data loader examples and documentation.
9. Use `get_worker_info` to detect whether we are in a worker process in `default_collate`.
Closes pytorch/pytorch#17909, pytorch/pytorch#18096, pytorch/pytorch#19946, and some of pytorch/pytorch#13023.
Pull Request resolved: pytorch/pytorch#19228
Reviewed By: bddppq
Differential Revision: D15058152
fbshipit-source-id: 9e081a901a071d7e4502b88054a34b450ab5ddde
- Add `IterableDataset`.
- Support non-batched loading from traditional map-style datasets. This is useful for doing bulk loading.
- So we have three data loader modes: `Iterable` (newly added), `Map` (newly added), and `MapWithBatchedRead` (old).
  - `Iterable` if the `dataset` is an instance of `IterableDataset`.
  - `Map` if `batch_size` is `None`.
  - `MapWithBatchedRead` otherwise.
- Refactor `DataLoaderIter` into two classes, `_SingleProcessDataLoaderIter` and `_MultiProcessingDataLoaderIter`. Rename some methods to be more generic, e.g., `get_batch` -> `get_data`.
- Add `torch.utils.data.get_worker_info`, which returns worker information in a worker process (e.g., worker id, dataset object copy, etc.) and can be used in `IterableDataset.__iter__` and `worker_init_fn` to do per-worker configuration.
- Add `ChainDataset`, which is the analog of `ConcatDataset` for `IterableDataset`.
- Add `convert_fn`, which is meant to convert each loaded sample into tensors. In the `MapWithBatchedRead` mode, fetched data is first converted using `convert_fn` and then collated. This shouldn't be much slower (if at all) than the old approach of only using `collate_fn`.
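The chaining behavior described for `ChainDataset` can be sketched in plain Python. `ToyChainDataset` is an illustrative stand-in for the semantics only; the real `torch.utils.data.ChainDataset` subclasses `IterableDataset` and is handed the constituent iterable datasets in the same way.

```python
from itertools import chain

class ToyChainDataset:
    """Toy analog of ChainDataset: iterate the constituent iterable
    datasets one after another, without materializing them."""

    def __init__(self, *datasets):
        self.datasets = datasets

    def __iter__(self):
        # Yield every sample of the first dataset, then the second, etc.
        return chain.from_iterable(self.datasets)

combined = ToyChainDataset(range(3), ["x", "y"])
assert list(combined) == [0, 1, 2, "x", "y"]
```

This is the streaming counterpart of `ConcatDataset`: because iterable datasets may not support `len()` or random access, samples are consumed in order rather than indexed.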