
DataLoader for Large Corpus File #19946

Closed
zegzag opened this issue Apr 30, 2019 · 3 comments
Assignees
Labels
feature A request for a proper, new feature. module: dataloader Related to torch.utils.data.DataLoader and Sampler triaged This issue has been looked at by a team member, and triaged and prioritized into an appropriate module

Comments

@zegzag

zegzag commented Apr 30, 2019

🚀 Feature

  A DataLoader for large corpus files, supporting multi-process and multi-threaded loading and memory optimization.

Motivation

  Thanks a lot for the PyTorch framework; I've benefited a great deal from using it in my work. I do, however, have a small point of confusion regarding data loading and processing.
  In NLP use cases, suppose I have a very large labeled corpus file, "corpus.csv", which is too large to load into memory at once, or may even be effectively unbounded. I want to train my model on this dataset from top to bottom in mini-batches: each batch yields a certain number of text lines, from the beginning of the file to the end (so shuffling is not supported). I can do this easily with a Python generator. However, I don't know how to write a subclass of torch.utils.data.Dataset and use torch.utils.data.DataLoader to build a custom dataset on top of "corpus.csv": since the file cannot be loaded into memory in its entirety, I cannot implement __getitem__ and __len__.
  I would be grateful if you could develop a module that handles this case. With only a Python generator, it is too hard for me to write a DataLoader that supports multi-process and multi-threaded loading and memory optimization.
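
For illustration, a minimal sketch of the plain-generator approach described above; the two-column CSV layout, file name, and batch size are assumptions:

```python
import csv

def batch_generator(path="corpus.csv", batch_size=32):
    """Stream rows from a large CSV and yield mini-batches without
    loading the whole file into memory (no shuffling, top to bottom)."""
    batch = []
    with open(path, newline="") as f:
        for row in csv.reader(f):  # e.g. row = [text, label]
            batch.append(row)
            if len(batch) == batch_size:
                yield batch
                batch = []
    if batch:  # final partial batch
        yield batch
```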

Alternatives

  With multiprocessing, one could train the model on the current batch while loading the next batch at the same time.
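
As an illustrative, hedged sketch of that alternative (simplified to a background thread; a multiprocessing variant follows the same pattern), upcoming batches can be prefetched into a bounded queue while the main loop trains on the current one. `batch_generator` refers to the sketch above:

```python
import queue
import threading

def prefetch(batch_iter, max_prefetch=2):
    """Load upcoming batches in a background thread while training proceeds."""
    buf = queue.Queue(maxsize=max_prefetch)
    done = object()  # sentinel marking the end of the stream

    def producer():
        for batch in batch_iter:
            buf.put(batch)
        buf.put(done)

    threading.Thread(target=producer, daemon=True).start()
    while True:
        batch = buf.get()
        if batch is done:
            return
        yield batch

# for batch in prefetch(batch_generator()):
#     train_step(batch)  # hypothetical training step
```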

@ssnl
Collaborator

ssnl commented Apr 30, 2019

This is definitely a valid feature request. In fact, I implemented something called IterableDataset that will be used as an iterable (e.g., generator, data stream) in PyTorch. It is currently being reviewed at #19228 .
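
As a rough sketch of how such an `IterableDataset` could wrap the streaming corpus from the issue (interface as proposed in #19228; the CSV parsing, file name, and batch size are assumptions):

```python
import csv
from torch.utils.data import DataLoader, IterableDataset

class CorpusDataset(IterableDataset):
    """Stream (text, label) rows from a CSV too large to fit in memory."""

    def __init__(self, path="corpus.csv"):
        self.path = path

    def __iter__(self):
        with open(self.path, newline="") as f:
            yield from csv.reader(f)

# The DataLoader consumes the stream; no __len__ or __getitem__ required.
loader = DataLoader(CorpusDataset(), batch_size=32)
for batch in loader:
    ...  # training step
```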

@ssnl ssnl self-assigned this Apr 30, 2019
@ssnl ssnl added the module: dataloader Related to torch.utils.data.DataLoader and Sampler label Apr 30, 2019
@gchanan gchanan added feature A request for a proper, new feature. triaged This issue has been looked at by a team member, and triaged and prioritized into an appropriate module labels May 1, 2019
facebook-github-bot pushed a commit to facebookresearch/ReAgent that referenced this issue Jun 21, 2019
Summary:
This is a modified version of pytorch/pytorch#14705, since the commit structure for that PR is quite messy.

1. Add `IterableDataset`.
2. So we have two data loader modes: `Iterable` and `Map`.

    1. `Iterable` if the `dataset` is an instance of `IterableDataset`
    2. `Map` otherwise

3. Add better support for non-batch loading (i.e., `batch_size=None` and `batch_sampler=None`). This is useful for things like bulk loading.
4. Refactor `DataLoaderIter` into two classes, `_SingleProcessDataLoaderIter` and `_MultiProcessingDataLoaderIter`. Rename some methods to be more generic, e.g., `get_batch` -> `get_data`.
5. Add `torch.utils.data.get_worker_info`, which returns worker information in a worker process (e.g., worker id, dataset object copy, etc.) and can be used in `IterableDataset.__iter__` and `worker_init_fn` to do per-worker configuration.
6. Add `ChainDataset`, the analog of `ConcatDataset` for `IterableDataset`.
7. Import `torch.utils.data` in `torch/__init__.py`.
8. Add data loader examples and documentation.
9. Use `get_worker_info` to detect whether we are in a worker process in `default_collate`.

Closes pytorch/pytorch#17909, pytorch/pytorch#18096, pytorch/pytorch#19946, and some of pytorch/pytorch#13023
Pull Request resolved: pytorch/pytorch#19228

Reviewed By: bddppq

Differential Revision: D15058152

fbshipit-source-id: 9e081a901a071d7e4502b88054a34b450ab5ddde
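
For illustration, a hedged sketch of the per-worker configuration pattern from item 5 above: with `num_workers > 0`, each worker process receives its own copy of the `IterableDataset`, so `get_worker_info` can be used inside `__iter__` to give each worker a disjoint slice of the stream (the line-skipping scheme here is just one possible choice):

```python
import csv
from torch.utils.data import DataLoader, IterableDataset, get_worker_info

class ShardedCorpus(IterableDataset):
    """Stream a large CSV; each DataLoader worker reads every num_workers-th row."""

    def __init__(self, path="corpus.csv"):
        self.path = path

    def __iter__(self):
        info = get_worker_info()  # None in the main process
        num_workers = info.num_workers if info else 1
        worker_id = info.id if info else 0
        with open(self.path, newline="") as f:
            for i, row in enumerate(csv.reader(f)):
                if i % num_workers == worker_id:  # disjoint slice per worker
                    yield row

loader = DataLoader(ShardedCorpus(), batch_size=32, num_workers=4)
```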
pull bot pushed a commit to Pandinosaurus/pytorch that referenced this issue Jun 21, 2019
@ssnl
Collaborator

ssnl commented Jun 21, 2019

closed via #19228

@ssnl ssnl closed this as completed Jun 21, 2019
iotamudelta pushed a commit to ROCm/pytorch that referenced this issue Jun 21, 2019
@rabeehkarimimahabadi

Hi,
Could you give an example of how to use iterable datasets with a distributed sampler so that data can be sampled efficiently across multiple TPU cores? Thanks.
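
The thread does not include an answer, but as an editorial, hedged sketch: a `DistributedSampler` does not apply to an `IterableDataset`, so a common pattern is to shard the stream inside `__iter__` by both the distributed rank and the DataLoader worker id (shown here with `torch.distributed`; on TPU the rank and world size would come from the `torch_xla` runtime instead):

```python
import torch.distributed as dist
from torch.utils.data import IterableDataset, get_worker_info

class DistributedStream(IterableDataset):
    """Shard an unbounded text stream across (process rank x DataLoader worker)."""

    def __init__(self, path="corpus.csv"):
        self.path = path

    def __iter__(self):
        rank = dist.get_rank() if dist.is_initialized() else 0
        world = dist.get_world_size() if dist.is_initialized() else 1
        info = get_worker_info()
        worker_id = info.id if info else 0
        num_workers = info.num_workers if info else 1
        shard = rank * num_workers + worker_id   # this reader's global index
        num_shards = world * num_workers         # total number of readers
        with open(self.path) as f:
            for i, line in enumerate(f):
                if i % num_shards == shard:      # keep a disjoint slice
                    yield line
```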
