
DataLoader for Large Corpus File #19946

Closed
zegzag opened this issue Apr 30, 2019 · 3 comments
Assignees
Labels
feature A request for a proper, new feature. module: dataloader Related to torch.utils.data.DataLoader and Sampler triaged This issue has been looked at by a team member, and triaged and prioritized into an appropriate module

Comments

@zegzag

zegzag commented Apr 30, 2019

🚀 Feature

  A DataLoader for large corpus files, supporting multi-process and multi-threaded loading and memory optimization.

Motivation

  Thanks a lot for the PyTorch framework; I've benefited a great deal from using it in my work. I do, however, have a small point of confusion regarding data loading and processing.
  In NLP use cases, suppose I have a very large labeled corpus file, "corpus.csv", which is too large to load into memory at once, or may even be effectively unbounded. I want to train my model on this dataset from top to bottom in mini-batches: each batch yields a certain number of text lines, from the beginning of the file to the end (so shuffling is not supported). I can do this easily with a Python generator. However, I don't know how to write a subclass of torch.utils.data.Dataset and use torch.utils.data.DataLoader to build a custom dataset on top of "corpus.csv": since the file cannot be loaded into memory in its entirety, I cannot implement __getitem__ and __len__.
  I would be grateful if you could develop a module that handles this case. With only a Python generator, it is too hard for me to write a DataLoader that supports multi-process and multi-threaded loading and memory optimization.
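
For illustration, a minimal sketch of the plain-generator approach described above; the two-column CSV layout, file name, and batch size are assumptions:

```python
import csv

def batch_generator(path="corpus.csv", batch_size=32):
    """Stream rows from a large CSV and yield mini-batches without
    loading the whole file into memory (no shuffling, top to bottom)."""
    batch = []
    with open(path, newline="") as f:
        for row in csv.reader(f):  # e.g. row = [text, label]
            batch.append(row)
            if len(batch) == batch_size:
                yield batch
                batch = []
    if batch:  # final partial batch
        yield batch
```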

Alternatives

  With multiprocessing, one could train the model on the current batch while loading the next batch at the same time.
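
As an illustrative, hedged sketch of that alternative (simplified to a background thread; a multiprocessing variant follows the same pattern), upcoming batches can be prefetched into a bounded queue while the main loop trains on the current one. `batch_generator` refers to the sketch above:

```python
import queue
import threading

def prefetch(batch_iter, max_prefetch=2):
    """Load upcoming batches in a background thread while training proceeds."""
    buf = queue.Queue(maxsize=max_prefetch)
    done = object()  # sentinel marking the end of the stream

    def producer():
        for batch in batch_iter:
            buf.put(batch)
        buf.put(done)

    threading.Thread(target=producer, daemon=True).start()
    while True:
        batch = buf.get()
        if batch is done:
            return
        yield batch

# for batch in prefetch(batch_generator()):
#     train_step(batch)  # hypothetical training step
```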

@ssnl
Collaborator

ssnl commented Apr 30, 2019

This is definitely a valid feature request. In fact, I implemented something called IterableDataset that will be used as an iterable (e.g., generator, data stream) in PyTorch. It is currently being reviewed at #19228 .
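
As a rough sketch of how such an `IterableDataset` could wrap the streaming corpus from the issue (interface as proposed in #19228; the CSV parsing, file name, and batch size are assumptions):

```python
import csv
from torch.utils.data import DataLoader, IterableDataset

class CorpusDataset(IterableDataset):
    """Stream (text, label) rows from a CSV too large to fit in memory."""

    def __init__(self, path="corpus.csv"):
        self.path = path

    def __iter__(self):
        with open(self.path, newline="") as f:
            yield from csv.reader(f)

# The DataLoader consumes the stream; no __len__ or __getitem__ required.
loader = DataLoader(CorpusDataset(), batch_size=32)
for batch in loader:
    ...  # training step
```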

@ssnl ssnl self-assigned this Apr 30, 2019
@ssnl ssnl added the module: dataloader Related to torch.utils.data.DataLoader and Sampler label Apr 30, 2019
@gchanan gchanan added feature A request for a proper, new feature. triaged This issue has been looked at by a team member, and triaged and prioritized into an appropriate module labels May 1, 2019
facebook-github-bot pushed a commit to facebookresearch/ReAgent that referenced this issue Jun 21, 2019
Summary:
This is a modified version of pytorch/pytorch#14705, since the commit structure for that PR is quite messy.

1. Add `IterableDataset`.
2. So we have two data loader modes: `Iterable` and `Map`.

    1. `Iterable` if the `dataset` is an instance of `IterableDataset`
    2. `Map` otherwise

3. Add better support for non-batch loading (i.e., `batch_size=None` and `batch_sampler=None`). This is useful for things like bulk loading.
4. Refactor `DataLoaderIter` into two classes, `_SingleProcessDataLoaderIter` and `_MultiProcessingDataLoaderIter`. Rename some methods to be more generic, e.g., `get_batch` -> `get_data`.
5. Add `torch.utils.data.get_worker_info`, which returns worker information in a worker process (e.g., worker id, dataset object copy, etc.) and can be used in `IterableDataset.__iter__` and `worker_init_fn` to do per-worker configuration.
6. Add `ChainDataset`, the analog of `ConcatDataset` for `IterableDataset`.
7. Import `torch.utils.data` in `torch/__init__.py`.
8. Add data loader examples and documentation.
9. Use `get_worker_info` to detect whether we are in a worker process in `default_collate`.

Closes pytorch/pytorch#17909, pytorch/pytorch#18096, pytorch/pytorch#19946, and some of pytorch/pytorch#13023
Pull Request resolved: pytorch/pytorch#19228

Reviewed By: bddppq

Differential Revision: D15058152

fbshipit-source-id: 9e081a901a071d7e4502b88054a34b450ab5ddde
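
For illustration, a hedged sketch of the per-worker configuration pattern from item 5 above: with `num_workers > 0`, each worker process receives its own copy of the `IterableDataset`, so `get_worker_info` can be used inside `__iter__` to give each worker a disjoint slice of the stream (the line-skipping scheme here is just one possible choice):

```python
import csv
from torch.utils.data import DataLoader, IterableDataset, get_worker_info

class ShardedCorpus(IterableDataset):
    """Stream a large CSV; each DataLoader worker reads every num_workers-th row."""

    def __init__(self, path="corpus.csv"):
        self.path = path

    def __iter__(self):
        info = get_worker_info()  # None in the main process
        num_workers = info.num_workers if info else 1
        worker_id = info.id if info else 0
        with open(self.path, newline="") as f:
            for i, row in enumerate(csv.reader(f)):
                if i % num_workers == worker_id:  # disjoint slice per worker
                    yield row

loader = DataLoader(ShardedCorpus(), batch_size=32, num_workers=4)
```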
pull bot pushed a commit to Pandinosaurus/pytorch that referenced this issue Jun 21, 2019
@ssnl
Collaborator

ssnl commented Jun 21, 2019

closed via #19228

@ssnl ssnl closed this as completed Jun 21, 2019
iotamudelta pushed a commit to ROCm/pytorch that referenced this issue Jun 21, 2019
@rabeehkarimimahabadi

Hi,
Could you give an example of how to use iterable datasets with a distributed sampler so that data can be sampled efficiently across multiple TPU cores? Thanks.
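
The thread does not include an answer, but as an editorial, hedged sketch: a `DistributedSampler` does not apply to an `IterableDataset`, so a common pattern is to shard the stream inside `__iter__` by both the distributed rank and the DataLoader worker id (shown here with `torch.distributed`; on TPU the rank and world size would come from the `torch_xla` runtime instead):

```python
import torch.distributed as dist
from torch.utils.data import IterableDataset, get_worker_info

class DistributedStream(IterableDataset):
    """Shard an unbounded text stream across (process rank x DataLoader worker)."""

    def __init__(self, path="corpus.csv"):
        self.path = path

    def __iter__(self):
        rank = dist.get_rank() if dist.is_initialized() else 0
        world = dist.get_world_size() if dist.is_initialized() else 1
        info = get_worker_info()
        worker_id = info.id if info else 0
        num_workers = info.num_workers if info else 1
        shard = rank * num_workers + worker_id   # this reader's global index
        num_shards = world * num_workers         # total number of readers
        with open(self.path) as f:
            for i, line in enumerate(f):
                if i % num_shards == shard:      # keep a disjoint slice
                    yield line
```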
