
Add IterableDataset #19228

Closed · wants to merge 18 commits from the dl_no_sampler_test branch

Conversation

@ssnl (Collaborator) commented Apr 13, 2019

This is a modified version of #14705, since the commit structure of that PR is quite messy.

  1. Add IterableDataset.

  2. So we have two data loader modes: Iterable and Map.

    1. Iterable if the dataset is an instance of IterableDataset
    2. Map otherwise
  3. Add better support for non-batch loading (i.e., batch_size=None and batch_sampler=None). This is useful for things like bulk loading (see the second sketch below).

  4. Refactor DataLoaderIter into two classes, _SingleProcessDataLoaderIter and _MultiProcessingDataLoaderIter. Rename some methods to be more generic, e.g., get_batch -> get_data.

  5. Add torch.utils.data.get_worker_info, which returns worker information in a worker process (e.g., worker id, dataset object copy, etc.) and can be used in IterableDataset.__iter__ and worker_init_fn to do per-worker configuration (see the first sketch after this list).

  6. Add ChainDataset, which is the analog of ConcatDataset for IterableDataset.

  7. Import torch.utils.data in torch/__init__.py

  8. Add data loader examples and documentation.

  9. Use get_worker_info to detect whether we are in a worker process in default_collate
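
For context, here is a minimal sketch of how items 1, 2, and 5 compose. RangeIterableDataset is a hypothetical dataset written for illustration; the API it uses (IterableDataset, DataLoader, get_worker_info) is the torch.utils.data interface added in this PR, so it assumes a build that includes these changes.

```python
import math
import torch
from torch.utils.data import DataLoader, IterableDataset, get_worker_info


class RangeIterableDataset(IterableDataset):
    """Hypothetical stream over the integers in [start, end)."""

    def __init__(self, start, end):
        super().__init__()
        self.start = start
        self.end = end

    def __iter__(self):
        info = get_worker_info()
        if info is None:
            # Single-process loading: this process iterates the full range.
            iter_start, iter_end = self.start, self.end
        else:
            # Multi-process loading: each worker receives a copy of the dataset,
            # so we use the worker id to carve out a disjoint slice of the range.
            per_worker = int(math.ceil((self.end - self.start) / float(info.num_workers)))
            iter_start = self.start + info.id * per_worker
            iter_end = min(iter_start + per_worker, self.end)
        return iter(range(iter_start, iter_end))


if __name__ == "__main__":
    ds = RangeIterableDataset(start=3, end=7)
    # Because the DataLoader sees an IterableDataset, it runs in the iterable
    # mode: no sampler, each worker simply iterates its own shard.
    print(list(DataLoader(ds, num_workers=2)))  # tensors for 3..6, worker-interleaved
```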

Closes #17909, #18096, #19946, and some of #13023
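
As a companion sketch for items 3 and 6: ChunkStream below is a made-up dataset that yields pre-batched tensors; ChainDataset and batch_size=None are the pieces this PR adds, so again this assumes a build containing them.

```python
import torch
from torch.utils.data import ChainDataset, DataLoader, IterableDataset


class ChunkStream(IterableDataset):
    """Hypothetical stream whose items are already whole batches ("bulk loading")."""

    def __init__(self, num_chunks, chunk_size):
        super().__init__()
        self.num_chunks = num_chunks
        self.chunk_size = chunk_size

    def __iter__(self):
        for _ in range(self.num_chunks):
            # Imagine each chunk being read from disk or a database in one shot.
            yield torch.randn(self.chunk_size, 8)


# ChainDataset lazily concatenates iterable datasets, one after another.
stream = ChainDataset([ChunkStream(2, 32), ChunkStream(3, 32)])

# batch_size=None disables automatic batching: the loader hands each chunk
# through as-is instead of re-collating individual samples into batches.
loader = DataLoader(stream, batch_size=None)
for chunk in loader:
    print(chunk.shape)  # torch.Size([32, 8]), five chunks in total
```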

@ssnl force-pushed the dl_no_sampler_test branch 3 times, most recently from 7c9e030 to 746065b on April 17, 2019 at 00:29.
@facebook-github-bot (Contributor) left a comment:

@pritamdamania87 has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator.

@ssnl (Collaborator, Author) commented Apr 24, 2019

This is currently awaiting (1) #19421 and (2) filling in the docs.

@ssnl force-pushed the dl_no_sampler_test branch 3 times, most recently from 922cfc3 to fc53d79 on April 24, 2019 at 17:23.
Review threads were opened on: docs/source/data.rst, torch/utils/data/_utils/__init__.py, torch/utils/data/dataloader.py, torch/utils/data/_utils/signal_handling.py, torch/utils/data/_utils/worker.py, and torch/utils/data/dataset.py.
@ssnl (Collaborator, Author) commented Apr 29, 2019

@pritamdamania87 doc is done!

@ssnl force-pushed the dl_no_sampler_test branch 2 times, most recently from 42a783d to e50f65f on April 29, 2019 at 15:33.
@pritamdamania87 (Contributor) commented:

@apaszke Soumith mentioned it'd be nice to have you review parts of this PR. I've looked through some of the new functionality added in this PR (IterableDataset), but it would be good if you can take a look at some of the changes to the existing code structure. Thanks!

@facebook-github-bot (Contributor) left a comment:

@pritamdamania87 has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator.

@ssnl (Collaborator, Author) commented Apr 29, 2019

@pritamdamania87 do let me know if this breaks anything internal

@pritamdamania87 (Contributor) commented Apr 30, 2019

> @pritamdamania87 do let me know if this breaks anything internal

Looks like renaming pin_memory_batch breaks a bunch of code that does `from torch.utils.data._utils.pin_memory import pin_memory_batch`. I also see a lot of errors coming from child processes: `TypeError: unhashable type: 'list'`. Not sure what the source of those is, though.

@ssnl (Collaborator, Author) commented Apr 30, 2019

@pritamdamania87 for the first error, I'd prefer fixing the call sites that import private helpers. For the second, it'd be great if you could show or message me the trace, or at least the non-confidential parts. Thanks!

@facebook-github-bot (Contributor) left a comment:

@pritamdamania87 has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator.

pull bot pushed a commit to Pandinosaurus/pytorch that referenced this pull request Jun 21, 2019
Summary: same as the PR description above.
Pull Request resolved: pytorch#19228

Reviewed By: bddppq

Differential Revision: D15058152

fbshipit-source-id: 9e081a901a071d7e4502b88054a34b450ab5ddde
@ssnl deleted the dl_no_sampler_test branch on June 21, 2019 at 03:27.
iotamudelta pushed a commit to ROCm/pytorch that referenced this pull request Jun 21, 2019
(Commit summary and metadata identical to the pull bot commit above.)
@bethunebtj commented:

Seems like the PR is finished? Good job!
Btw, when will this be released? I'm looking forward to this. Again, thanks a lot!

@dzhulgakov (Collaborator) commented:

@bethunebtj - we're targeting the 1.2 release at the end of July / early August, hopefully. But you can always use the nightlies :)

@rfalcon100 commented:

Is this functionality available in nightly?

@soumith (Member) commented Jul 18, 2019

@rfalcon100 yes

facebook-github-bot pushed a commit that referenced this pull request Oct 8, 2019
Summary:
Back in April, malmaud added type annotations for `dataloader.py`. At about the same time, SsnL in #19228 replaced `_DataLoaderIter` with `_BaseDataLoaderIter` and two subclasses, `_SingleProcessDataLoaderIter` and `_MultiProcessingDataLoaderIter`. Probably because these changes happened in parallel, the type stubs and several other references in the codebase were never updated to match the refactoring.

I've gone ahead and updated them to reflect the refactoring in #19228, which fixes the specific type stub/implementation mismatch pointed out in #26673, although not the broader problem that pytorch doesn't have a test to make sure that the `.pyi` type stub files match the real API defined in `.py` files.
Pull Request resolved: #27105

Differential Revision: D17813641

Pulled By: ezyang

fbshipit-source-id: ed7ac025c8d6ad3f298dd073347ec83bb4b6600c
thiagocrepaldi pushed a commit to thiagocrepaldi/pytorch that referenced this pull request Feb 4, 2020
(Commit summary and metadata identical to the facebook-github-bot commit above.)
@kuraga (Contributor) commented Mar 21, 2024

Labels
module: dataloader (Related to torch.utils.data.DataLoader and Sampler)
module: docs (Related to our documentation, both in docs/ and docblocks)
module: internals (Related to internal abstractions in c10 and ATen)
module: multiprocessing (Related to torch.multiprocessing)
open source
Projects
None yet
Development

Successfully merging this pull request may close these issues.

custome collate function cannot get _use_shared_memory