
Option to run dataloader on single process for distributed training #2935

Closed
davidkartchner opened this issue Aug 12, 2020 · 4 comments
Labels: question (Further information is requested)

Comments


davidkartchner commented Aug 12, 2020

What is your question?

Is there a way to run data loading on a single process for DDP distributed training? As is, pytorch-lightning creates a separate dataloader on each process. That's fine with a normal dataloader that uses a distributed sampler, but I am using an IterableDataset, so my batches are duplicated on every GPU, rendering multi-GPU training useless. I would like to be able to sequentially submit batches to each GPU so that they can all simply consume batches from the same IterableDataset. The documentation for DistributedDataParallel indicates that this is supported behavior.
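
For concreteness, a minimal sketch of the setup being described (illustrative names and file path, not the actual training code): the same IterableDataset is instantiated in every DDP process, so without any rank-aware sharding each GPU iterates over identical batches.

```python
from torch.utils.data import DataLoader, IterableDataset


class CorpusStream(IterableDataset):
    """Illustrative stand-in for a streaming text corpus."""

    def __init__(self, path):
        self.path = path

    def __iter__(self):
        # Under DDP, every process constructs and iterates this same dataset,
        # so without rank-aware filtering each GPU sees identical batches.
        with open(self.path) as f:
            for line in f:
                yield line.rstrip("\n")


# Lightning builds one such dataloader per DDP process.
loader = DataLoader(CorpusStream("corpus.txt"), batch_size=32)
```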

What have you tried?

I have tried all of the different data utilities offered in pytorch-lightning, including the new LightningDataModule in the latest release. All of them seem to duplicate the dataloader on each GPU, leading to duplicate batches.

What's your environment?

  • OS: Linux
  • Packaging: pip
  • Version: 0.9.0rc2
davidkartchner added the question label on Aug 12, 2020
@github-actions (Contributor)

Hi! Thanks for your contribution, great first issue!

@justusschock (Member)

@davidkartchner Actually, I'm not sure this is possible; at least, I'm not aware of any way to do it.

What Lightning currently does is add a distributed sampler, so the dataset itself is the same in every process, but each process's sampler draws from a different set of indices than the other processes, which should also yield unique batches.
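
For a map-style dataset, that mechanism looks roughly like this sketch (rank and num_replicas are hard-coded here for illustration; in real DDP training they come from the process group):

```python
import torch
from torch.utils.data import DataLoader, DistributedSampler, TensorDataset

# A map-style dataset: random index access is available.
dataset = TensorDataset(torch.arange(1000))

# Each process gets a disjoint slice of the indices, so batches differ per GPU.
sampler = DistributedSampler(dataset, num_replicas=2, rank=0, shuffle=True)
loader = DataLoader(dataset, batch_size=32, sampler=sampler)
```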

@davidkartchner (Author)

@justusschock Thanks for the reply. I had hoped that this would work in my case, but a distributed sampler doesn't work with an IterableDataset, as noted in the PyTorch Lightning docs on multi-GPU training. This is because instances of IterableDataset don't allow random access; they just return an iterator over the data. The main use case for iterable datasets is reading streaming data, e.g. reading line by line from a file, where random index access isn't available. Distributed sampling with IterableDatasets is currently an open development issue in PyTorch (see #28743).

In my case this matters because my text corpus is too large to load into memory, particularly if it's duplicated 16 times. Adding this functionality would be very beneficial for NLP applications, especially training language models.
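
One commonly suggested workaround for this limitation (not a built-in Lightning feature, just a sketch of the pattern) is to make the IterableDataset itself rank-aware, so each process only yields its own share of the stream:

```python
import itertools

import torch.distributed as dist
from torch.utils.data import IterableDataset


class ShardedLineStream(IterableDataset):
    """Streams lines from a file, yielding only the current rank's share."""

    def __init__(self, path):
        self.path = path

    def __iter__(self):
        if dist.is_available() and dist.is_initialized():
            rank, world_size = dist.get_rank(), dist.get_world_size()
        else:
            rank, world_size = 0, 1
        with open(self.path) as f:
            # Rank r keeps lines r, r + world_size, r + 2 * world_size, ...
            yield from itertools.islice(f, rank, None, world_size)
```

If num_workers > 0, the lines kept by a rank would additionally need to be split across workers using torch.utils.data.get_worker_info().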

@davidkartchner (Author)

I managed to fix this issue by building an index of the character offsets of each line in my training data file. This lets me read any arbitrary line of the file in constant time while still reading only a single line at a time, so the memory footprint stays as small as an IterableDataset's.
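
A minimal sketch of that offset-index approach (my own illustrative code, not the exact implementation referenced in this issue): record the byte offset of every line once, then seek() to the requested line in __getitem__, so the file behaves like a map-style dataset.

```python
from torch.utils.data import Dataset


class OffsetIndexedTextDataset(Dataset):
    """Map-style view of a large text file: O(1) line access, small memory footprint."""

    def __init__(self, path):
        self.path = path
        self.offsets = []
        offset = 0
        # One pass over the file to record the byte offset where each line starts.
        with open(path, "rb") as f:
            for line in f:
                self.offsets.append(offset)
                offset += len(line)

    def __len__(self):
        return len(self.offsets)

    def __getitem__(self, idx):
        # Seek straight to the start of line `idx` and read only that line.
        with open(self.path, "rb") as f:
            f.seek(self.offsets[idx])
            return f.readline().decode("utf-8").rstrip("\n")
```

Because this is a map-style dataset, the DistributedSampler that Lightning adds automatically can shard it across GPUs without duplicating batches.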

If anyone else runs into this problem, the following resources had helpful guidelines/code:

Closing this for now.
