Option to run dataloader on single process for distributed training #2935
Comments
Hi! Thanks for your contribution, and great first issue!
@davidkartchner Actually, I'm not sure this is possible. At least I'm not aware of any way to do it. What we currently do in Lightning is add a distributed sampler, so that the dataset is the same in every process but the set of indices the sampler draws from differs across processes, which yields unique batches per process.
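The per-process index splitting described above (the core idea behind torch.utils.data.DistributedSampler) can be sketched in plain Python. `shard_indices` is a hypothetical helper for illustration, not a Lightning or PyTorch API:

```python
def shard_indices(dataset_len, num_replicas, rank):
    """Return the interleaved slice of indices assigned to one process.

    Mirrors the core idea of torch.utils.data.DistributedSampler:
    every process sees the same index list but keeps only every
    num_replicas-th entry, starting at its own rank. Indices are
    padded (by wrapping around) so every rank gets an equal count.
    """
    indices = list(range(dataset_len))
    total = -(-dataset_len // num_replicas) * num_replicas  # ceil division
    indices += indices[: total - dataset_len]               # pad by wrapping
    return indices[rank:total:num_replicas]

# Two GPUs, ten samples: disjoint, interleaved shards.
print(shard_indices(10, num_replicas=2, rank=0))  # [0, 2, 4, 6, 8]
print(shard_indices(10, num_replicas=2, rank=1))  # [1, 3, 5, 7, 9]
```

Because every rank computes the same padded index list deterministically, no inter-process communication is needed to keep the shards disjoint.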
@justusschock Thanks for the reply. I had hoped this would work in my case, but the distributed sampler doesn't work with an IterableDataset, as noted in the Pytorch-Lightning docs on Multi-GPU training. This is because instances of IterableDataset don't allow random access; they only return an iterator over the data. The main use case for iterable datasets is reading streaming data, e.g. reading a file line by line, where random index access isn't available. Distributed sampling with IterableDatasets is currently an open development issue in pytorch (see #28743). In my case this matters because my text corpus is too large to load into memory, particularly if it's duplicated 16 times. Adding this functionality would be very beneficial for NLP applications, especially training language models.
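Until #28743 lands, one common workaround is to make the IterableDataset itself rank-aware: every process streams the whole file but yields only every world_size-th line, offset by its rank. This is a hedged sketch, not a Lightning feature; `stream_shard` is a hypothetical helper, and in real use the rank and world size would come from torch.distributed.get_rank() / get_world_size():

```python
def stream_shard(lines, world_size, rank):
    """Yield only this process's share of a line stream.

    `lines` stands in for a file iterator (e.g. an open text file).
    Memory stays constant: only one line is held at a time, and the
    resulting shards are disjoint across ranks.
    """
    for i, line in enumerate(lines):
        if i % world_size == rank:
            yield line

corpus = ["sent %d" % i for i in range(6)]
print(list(stream_shard(corpus, world_size=2, rank=0)))  # ['sent 0', 'sent 2', 'sent 4']
print(list(stream_shard(corpus, world_size=2, rank=1)))  # ['sent 1', 'sent 3', 'sent 5']
```

The trade-off is that every process still reads and skips the full stream, so I/O is not reduced, only memory and duplicate batches.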
I managed to fix this issue by creating an index of the character offsets of each line in my training data file. It lets me read any arbitrary line in constant time while still reading only a single line at a time, keeping the memory footprint of an IterableDataset. If anyone else runs into this problem, the following had helpful guidelines/code: Closing this for now.
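The character-offset trick described above can be sketched as follows (the helper names are hypothetical): scan the file once, recording `tell()` at the start of each line, then fetch any line with a single `seek()`:

```python
import os
import tempfile

def build_line_index(path):
    """One pass over the file: record the byte offset of every line start."""
    offsets = []
    with open(path, "rb") as f:
        while True:
            offsets.append(f.tell())
            if not f.readline():
                offsets.pop()  # drop the offset recorded past EOF
                break
    return offsets

def read_line(path, offsets, i):
    """Random access to line i in constant time, one line in memory."""
    with open(path, "rb") as f:
        f.seek(offsets[i])
        return f.readline().decode().rstrip("\n")

# Demo on a throwaway file.
fd, path = tempfile.mkstemp()
with os.fdopen(fd, "w") as f:
    f.write("alpha\nbeta\ngamma\n")
idx = build_line_index(path)
print(read_line(path, idx, 2))  # gamma
os.remove(path)
```

With random access restored, the corpus can back a map-style Dataset again, so the stock DistributedSampler works and the streaming workaround is no longer needed.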
What is your question?
Is there a way to run dataloading on a single process for DDP distributed training? As is, pytorch-lightning creates a different dataloader on each process. While this is fine with a normal dataloader that uses a distributed sampler, I am using an IterableDataset, and my batches are duplicated on every GPU, rendering multi-GPU training useless. I would like to sequentially submit batches to each GPU so that they can all simply use batches from the same IterableDataset. Documentation for DistributedDataParallel indicates that this is supported behavior.
What have you tried?
I have tried all of the different data utilities offered in pytorch-lightning, including the new LightningDataModule in the latest release. All of them seem to duplicate the dataloader on each GPU, leading to duplicate batches.
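The duplication is easy to reproduce in miniature: when each process builds its own loader from the same iterable factory with no rank-aware sharding, every process consumes identical batches. This torch-free sketch only illustrates the symptom; the names are hypothetical:

```python
def make_stream():
    """Stand-in for an IterableDataset: each call returns a fresh iterator."""
    return iter(["batch-a", "batch-b", "batch-c"])

# DDP spawns one dataloader per process; without rank-aware sharding,
# both "GPUs" iterate the same stream from the beginning.
gpu0 = list(make_stream())
gpu1 = list(make_stream())
print(gpu0 == gpu1)  # True: every batch is duplicated across processes
```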
What's your environment?