
Option to run dataloader on single process for distributed training #2935

Closed
davidkartchner opened this issue Aug 12, 2020 · 4 comments
Labels: question (Further information is requested)

Comments


davidkartchner commented Aug 12, 2020

What is your question?

Is there a way to run data loading on a single process for DDP distributed training? As is, pytorch-lightning creates a separate dataloader on each process. That's fine with a normal dataloader that uses a distributed sampler, but I am using an IterableDataset, so my batches are duplicated on every GPU, rendering multi-GPU training useless. I would like to be able to sequentially submit batches to each GPU so that they can all simply consume batches from the same IterableDataset. The documentation for DistributedDataParallel indicates that this is supported behavior.
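
For concreteness, a minimal sketch of the setup being described (illustrative names and file path, not the actual training code): the same IterableDataset is instantiated in every DDP process, so without any rank-aware sharding each GPU iterates over identical batches.

```python
from torch.utils.data import DataLoader, IterableDataset


class CorpusStream(IterableDataset):
    """Illustrative stand-in for a streaming text corpus."""

    def __init__(self, path):
        self.path = path

    def __iter__(self):
        # Under DDP, every process constructs and iterates this same dataset,
        # so without rank-aware filtering each GPU sees identical batches.
        with open(self.path) as f:
            for line in f:
                yield line.rstrip("\n")


# Lightning builds one such dataloader per DDP process.
loader = DataLoader(CorpusStream("corpus.txt"), batch_size=32)
```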

What have you tried?

I have tried all of the different data utilities offered in pytorch-lightning, including the new LightningDataModule in the latest release. All of them seem to duplicate the dataloader on each GPU, leading to duplicate batches.

What's your environment?

  • OS: Linux
  • Packaging: pip
  • Version: 0.9.0rc2
davidkartchner added the question label on Aug 12, 2020
@github-actions (Contributor)

Hi! Thanks for your contribution, great first issue!

@justusschock (Member)

@davidkartchner Actually, I'm not sure this is possible; at least, I'm not aware of any way to do it.

What Lightning currently does is add a distributed sampler, so the dataset itself is the same in every process, but each process's sampler draws from a different set of indices than the other processes, which should also yield unique batches.
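
For a map-style dataset, that mechanism looks roughly like this sketch (rank and num_replicas are hard-coded here for illustration; in real DDP training they come from the process group):

```python
import torch
from torch.utils.data import DataLoader, DistributedSampler, TensorDataset

# A map-style dataset: random index access is available.
dataset = TensorDataset(torch.arange(1000))

# Each process gets a disjoint slice of the indices, so batches differ per GPU.
sampler = DistributedSampler(dataset, num_replicas=2, rank=0, shuffle=True)
loader = DataLoader(dataset, batch_size=32, sampler=sampler)
```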

@davidkartchner (Author)

@justusschock Thanks for the reply. I had hoped that this would work in my case, but a distributed sampler doesn't work with an IterableDataset, as noted in the PyTorch Lightning docs on multi-GPU training. This is because instances of IterableDataset don't allow random access; they just return an iterator over the data. The main use case for iterable datasets is reading streaming data, e.g. reading line by line from a file, where random index access isn't available. Distributed sampling with IterableDatasets is currently an open development issue in PyTorch (see #28743).

In my case this matters because my text corpus is too large to load into memory, particularly if it's duplicated 16 times. Adding this functionality would be very beneficial for NLP applications, especially training language models.
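
One commonly suggested workaround for this limitation (not a built-in Lightning feature, just a sketch of the pattern) is to make the IterableDataset itself rank-aware, so each process only yields its own share of the stream:

```python
import itertools

import torch.distributed as dist
from torch.utils.data import IterableDataset


class ShardedLineStream(IterableDataset):
    """Streams lines from a file, yielding only the current rank's share."""

    def __init__(self, path):
        self.path = path

    def __iter__(self):
        if dist.is_available() and dist.is_initialized():
            rank, world_size = dist.get_rank(), dist.get_world_size()
        else:
            rank, world_size = 0, 1
        with open(self.path) as f:
            # Rank r keeps lines r, r + world_size, r + 2 * world_size, ...
            yield from itertools.islice(f, rank, None, world_size)
```

If num_workers > 0, the lines kept by a rank would additionally need to be split across workers using torch.utils.data.get_worker_info().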

@davidkartchner (Author)

I managed to fix this issue by building an index of the character offsets of each line in my training data file. This lets me read any arbitrary line of the file in constant time while still reading only a single line at a time, so the memory footprint stays as small as an IterableDataset's.
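
A minimal sketch of that offset-index approach (my own illustrative code, not the exact implementation referenced in this issue): record the byte offset of every line once, then seek() to the requested line in __getitem__, so the file behaves like a map-style dataset.

```python
from torch.utils.data import Dataset


class OffsetIndexedTextDataset(Dataset):
    """Map-style view of a large text file: O(1) line access, small memory footprint."""

    def __init__(self, path):
        self.path = path
        self.offsets = []
        offset = 0
        # One pass over the file to record the byte offset where each line starts.
        with open(path, "rb") as f:
            for line in f:
                self.offsets.append(offset)
                offset += len(line)

    def __len__(self):
        return len(self.offsets)

    def __getitem__(self, idx):
        # Seek straight to the start of line `idx` and read only that line.
        with open(self.path, "rb") as f:
            f.seek(self.offsets[idx])
            return f.readline().decode("utf-8").rstrip("\n")
```

Because this is a map-style dataset, the DistributedSampler that Lightning adds automatically can shard it across GPUs without duplicating batches.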

If anyone else runs into this problem, the following resources had helpful guidelines/code:

Closing this for now.
