
MultiGPU Support #211

Closed
alugupta opened this issue Jan 10, 2018 · 15 comments

@alugupta

Hi,

I was wondering if anyone had tried using multiple GPUs with the DeepSpeech models and what their experience was. Currently I am seeing that there is little difference in training time between using 1 and 2 GPUs (maybe a 10% improvement, if that). When running nvidia-smi I can see multiple GPUs being used, so that is not the problem (DataParallel handles this automatically).

Is there something I should look out for in terms of multi-GPU training? I did increase the batch size when running on multiple GPUs so that the utilization of each GPU is comparable to using 1 GPU in isolation.

Thanks!

Udit
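
For reference, a minimal sketch of the kind of nn.DataParallel setup described above; the model and sizes here are placeholders, not deepspeech.pytorch's actual code:

```python
# Minimal sketch of an nn.DataParallel setup (placeholder model and sizes,
# not the repo's actual training code).
import torch
import torch.nn as nn

model = nn.Sequential(nn.Linear(161, 512), nn.ReLU(), nn.Linear(512, 29))
if torch.cuda.device_count() > 1:
    # Replicates the model onto each GPU, splits each batch along dim 0,
    # and gathers outputs back on GPU 0; the replication and gradient
    # reduction are driven by Python threads in a single process.
    model = nn.DataParallel(model)
model = model.cuda()

x = torch.randn(40, 161).cuda()  # e.g. a batch of 40 = 20 per GPU on 2 GPUs
out = model(x)                   # shape (40, 29), gathered on GPU 0
```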

@SeanNaren
Owner

I've also seen the same behavior and am unsure why this is the case; hopefully I'll soon get more time to try to figure this out...

@ryanleary
Collaborator

Relevant: OpenNMT/OpenNMT-py#89 (comment)

Should probably experiment with DistributedDataParallel.
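
A hedged sketch of the DistributedDataParallel pattern being suggested, using the current PyTorch API rather than the 0.3-era one: one process per GPU, NCCL backend, model and sizes are placeholders.

```python
# Sketch of per-process DistributedDataParallel training (one process per GPU).
# Addresses, ports, and the toy model are illustrative assumptions.
import os
import torch
import torch.distributed as dist
import torch.multiprocessing as mp
import torch.nn as nn
from torch.nn.parallel import DistributedDataParallel as DDP

def train(rank, world_size):
    os.environ.setdefault("MASTER_ADDR", "127.0.0.1")
    os.environ.setdefault("MASTER_PORT", "29500")
    dist.init_process_group("nccl", rank=rank, world_size=world_size)
    torch.cuda.set_device(rank)

    model = nn.Linear(161, 29).cuda(rank)
    model = DDP(model, device_ids=[rank])  # gradients are all-reduced across processes

    # ... per-process training loop would go here ...

    dist.destroy_process_group()

if __name__ == "__main__":
    world_size = torch.cuda.device_count()
    mp.spawn(train, args=(world_size,), nprocs=world_size)
```

Unlike DataParallel, each process owns one GPU, so there is no single-process GIL bottleneck.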

@SeanNaren
Owner

SeanNaren commented Jan 16, 2018

@ryanleary this is concerning... I'll try to mimic fairseq's implementation and then do benchmark runs. I'm not sure why the hell DataParallel uses threads...

@alugupta I noticed you made a post here some time ago, where you say:

it takes 18 minutes and 6 seconds on 2 GPUs and 18 minutes and 34 seconds on 1 GPU (70 epochs on the smaller an4 dataset). For 1 GPU we used a batchsize of 20 whereas for 2 GPUs we used a batchsize of 40.

Isn't that an acceptable speed increase? Doubling the batch size kept the speeds consistent when using multiple GPUs?

@ryanleary
Collaborator

No, not unless he ran for double the number of epochs in the latter case.

@SeanNaren
Owner

oh yeah my bad, I interpreted that wrong, thanks @ryanleary

@alugupta
Author

Right, what @ryanleary said :) I didn't double the number of epochs in the latter case so there was effectively no speedup (reduction in training time).

I've only tried with 2 GPUs so far, perhaps it scales once you get to 4 or 8 GPUs. Will try to give this a spin soon!

@ryanleary
Collaborator

I suspect that if it's the same speed with 2, it'll be as slow or worse (because of GIL contention) with 4 or more.

@SeanNaren
Owner

SeanNaren commented Jan 23, 2018

I'm not seeing this issue anymore using PyTorch 0.3 and CUDA 9 on a G3 instance from AWS with 2 GPUs:

AN4 epoch times:

GPUs  Batch Size  Epoch Time (s)
1     20          51
2     40          30

Will try scaling up further and check if the benefits disappear

SeanNaren reopened this Jan 23, 2018
@SeanNaren
Owner

So upon further benchmarking, I'm sure the results you got were due to AN4 being very small. On larger datasets I do see scaling, albeit not as fast as I'd like. I'm going to keep this ticket open because I want to provide benchmarks around V100s using NCCL2, etc.

@alugupta
Author

Hi!

Thanks for this. I'll try to run some experiments on the larger datasets and see if I can see some scaling.
Just curious, but how much speedup did you see on the larger datasets? Also, did you increase the number of data workers? I've been trying either 1, 2, or 4 times the number of GPUs.

Thanks!
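
For reference, a hypothetical sketch of scaling DataLoader workers with GPU count; the dataset and sizes are placeholders, and the 1x/2x/4x multipliers mirror the values being tried above.

```python
# Illustrative only: scale DataLoader workers and global batch size with GPU count.
import torch
from torch.utils.data import DataLoader, TensorDataset

num_gpus = max(torch.cuda.device_count(), 1)
workers_per_gpu = 2  # try 1, 2, or 4

dataset = TensorDataset(torch.randn(1024, 161))  # placeholder data
loader = DataLoader(
    dataset,
    batch_size=20 * num_gpus,               # scale the global batch with GPU count
    num_workers=workers_per_gpu * num_gpus,  # more loader processes as GPUs increase
    pin_memory=True,
)
```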

@SeanNaren
Owner

SeanNaren commented Feb 2, 2018

Did some benchmarking on librispeech_clean_100 (100 hours of LibriSpeech), using the single-GPU epoch time as the baseline to compare 2/4/8-GPU times. I used PyTorch 0.3 with CUDA 9.1.

Using the distributed branch, I start N train scripts, one per GPU.

Below are the graphs using data parallel, and then distributed data parallel on the distributed branch:

[Screenshots: epoch-time scaling graphs for data parallel and for distributed data parallel on 1/2/4/8 GPUs]

From this it's clear that to get the speedup on p3.16xlarge instances (V100 cards) we need to use distributed data parallel. If you have any more thoughts on this, please let me know!

If someone knows of a nice way to launch N copies of the training script automatically, please let me know, since this is needed for distributed PyTorch to work.
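
A rough launcher sketch in the spirit of what's being asked for: start one copy of the training script per GPU and pass each copy its rank. The script name and flags below are hypothetical, not the repo's actual CLI.

```python
# Hypothetical launcher: one training process per visible GPU.
import subprocess
import sys
import torch

world_size = torch.cuda.device_count()
procs = [
    subprocess.Popen([
        sys.executable, "train.py",        # placeholder script name
        "--rank", str(rank),               # placeholder flags; each process gets its rank
        "--world-size", str(world_size),
    ])
    for rank in range(world_size)
]
for p in procs:
    p.wait()  # block until every training process exits
```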

@apaszke

apaszke commented Feb 2, 2018

It's a known fact; NVIDIA has already published numbers similar to those (although theirs are worse, likely because they used 0.2, and a lot has been improved since then). This is a nice starting point (also courtesy of NVIDIA) for a script that lets you start multiple DDP processes quite easily. We're planning on integrating a similar version into mainline PyTorch.

@xhzhao

xhzhao commented Feb 12, 2018

@SeanNaren I'd like to share the 8x P100 scalability data with the default model on librispeech_clean_100; I got the following result:

DS2 Training Performance  8xP100-bs32  P100-bs4
IterAverage (s)           1.17         0.60
EpochTime (hour)          0.29         1.19
TTT (hour)                4.36         17.90

From this performance data we get about 51.3% scaling efficiency with P100s, which matches your p3.16xlarge result.
BTW, how could I get better scalability with DDP? Just replace "nn.DataParallel" with "nn.DistributedDataParallel"?

@SeanNaren
Owner

Hey @xhzhao, there is a branch called distributed; using the multiproc.py script you can scale training onto all GPUs, with a separate process per GPU. I'm currently away from my PC for a few days; once I'm back I can give better instructions!

@SeanNaren
Owner

I've just merged a branch using the distributed wrapper for multi-GPU. Not sure if you're still using the package, but @alugupta it would be nice for you to retry! Again, AN4 is a small dataset; I would suggest something like LibriSpeech for a nice comparison.
