
MultiGPU Support #211

Closed
alugupta opened this issue Jan 10, 2018 · 15 comments

@alugupta

Hi,

I was wondering if anyone had tried using multiple GPUs with the DeepSpeech models and what their experience was. Currently I am seeing that there is little difference in training time between using 1 and 2 GPUs (maybe a 10% improvement, if that). When running nvidia-smi I can see multiple GPUs being used, so that is not the problem (DataParallel handles this automatically).

Is there something I should look out for in terms of multi-GPU training? I did increase the batch size when running on multiple GPUs so that the utilization of each GPU is comparable to using 1 GPU in isolation.

Thanks!

Udit
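
For reference, a minimal sketch of the kind of nn.DataParallel setup described above; the model and sizes here are placeholders, not deepspeech.pytorch's actual code:

```python
# Minimal sketch of an nn.DataParallel setup (placeholder model and sizes,
# not the repo's actual training code).
import torch
import torch.nn as nn

model = nn.Sequential(nn.Linear(161, 512), nn.ReLU(), nn.Linear(512, 29))
if torch.cuda.device_count() > 1:
    # Replicates the model onto each GPU, splits each batch along dim 0,
    # and gathers outputs back on GPU 0; the replication and gradient
    # reduction are driven by Python threads in a single process.
    model = nn.DataParallel(model)
model = model.cuda()

x = torch.randn(40, 161).cuda()  # e.g. a batch of 40 = 20 per GPU on 2 GPUs
out = model(x)                   # shape (40, 29), gathered on GPU 0
```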

@SeanNaren
Owner

I've also seen the same behavior and am unsure why this is the case; hopefully I'll soon get more time to try to figure this out...

@ryanleary
Collaborator

Relevant: OpenNMT/OpenNMT-py#89 (comment)

Should probably experiment with DistributedDataParallel.
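
A hedged sketch of the DistributedDataParallel pattern being suggested, using the current PyTorch API rather than the 0.3-era one: one process per GPU, NCCL backend, model and sizes are placeholders.

```python
# Sketch of per-process DistributedDataParallel training (one process per GPU).
# Addresses, ports, and the toy model are illustrative assumptions.
import os
import torch
import torch.distributed as dist
import torch.multiprocessing as mp
import torch.nn as nn
from torch.nn.parallel import DistributedDataParallel as DDP

def train(rank, world_size):
    os.environ.setdefault("MASTER_ADDR", "127.0.0.1")
    os.environ.setdefault("MASTER_PORT", "29500")
    dist.init_process_group("nccl", rank=rank, world_size=world_size)
    torch.cuda.set_device(rank)

    model = nn.Linear(161, 29).cuda(rank)
    model = DDP(model, device_ids=[rank])  # gradients are all-reduced across processes

    # ... per-process training loop would go here ...

    dist.destroy_process_group()

if __name__ == "__main__":
    world_size = torch.cuda.device_count()
    mp.spawn(train, args=(world_size,), nprocs=world_size)
```

Unlike DataParallel, each process owns one GPU, so there is no single-process GIL bottleneck.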

@SeanNaren
Owner

SeanNaren commented Jan 16, 2018

@ryanleary this is concerning... I'll try to mimic fairseq's implementation and then do benchmark runs. I'm not sure why the hell DataParallel uses threads...

@alugupta I noticed you made a post here some time ago, where you say:

it takes 18 minutes and 6 seconds on 2 GPUs and 18 minutes and 34 seconds on 1 GPU (70 epochs on the smaller an4 dataset). For 1 GPU we used a batchsize of 20 whereas for 2 GPUs we used a batchsize of 40.

Isn't that an acceptable speed increase? Doubling the batch size kept the speeds consistent when using multiple GPUs?

@ryanleary
Collaborator

No, not unless he ran for double the number of epochs in the latter case.

@SeanNaren
Owner

oh yeah my bad, I interpreted that wrong, thanks @ryanleary

@alugupta
Author

Right, what @ryanleary said :) I didn't double the number of epochs in the latter case so there was effectively no speedup (reduction in training time).

I've only tried with 2 GPUs so far, perhaps it scales once you get to 4 or 8 GPUs. Will try to give this a spin soon!

@ryanleary
Collaborator

I suspect that if it's the same speed with 2, it'll be as slow or worse (because of GIL contention) with 4 or more.

@SeanNaren
Owner

SeanNaren commented Jan 23, 2018

I'm not seeing this issue anymore using PyTorch 0.3 and CUDA 9 on a G3 instance from AWS with 2 GPUs:

AN4 epoch times:

GPUs  Batch Size  Epoch Time (s)
1     20          51
2     40          30

Will try scaling up further and check if the benefits disappear

SeanNaren reopened this Jan 23, 2018
@SeanNaren
Owner

So upon further benchmarking, I'm sure the results you got were due to AN4 being very small. On larger datasets I do see scaling, albeit not as fast as I'd like. I'm going to keep this ticket open because I want to provide benchmarks around V100s using NCCL2, etc.

@alugupta
Author

Hi!

Thanks for this. I'll try to run some experiments on the larger datasets and see if I can see some scaling.
Just curious, but how much speedup did you see on the larger datasets? Also, did you increase the number of data workers? I've been trying either 1, 2, or 4 times the number of GPUs.

Thanks!
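
For reference, a hypothetical sketch of scaling DataLoader workers with GPU count; the dataset and sizes are placeholders, and the 1x/2x/4x multipliers mirror the values being tried above.

```python
# Illustrative only: scale DataLoader workers and global batch size with GPU count.
import torch
from torch.utils.data import DataLoader, TensorDataset

num_gpus = max(torch.cuda.device_count(), 1)
workers_per_gpu = 2  # try 1, 2, or 4

dataset = TensorDataset(torch.randn(1024, 161))  # placeholder data
loader = DataLoader(
    dataset,
    batch_size=20 * num_gpus,               # scale the global batch with GPU count
    num_workers=workers_per_gpu * num_gpus,  # more loader processes as GPUs increase
    pin_memory=True,
)
```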

@SeanNaren
Owner

SeanNaren commented Feb 2, 2018

Did some benchmarking on librispeech_clean_100 (100 hours of LibriSpeech), using the single-GPU epoch time as the baseline to compare 2/4/8-GPU times. I used PyTorch 0.3 with CUDA 9.1.

Using the distributed branch, I start N train scripts, one per GPU.

Below are the graphs using data parallel, and then distributed data parallel on the distributed branch:

[Screenshots: epoch-time scaling graphs for data parallel and for distributed data parallel on 1/2/4/8 GPUs]

From this it's clear that to get the speedup on p3.16xlarge instances (V100 cards) we need to use distributed data parallel. If you have any more thoughts on this, please let me know!

If someone knows of a nice way to launch N copies of the training script automatically, please let me know, since this is needed for distributed PyTorch to work.
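
A rough launcher sketch in the spirit of what's being asked for: start one copy of the training script per GPU and pass each copy its rank. The script name and flags below are hypothetical, not the repo's actual CLI.

```python
# Hypothetical launcher: one training process per visible GPU.
import subprocess
import sys
import torch

world_size = torch.cuda.device_count()
procs = [
    subprocess.Popen([
        sys.executable, "train.py",        # placeholder script name
        "--rank", str(rank),               # placeholder flags; each process gets its rank
        "--world-size", str(world_size),
    ])
    for rank in range(world_size)
]
for p in procs:
    p.wait()  # block until every training process exits
```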

@apaszke

apaszke commented Feb 2, 2018

It's a known fact; NVIDIA has already published numbers similar to those (although theirs are worse, likely because they used 0.2, and a lot has been improved since then). This is a nice starting point (also courtesy of NVIDIA) for a script that lets you start multiple DDP processes quite easily. We're planning on integrating a similar version into mainline PyTorch.

@xhzhao

xhzhao commented Feb 12, 2018

@SeanNaren I'd like to share the 8x P100 scalability data with the default model on librispeech_clean_100; I got the following result:

DS2 Training Performance  8xP100-bs32  P100-bs4
IterAverage (s)           1.17         0.60
EpochTime (hour)          0.29         1.19
TTT (hour)                4.36         17.90

From this performance data we get about 51.3% scaling efficiency with P100s, which matches your p3.16xlarge result.
BTW, how could I get better scalability with DDP? Just replace "nn.DataParallel" with "nn.DistributedDataParallel"?

@SeanNaren
Owner

Hey @xhzhao, there is a branch called distributed; using the multiproc.py script you can scale training onto all GPUs, with a separate process per GPU. I'm currently away from my PC for a few days; once I'm back I can give better instructions!

@SeanNaren
Owner

I've just merged a branch using the distributed wrapper for multi-GPU. Not sure if you're still using the package, but @alugupta it would be nice for you to retry! Again, AN4 is a small dataset; I would suggest something like LibriSpeech for a nice comparison.
