[Dependency Update] Upgrade cuDNN & NCCL #14988

stu1130 · 2019-05-17T18:14:48Z

Description

Since CI has been upgraded to cuDNN 7.5.1 (#14950), we can upgrade CUDA 9.0/9.2/10.0 to use the latest cuDNN 7.5.1 & NCCL 2.4.2
@perdasilva please check it

Checklist

Ran three models: ResNet50 with ImageNet, LSTM with PTB, and MLP with MNIST
Performance is shown below
Environment: P3.16xlarge, Deep Learning Base AMI
Codebase: commit 1540a84
I also applied the change from PR #14837
Throughput is measured in samples per second
Each throughput value is the average of 5 runs

ResNet

model: Resnet50
dataset: Imagenet
number of gpu: 8
epochs: 3 (only to test throughput)
preprocess command: sudo pip install gluoncv==0.2.0b20180625
command: python mxnet_benchmark/train_imagenet.py --use-rec --batch-size 128 --dtype float32 --num-data-workers 40 --num-epochs 3 --gpus 0,1,2,3,4,5,6,7 --lr 0.05 --last-gamma --mode symbolic --model resnet50_v1b --rec-train /home/ubuntu/data/train-passthrough.rec --rec-train-idx /home/ubuntu/data/train-passthrough.idx --rec-val /home/ubuntu/data/val-passthrough.rec --rec-val-idx /home/ubuntu/data/val-passthrough.idx
github repo: https://github.com/rahul003/deep-learning-benchmark-mirror.git

Throughput (samples/s)	cuDNN 7.5.1/NCCL 2.4.2	cuDNN 7.3.1/NCCL 2.3.4	Performance Difference
CUDA 10	2831.54405	2821.9832	0.339%
CUDA 9.2	2832.36803	2843.28968	-0.384%
CUDA 9.0	2815.83939	2851.92915	-1.265%

Note: there is another performance regression with --batch-size 256 --dtype float16 --mode hybrid; see #14838 for details.

LSTM

model: LSTM
dataset: PTB(Penn Treebank)
number of gpu: 1
epochs: 10
command:
python2 benchmark_driver.py --framework mxnet --task-name mkl_lstm_ptb_symbolic --num-gpus 1 --epochs 10 --metrics-suffix test --kvstore local
python word_language_model/lstm_bucketing.py --num-hidden 650 --num-embed 650 --gpus 0 --epochs 10 --kv-store local

Throughput (samples/s)	cuDNN 7.5.1/NCCL 2.4.2	cuDNN 7.3.1/NCCL 2.3.4	Performance Difference
CUDA 10	847.98222	868.28966	-2.339%
CUDA 9.2	1005.25185	1051.06692	-4.359%
CUDA 9.0	1002.59081	1028.46962	-2.516%

CUDA 10 has a performance regression; see #14725 for details.

MLP

model: 3 dense layers with num_hidden=64 and relu as activation
dataset: MNIST
number of gpu: 1
epochs: 10
command:
python2 benchmark_runner.py --framework mxnet --metrics-policy mlp --task-name mlp --metrics-suffix test --num-gpus 1 --command-to-execute 'python3 mlp.py' --data-set mnist

Throughput (samples/s)	cuDNN 7.5.1/NCCL 2.4.2	cuDNN 7.3.1/NCCL 2.3.4	Performance Difference
CUDA 10	4638.73873	4500.7834	3.065%
CUDA 9.2	4425.37599	4540.29583	-2.531%
CUDA 9.0	4421.82611	4427.43356	-0.127%
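
For anyone who wants to sanity-check the difference column, here is a minimal sketch that recomputes it from the two throughput columns in the ResNet, LSTM, and MLP tables above. It assumes the column is the relative change of the new stack against the old one, (new - old) / old, which is not stated explicitly in this PR. The names `THROUGHPUT`, `pct_difference`, and `difference_table` are made up for illustration and are not part of the benchmark scripts.

```python
# Throughput in samples/second, copied from the three tables above.
# Each entry is (new stack: cuDNN 7.5.1/NCCL 2.4.2, old stack: cuDNN 7.3.1/NCCL 2.3.4).
THROUGHPUT = {
    "ResNet50": {
        "CUDA 10": (2831.54405, 2821.9832),
        "CUDA 9.2": (2832.36803, 2843.28968),
        "CUDA 9.0": (2815.83939, 2851.92915),
    },
    "LSTM": {
        "CUDA 10": (847.98222, 868.28966),
        "CUDA 9.2": (1005.25185, 1051.06692),
        "CUDA 9.0": (1002.59081, 1028.46962),
    },
    "MLP": {
        "CUDA 10": (4638.73873, 4500.7834),
        "CUDA 9.2": (4425.37599, 4540.29583),
        "CUDA 9.0": (4421.82611, 4427.43356),
    },
}


def pct_difference(new, old):
    """Relative change of `new` against `old`, in percent (assumed definition)."""
    return (new - old) / old * 100.0


def difference_table(throughput):
    """Recompute the difference column for every model and CUDA version."""
    return {
        model: {
            cuda: round(pct_difference(new, old), 3)
            for cuda, (new, old) in rows.items()
        }
        for model, rows in throughput.items()
    }


if __name__ == "__main__":
    for model, rows in difference_table(THROUGHPUT).items():
        for cuda, diff in rows.items():
            print(f"{model:9s} {cuda:9s} {diff:+.3f}%")
```

A negative value means the cuDNN 7.5.1/NCCL 2.4.2 build was slower than the cuDNN 7.3.1/NCCL 2.3.4 build for that configuration.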

Comments

@szha @lanking520 @eric-haibin-lin @perdasilva

perdasilva

LGTM - thank you for all your efforts getting CI to cuda v10.1 and the latest cudnn - very nice indeed!

pinaraws · 2019-05-20T16:07:58Z

@mxnet-label-bot add[CI, pr-awaiting-merge]

stu1130 requested a review from szha as a code owner May 17, 2019 18:14

stu1130 changed the title ~~[Dependency Update] Upgrade cuDNN & NCCL~~ [WIP][Dependency Update] Upgrade cuDNN & NCCL May 17, 2019

stu1130 changed the title ~~[WIP][Dependency Update] Upgrade cuDNN & NCCL~~ [Dependency Update] Upgrade cuDNN & NCCL May 17, 2019

stu1130 force-pushed the bump_up_cudnn_to_7_5_1 branch 2 times, most recently from e637967 to d27477f Compare May 19, 2019 06:22

bump up cudnn to 7.5.1 & nccl 2.4.2

bf239ce

stu1130 force-pushed the bump_up_cudnn_to_7_5_1 branch from d27477f to bf239ce Compare May 20, 2019 00:18

perdasilva approved these changes May 20, 2019

marcoabreu added the CI and pr-awaiting-merge (Review and CI is complete. Ready to Merge) labels May 20, 2019

szha merged commit ace478f into apache:master May 20, 2019

haohuanw pushed a commit to haohuanw/incubator-mxnet that referenced this pull request Jun 23, 2019

bump up cudnn to 7.5.1 & nccl 2.4.2 (apache#14988)

bdfeb13
