Skip to content

[MKL-DNN] MKL-DNN RNN backward path enhancement #17183

Merged

pengzhao-intel merged 1 commit into apache:master from zixuanweeei:rnn/gradients on Jan 6, 2020

Conversation

zixuanweeei
Contributor

Description

This PR mainly aims to fix the gradient explosion reported in #17086. Other enhancements are also included; they are listed in the Changes section.

Changes

  • Flush memory before the RNN backward primitive
  • Add a Gluon RNN unit test for gradient checking (a minimal sketch of the idea follows this list)
  • Cache reorder primitives
  • Rewrite the RNN support check
  • Update OpSignature.AddSign to avoid a potential hash collision for rnn-packed memory
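
Below is a minimal sketch of the kind of gradient check this change adds: run a Gluon LSTM forward and backward under `autograd` twice on the same input and verify that the gradients are finite (no explosion, see #17086) and reproducible. The shapes, tolerances, and the use of a plain LSTM layer are illustrative assumptions, not the exact unit test added by this PR.

```python
import numpy as np
import mxnet as mx
from mxnet import gluon, autograd

def check_rnn_grads(hidden_size=128, num_layers=1, seq_len=5,
                    batch_size=2, input_dim=32):
    """Run forward + backward twice and compare the parameter gradients."""
    net = gluon.rnn.LSTM(hidden_size, num_layers)
    net.initialize(mx.init.Xavier())
    x = mx.nd.random.uniform(shape=(seq_len, batch_size, input_dim))

    grads = []
    for _ in range(2):
        with autograd.record():
            loss = net(x).sum()
        loss.backward()
        mx.nd.waitall()
        grads.append({name: p.grad().asnumpy().copy()
                      for name, p in net.collect_params().items()})

    for name in grads[0]:
        # Gradients must be finite (no explosion) and consistent across runs.
        assert np.all(np.isfinite(grads[0][name])), name
        np.testing.assert_allclose(grads[0][name], grads[1][name],
                                   rtol=1e-3, atol=1e-5)

check_rnn_grads()
```

The actual test in the PR may differ, for example by comparing against a reference implementation; this sketch only illustrates checking that the gradients stay finite and reproducible.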

Comments

  • Backward performance should improve to some degree because the reorder primitives are now cached. A performance comparison will be posted later. Besides, this patch is expected to have no impact on the forward path.

@ciyongch @TaoLv @pengzhao-intel

@zixuanweeei
Contributor Author

We compared the performance of 1cfaf3c and this PR (c91c9e1). The results for LSTM, GRU, and vanilla RNN cells are shown below. The improvement rates suggest that this patch improves backward performance. The performance of some forward paths has decreased noticeably, which needs more investigation.

num_layer=1, batch_size=64, seq_len=50, input_dim=512

| directions | cell_type | hidden_size | Forward Gap | Backward Gap |
| --- | --- | --- | --- | --- |
| 1 | lstm | 512 | -2.69% | 7.17% |
| 1 | lstm | 1024 | -2.65% | 10.02% |
| 2 | lstm | 512 | -3.13% | 7.11% |
| 2 | lstm | 1024 | -1.44% | 10.75% |
| 1 | gru | 512 | 0.26% | 9.63% |
| 1 | gru | 1024 | -2.91% | 7.84% |
| 2 | gru | 512 | -2.33% | 6.36% |
| 2 | gru | 1024 | -1.65% | 4.90% |
| 1 | rnn_relu | 512 | -9.08% | 7.08% |
| 1 | rnn_relu | 1024 | -7.52% | 8.63% |
| 2 | rnn_relu | 512 | -6.93% | 6.39% |
| 2 | rnn_relu | 1024 | -3.88% | 8.68% |
| 1 | rnn_tanh | 512 | -8.68% | 11.87% |
| 1 | rnn_tanh | 1024 | -7.71% | 10.42% |
| 2 | rnn_tanh | 512 | -10.97% | 4.60% |
| 2 | rnn_tanh | 1024 | -5.46% | 13.53% |

@leezu leezu added the R1.6.0 label Jan 2, 2020
@TaoLv
Member

TaoLv commented Jan 3, 2020

@zixuanweeei Please remove [WIP] from the title and retrigger CI. @ciyongch, please help review. Thanks!

@zixuanweeei zixuanweeei changed the title [MKL-DNN][WIP] MKL-DNN RNN backward path enhancement [MKL-DNN] MKL-DNN RNN backward path enhancement Jan 3, 2020
@ciyongch
Contributor

ciyongch commented Jan 3, 2020

@zixuanweeei Do you have any clue about the performance degradation in the forward pass? This PR also refactors some of the code shared by the forward and backward paths.

@zixuanweeei
Contributor Author

zixuanweeei commented Jan 3, 2020

> @zixuanweeei Do you have any clue about the performance degradation in the forward pass? This PR also refactors some of the code shared by the forward and backward paths.

Another performance test with more warm-up loops gave the results below, comparing 89fe1f6 and 622d843. The refactoring that affects the forward pass is related to reorder, but the reorder is only executed during initialization in the first iteration, so the RNN primitives may simply need more iterations to reach a stable state. I think we can merge this PR into master first, to settle the gradient explosion problem for master and the 1.6.x release.

| directions | cell_type | layer_num | hidden_size | Fwd Gap | Bwd Gap |
| --- | --- | --- | --- | --- | --- |
| 1 | lstm | 1 | 512 | -0.09% | 13.40% |
| 1 | lstm | 1 | 1024 | 0.93% | 8.47% |
| 2 | lstm | 1 | 512 | 0.24% | 8.03% |
| 2 | lstm | 1 | 1024 | 0.44% | 21.48% |
| 1 | gru | 1 | 512 | 0.40% | 11.57% |
| 1 | gru | 1 | 1024 | -0.81% | 12.09% |
| 2 | gru | 1 | 512 | 0.21% | 7.71% |
| 2 | gru | 1 | 1024 | 1.91% | 5.63% |
| 1 | rnn_relu | 1 | 512 | 2.03% | 18.97% |
| 1 | rnn_relu | 1 | 1024 | 10.51% | 10.48% |
| 2 | rnn_relu | 1 | 512 | 1.51% | 9.22% |
| 2 | rnn_relu | 1 | 1024 | -0.42% | 9.12% |
| 1 | rnn_tanh | 1 | 512 | 2.87% | 5.28% |
| 1 | rnn_tanh | 1 | 1024 | 0.47% | 8.09% |
| 2 | rnn_tanh | 1 | 512 | 0.96% | 6.88% |
| 2 | rnn_tanh | 1 | 1024 | 0.27% | 9.15% |
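
For context, here is a minimal sketch of the measurement style described above, with warm-up iterations before the timed loop so that primitive creation and the cached reorders fall outside the measurement. The layer configuration, warm-up count, and iteration count are illustrative assumptions rather than the exact benchmark script behind the table, and it times forward and backward together instead of reporting them separately.

```python
import time
import mxnet as mx
from mxnet import gluon, autograd

def benchmark_rnn(cell_type="lstm", hidden_size=512, num_layers=1,
                  bidirectional=False, batch_size=64, seq_len=50,
                  input_dim=512, warmup=20, iters=100):
    """Average time per forward + backward pass of a Gluon RNN layer."""
    layer_cls = {"lstm": gluon.rnn.LSTM, "gru": gluon.rnn.GRU}[cell_type]
    net = layer_cls(hidden_size, num_layers, bidirectional=bidirectional)
    net.initialize()
    x = mx.nd.random.uniform(shape=(seq_len, batch_size, input_dim))

    # Warm-up loops: let primitive creation and reorder caching settle.
    for _ in range(warmup):
        with autograd.record():
            y = net(x)
        y.backward()
    mx.nd.waitall()

    start = time.time()
    for _ in range(iters):
        with autograd.record():
            y = net(x)
        y.backward()
    mx.nd.waitall()
    return (time.time() - start) / iters

print(benchmark_rnn("lstm", hidden_size=512))
```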


@ciyongch ciyongch left a comment


LGTM

* Flush memory before RNN backward primitive

* Add gluon rnn unit test for gradients check

* Cache reorder

* Re-write rnn supporting check

* Update OpSignature.AddSign to avoid potential hash collision for
rnn-packed memory

Get the data type from mkldnn memory descriptor when setting grad handle
@zixuanweeei
Contributor Author

Thanks, @ciyongch.

@TaoLv @pengzhao-intel Please review again. I rebased the current branch onto the official master, so we can merge it once CI passes.

CPU Performance and Quantization automation moved this from In progress to Reviewer approved Jan 6, 2020

@pengzhao-intel pengzhao-intel left a comment


LGTM
Will merge the PR after CI passes.

@pengzhao-intel pengzhao-intel merged commit 83a23b0 into apache:master Jan 6, 2020
CPU Performance and Quantization automation moved this from Reviewer approved to Done Jan 6, 2020
zixuanweeei added a commit to zixuanweeei/mxnet that referenced this pull request Jan 6, 2020
* Flush memory before RNN backward primitive

* Add gluon rnn unit test for gradients check

* Cache reorder

* Re-write rnn supporting check

* Update OpSignature.AddSign to avoid potential hash collision for
rnn-packed memory

Get the data type from mkldnn memory descriptor when setting grad handle
TaoLv pushed a commit that referenced this pull request Jan 7, 2020

* [MKLDNN] mkldnn RNN operator enhancement (#17075)

* mkldnn rnn operator enhancement

`add` operation support

Rename AddTo

Add MXNET_USE_MKLDNN_RNN env

Add Env var for switching to naive RNN impl and naive add/copy impl

* Re-run CI, op:test_reduce failed on Unix-CPU

* Rerun CI, Python2 CPU on Unix-CPU timeout

* MKL-DNN RNN backward path enhancement (#17183)

* Flush memory before RNN backward primitive

* Add gluon rnn unit test for gradients check

* Cache reorder

* Re-write rnn supporting check

* Update OpSignature.AddSign to avoid potential hash collision for
rnn-packed memory

Get the data type from mkldnn memory descriptor when setting grad handle
@zixuanweeei zixuanweeei deleted the rnn/gradients branch March 12, 2020 03:39