Skip to content

[MKL-DNN] MKL-DNN RNN backward path enhancement #17183

Merged

pengzhao-intel merged 1 commit into apache:master from zixuanweeei:rnn/gradients on Jan 6, 2020

Conversation

zixuanweeei
Contributor

Description

This PR mainly aims to fix the gradient explosion reported in #17086. Other enhancements are also included; they are listed in the Changes section.

Changes

  • Flush memory before the RNN backward primitive
  • Add a Gluon RNN unit test for gradient checking (a minimal sketch of the idea follows this list)
  • Cache reorder primitives
  • Rewrite the RNN support check
  • Update OpSignature.AddSign to avoid a potential hash collision for rnn-packed memory
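
Below is a minimal sketch of the kind of gradient check this change adds: run a Gluon LSTM forward and backward under `autograd` twice on the same input and verify that the gradients are finite (no explosion, see #17086) and reproducible. The shapes, tolerances, and the use of a plain LSTM layer are illustrative assumptions, not the exact unit test added by this PR.

```python
import numpy as np
import mxnet as mx
from mxnet import gluon, autograd

def check_rnn_grads(hidden_size=128, num_layers=1, seq_len=5,
                    batch_size=2, input_dim=32):
    """Run forward + backward twice and compare the parameter gradients."""
    net = gluon.rnn.LSTM(hidden_size, num_layers)
    net.initialize(mx.init.Xavier())
    x = mx.nd.random.uniform(shape=(seq_len, batch_size, input_dim))

    grads = []
    for _ in range(2):
        with autograd.record():
            loss = net(x).sum()
        loss.backward()
        mx.nd.waitall()
        grads.append({name: p.grad().asnumpy().copy()
                      for name, p in net.collect_params().items()})

    for name in grads[0]:
        # Gradients must be finite (no explosion) and consistent across runs.
        assert np.all(np.isfinite(grads[0][name])), name
        np.testing.assert_allclose(grads[0][name], grads[1][name],
                                   rtol=1e-3, atol=1e-5)

check_rnn_grads()
```

The actual test in the PR may differ, for example by comparing against a reference implementation; this sketch only illustrates checking that the gradients stay finite and reproducible.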

Comments

  • Backward performance should improve to some degree because the reorder primitives are now cached. A performance comparison will be posted later. Besides, this patch is expected to have no impact on the forward path.

@ciyongch @TaoLv @pengzhao-intel

@zixuanweeei
Contributor Author

We compared the performance of 1cfaf3c and this PR (c91c9e1). The results for LSTM, GRU, and vanilla RNN cells are shown below. The improvement rates suggest that this patch improves backward performance. The performance of some forward paths has decreased noticeably, which needs more investigation.

num_layer=1, batch_size=64, seq_len=50, input_dim=512

| directions | cell_type | hidden_size | Forward Gap | Backward Gap |
| --- | --- | --- | --- | --- |
| 1 | lstm | 512 | -2.69% | 7.17% |
| 1 | lstm | 1024 | -2.65% | 10.02% |
| 2 | lstm | 512 | -3.13% | 7.11% |
| 2 | lstm | 1024 | -1.44% | 10.75% |
| 1 | gru | 512 | 0.26% | 9.63% |
| 1 | gru | 1024 | -2.91% | 7.84% |
| 2 | gru | 512 | -2.33% | 6.36% |
| 2 | gru | 1024 | -1.65% | 4.90% |
| 1 | rnn_relu | 512 | -9.08% | 7.08% |
| 1 | rnn_relu | 1024 | -7.52% | 8.63% |
| 2 | rnn_relu | 512 | -6.93% | 6.39% |
| 2 | rnn_relu | 1024 | -3.88% | 8.68% |
| 1 | rnn_tanh | 512 | -8.68% | 11.87% |
| 1 | rnn_tanh | 1024 | -7.71% | 10.42% |
| 2 | rnn_tanh | 512 | -10.97% | 4.60% |
| 2 | rnn_tanh | 1024 | -5.46% | 13.53% |

@leezu leezu added the R1.6.0 label Jan 2, 2020
@TaoLv
Member

TaoLv commented Jan 3, 2020

@zixuanweeei Please remove [WIP] from the title and retrigger CI. @ciyongch, please help review. Thanks!

@zixuanweeei zixuanweeei changed the title [MKL-DNN][WIP] MKL-DNN RNN backward path enhancement [MKL-DNN] MKL-DNN RNN backward path enhancement Jan 3, 2020
@ciyongch
Contributor

ciyongch commented Jan 3, 2020

@zixuanweeei Do you have any clue about the performance degradation in the forward pass? This PR also refactors some of the code shared by the forward and backward paths.

@zixuanweeei
Contributor Author

zixuanweeei commented Jan 3, 2020

> @zixuanweeei Do you have any clue about the performance degradation in the forward pass? This PR also refactors some of the code shared by the forward and backward paths.

Another performance test with more warm-up loops gave the results below, comparing 89fe1f6 and 622d843. The refactoring that affects the forward pass is related to reorder, but the reorder is only executed during initialization in the first iteration, so the RNN primitives may simply need more iterations to reach a stable state. I think we can merge this PR into master first, to settle the gradient explosion problem for master and the 1.6.x release.

| directions | cell_type | layer_num | hidden_size | Fwd Gap | Bwd Gap |
| --- | --- | --- | --- | --- | --- |
| 1 | lstm | 1 | 512 | -0.09% | 13.40% |
| 1 | lstm | 1 | 1024 | 0.93% | 8.47% |
| 2 | lstm | 1 | 512 | 0.24% | 8.03% |
| 2 | lstm | 1 | 1024 | 0.44% | 21.48% |
| 1 | gru | 1 | 512 | 0.40% | 11.57% |
| 1 | gru | 1 | 1024 | -0.81% | 12.09% |
| 2 | gru | 1 | 512 | 0.21% | 7.71% |
| 2 | gru | 1 | 1024 | 1.91% | 5.63% |
| 1 | rnn_relu | 1 | 512 | 2.03% | 18.97% |
| 1 | rnn_relu | 1 | 1024 | 10.51% | 10.48% |
| 2 | rnn_relu | 1 | 512 | 1.51% | 9.22% |
| 2 | rnn_relu | 1 | 1024 | -0.42% | 9.12% |
| 1 | rnn_tanh | 1 | 512 | 2.87% | 5.28% |
| 1 | rnn_tanh | 1 | 1024 | 0.47% | 8.09% |
| 2 | rnn_tanh | 1 | 512 | 0.96% | 6.88% |
| 2 | rnn_tanh | 1 | 1024 | 0.27% | 9.15% |
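
For context, here is a minimal sketch of the measurement style described above, with warm-up iterations before the timed loop so that primitive creation and the cached reorders fall outside the measurement. The layer configuration, warm-up count, and iteration count are illustrative assumptions rather than the exact benchmark script behind the table, and it times forward and backward together instead of reporting them separately.

```python
import time
import mxnet as mx
from mxnet import gluon, autograd

def benchmark_rnn(cell_type="lstm", hidden_size=512, num_layers=1,
                  bidirectional=False, batch_size=64, seq_len=50,
                  input_dim=512, warmup=20, iters=100):
    """Average time per forward + backward pass of a Gluon RNN layer."""
    layer_cls = {"lstm": gluon.rnn.LSTM, "gru": gluon.rnn.GRU}[cell_type]
    net = layer_cls(hidden_size, num_layers, bidirectional=bidirectional)
    net.initialize()
    x = mx.nd.random.uniform(shape=(seq_len, batch_size, input_dim))

    # Warm-up loops: let primitive creation and reorder caching settle.
    for _ in range(warmup):
        with autograd.record():
            y = net(x)
        y.backward()
    mx.nd.waitall()

    start = time.time()
    for _ in range(iters):
        with autograd.record():
            y = net(x)
        y.backward()
    mx.nd.waitall()
    return (time.time() - start) / iters

print(benchmark_rnn("lstm", hidden_size=512))
```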


@ciyongch ciyongch left a comment


LGTM

* Flush memory before RNN backward primitive

* Add gluon rnn unit test for gradients check

* Cache reorder

* Re-write rnn supporting check

* Update OpSignature.AddSign to avoid potential hash collision for
rnn-packed memory

Get the data type from mkldnn memory descriptor when setting grad handle
@zixuanweeei
Contributor Author

Thanks, @ciyongch.

@TaoLv @pengzhao-intel Please review again. I rebased the current branch onto the official master, so we can merge it once CI passes.

CPU Performance and Quantization automation moved this from In progress to Reviewer approved Jan 6, 2020

@pengzhao-intel pengzhao-intel left a comment


LGTM
Will merge the PR after CI passes.

@pengzhao-intel pengzhao-intel merged commit 83a23b0 into apache:master Jan 6, 2020
CPU Performance and Quantization automation moved this from Reviewer approved to Done Jan 6, 2020
zixuanweeei added a commit to zixuanweeei/mxnet that referenced this pull request Jan 6, 2020
* Flush memory before RNN backward primitive

* Add gluon rnn unit test for gradients check

* Cache reorder

* Re-write rnn supporting check

* Update OpSignature.AddSign to avoid potential hash collision for
rnn-packed memory

Get the data type from mkldnn memory descriptor when setting grad handle
TaoLv pushed a commit that referenced this pull request Jan 7, 2020

* [MKLDNN] mkldnn RNN operator enhancement (#17075)

* mkldnn rnn operator enhancement

`add` operation support

Rename AddTo

Add MXNET_USE_MKLDNN_RNN env

Add Env var for switching to naive RNN impl and naive add/copy impl

* Re-run CI, op:test_reduce failed on Unix-CPU

* Rerun CI, Python2 CPU on Unix-CPU timeout

* MKL-DNN RNN backward path enhancement (#17183)

* Flush memory before RNN backward primitive

* Add gluon rnn unit test for gradients check

* Cache reorder

* Re-write rnn supporting check

* Update OpSignature.AddSign to avoid potential hash collision for
rnn-packed memory

Get the data type from mkldnn memory descriptor when setting grad handle
@zixuanweeei zixuanweeei deleted the rnn/gradients branch March 12, 2020 03:39