
Change RNN OP to stateful #14476

Merged: 30 commits into apache:master on Apr 13, 2019

Conversation

lihaofd (Contributor) commented Mar 20, 2019

Description

This PR refactors the RNN OP into a stateful operator using FStatefulCompute.
@pengzhao-intel, @TaoLv , @ciyongch

Feature changes

New features

  • Support Stateful RNN OP registration

Unit-test changes

  • Use existing unit tests such as tests/python/unittests/test_operator.py to check consistency with LSTMCell/GRUCell/RNNCell outputs

Checklist

  • Passed code style checking (make lint).
  • All changes have test coverage.
  • Code is well-documented.

TaoLv (Member) commented Mar 20, 2019

@szha @sbodenstein Please help to review.

TaoLv requested review from szha and sxjscience on Mar 20, 2019, 02:58
szha (Member) commented Mar 20, 2019

cc @stephenrawls since the work is related.

pengzhao-intel (Contributor) commented:

@szha @stephenrawls we are not 100% sure about the refactoring of the GPU part. Please help review; suggestions and contributions are highly appreciated.

We can add the related folks to the collaborator list for the code refactoring.

pinaraws commented:

@mxnet-label-bot add [pr-awaiting-review]

marcoabreu added the pr-awaiting-review label on Mar 20, 2019
stephenrawls (Contributor) commented:

This will definitely impact my cudnn var-length PR. It's been on the back burner for the last couple of weeks, but I've been meaning to push it through soon.

If this is going to be merged first, I can just update my PR to work with this change.

I'm not entirely familiar with the code base. Can you let me know what making RNN a stateful operator means? I'm not sure whether this is something I need to worry about in my PR.

Mostly it looks like this just takes the cudnn-specific header file and puts all that code into the generic rnn-inl.h header?

szha (Member) commented Mar 20, 2019

@stephenrawls This is to change the RNN op from legacy operator to a stateful operator. A stateful operator supports carrying arbitrary state from the forward computation to the backward computation. This makes it easy to pass states such as cudnn reserved spaces from forward op to the backward op.
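For readers unfamiliar with the stateful-op API, here is a minimal sketch of what such a registration looks like. RNNOpState is a hypothetical placeholder for illustration, not the PR's actual state class:

#include <nnvm/op.h>
#include <mxnet/op_attr_types.h>  // OpStatePtr, FCreateOpState, FStatefulCompute

// Hypothetical state carried from forward to backward
// (e.g. the cudnn reserved space mentioned above).
struct RNNOpState {
  // ... anything the backward pass needs ...
};

NNVM_REGISTER_OP(RNN)
.set_attr<mxnet::FCreateOpState>("FCreateOpState",
  [](const nnvm::NodeAttrs& attrs, mxnet::Context ctx,
     const std::vector<mxnet::TShape>& in_shape,
     const std::vector<int>& in_type) {
    // Created once per graph node; reused across forward/backward calls.
    return mxnet::OpStatePtr::Create<RNNOpState>();
  })
.set_attr<mxnet::FStatefulCompute>("FStatefulCompute<cpu>",
  [](const mxnet::OpStatePtr& state, const mxnet::OpContext& ctx,
     const std::vector<mxnet::TBlob>& inputs,
     const std::vector<mxnet::OpReqType>& req,
     const std::vector<mxnet::TBlob>& outputs) {
    RNNOpState& s = state.get_state<RNNOpState>();
    // Forward can stash state in `s`; the backward FStatefulCompute
    // registered on the corresponding backward op sees the same `s`.
  });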

lihaofd (Contributor, Author) commented Mar 25, 2019

CI is blocked by #14507.

Review thread on src/operator/rnn-inl.h:

-DType* cx_ptr = NULL;
-DType* cy_ptr = NULL;
+DType * cx_ptr = NULL;
+DType * cy_ptr = NULL;
Member:

DType* cx_ptr looks better and makes the style more consistent.

lihaofd (Contributor, Author):

fixed

sxjscience (Member) commented:

Conducted one round of review. Some minor problems.

szha (Member) commented Apr 9, 2019

@lihaofd thanks for the fix. Could you rebase to the current master? The flaky tests might have been disabled on master already.

szha (Member) left a comment:

Had an offline discussion with @eric-haibin-lin regarding the specific point of using the storage handler for managing temporary space. Conclusions:

  • This is not good practice, as an operator should not be able to interact with the storage manager.
  • A good alternative is to create an NDArray for the same purpose; operators can then rely on its RAII property for life-cycle management.

However, we realize this is a bad pattern that already exists in the current code, so we would not want to impose the work of correcting it on the PR author. @lihaofd let me know if you prefer to correct this point (which would be really helpful!) or if you'd prefer to merge as is (which we can understand too).
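For illustration, a minimal sketch of the NDArray-based alternative (a hypothetical helper, not code from this PR): the workspace is owned by an NDArray member, so it is freed automatically when the op state is destroyed, with no manual Free() call.

#include <mxnet/ndarray.h>

class RNNOpExample {                // hypothetical; stands in for the op state
  mxnet::NDArray temp_space_;       // RAII: released when the op state dies

  void EnsureTempSpace(size_t nbytes, const mxnet::Context& ctx) {
    if (temp_space_.is_none() ||
        temp_space_.shape().Size() < nbytes) {
      // one-byte dtype so the shape counts bytes directly
      temp_space_ = mxnet::NDArray(mxnet::TShape(mshadow::Shape1(nbytes)),
                                   ctx, false, mshadow::kUint8);
    }
  }
};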

lihaofd (Contributor, Author) commented Apr 12, 2019

@szha I suggest we merge the current PR and keep enhancing it in the future. Thanks.

eric-haibin-lin (Member) left a comment:

Thanks for the contribution. This is a significant amount of work to refactor all this code. Please check my comments. While my intent is not to block this PR from merging, I hope to get some clarifications. I have no objection to merging the PR if there is urgency or a dependency.

Review thread (cudnn weight-descriptor code):

3,
dim_w));

// Query weight layout
Member:

What are these?

Member:

cuDNN provides an interface for querying the layout of the fused weight matrix. The current operator hard-codes the layout instead of relying on this interface, since the layout requirement rarely changes.
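Presumably the interface meant here is cuDNN 7's cudnnGetRNNLinLayerMatrixParams. A hedged sketch of such a query, where handle, rnn_desc, x_desc, w_desc and w_ptr are assumed to have been created and initialized elsewhere, and CUDNN_CALL is MXNet's error-checking macro:

// Query where one fused weight matrix lives inside the packed weight blob.
cudnnFilterDescriptor_t mat_desc;
void* mat_ptr = nullptr;
CUDNN_CALL(cudnnCreateFilterDescriptor(&mat_desc));
CUDNN_CALL(cudnnGetRNNLinLayerMatrixParams(
    handle, rnn_desc,
    0,                 // pseudo-layer index
    x_desc, w_desc, w_ptr,
    0,                 // linLayerID: which gate matrix within the layer
    mat_desc,          // out: descriptor giving the matrix's dims/layout
    &mat_ptr));        // out: pointer offset into the packed weights
CUDNN_CALL(cudnnDestroyFilterDescriptor(mat_desc));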

Review thread:

// Allocate the reserve space
reserve_space_ = Storage::Get()->Alloc(reserve_space_byte_, Context::GPU(s->dev_id));
// Allocate the temp space
temp_space_ = Storage::Get()->Alloc(workspace_byte_, Context::GPU(s->dev_id));
Member:

If I understand correctly, temp_space_ could be requested by adding kTempSpace to the resource request (checking dev_mask in FResourceRequestEx) instead of calling the StorageManager directly?

Member:

Yes, it was done that way in the previous implementation. Given that the storage manager is already exposed for the reserved space, I requested that the temp space be handled consistently. It also has the benefit that the space can be shared between forward and backward, since the size doesn't change.
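A sketch of the resource-request route discussed above, under the assumption that the op registers FResourceRequestEx and reads the space from ctx.requested; this shows the alternative, not what the PR merged:

NNVM_REGISTER_OP(RNN)
.set_attr<mxnet::FResourceRequestEx>("FResourceRequestEx",
  [](const nnvm::NodeAttrs& attrs, const int dev_mask,
     const mxnet::DispatchMode dispatch_mode) {
    std::vector<mxnet::ResourceRequest> request;
    if (dev_mask == mshadow::gpu::kDevMask)        // only the GPU path needs it
      request.emplace_back(mxnet::ResourceRequest::kTempSpace);
    return request;
  });

// Inside the compute function the space is then fetched per call, e.g.:
//   mshadow::Tensor<gpu, 1, DType> temp =
//       ctx.requested[0].get_space_typed<gpu, 1, DType>(
//           mshadow::Shape1(workspace_size), s);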

Review thread:

dep.push_back(out_data[rnn_enum::kStateOut]);
dep.push_back(out_grad[rnn_enum::kStateOut]);
/*
index description
Member:

We can probably make some enums like RNNOpInputs for these indices.

lihaofd (Contributor, Author):

Depending on param.mode == rnn_enum::kLstm and param.state_outputs, the index order differs across RNN types, with or without state output.
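To illustrate that point with purely hypothetical helpers (not the PR's actual layout), the counts that drive the index order could be made explicit instead of hard-coding offsets:

// Hypothetical helpers showing how the index order shifts with the parameters.
inline int NumRNNInputs(const RNNParam& param) {
  // kData, kParams, kState, plus kStateCell for LSTM
  return param.mode == rnn_enum::kLstm ? 4 : 3;
}

inline int NumRNNOutputs(const RNNParam& param) {
  // kOut, plus kStateOut (and kStateCellOut for LSTM) when state_outputs is set
  if (!param.state_outputs) return 1;
  return param.mode == rnn_enum::kLstm ? 3 : 2;
}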

szha merged commit 1c49e40 into apache:master on Apr 13, 2019
haohuanw pushed a commit to haohuanw/incubator-mxnet that referenced this pull request on Jun 23, 2019:
* change RNN OP to stateful

* retrigger the ci

* fix windows compile issue

* move cudnnrnn class into rnn-inl.h

* fix some bugs

* fix bug about gpu case like tests/python/gpu/test_gluon_gpu.test_rnn_layers_fp32 etc

* fix gpu compile issue of unix-gpu and windows-gpu

* print log for test

* fix GPU NO CUDNN for unix-gpu case

* print log

* remove print log and make gpu case has same error message as master when USE_CUDA=1 USE_CUDNN=0

* fix typo bug

* retrigger the ci

* retrigger the ci

* retrigger the ci

* fix comments

* retrigger the ci

* fix comments

* retrigger the ci

* fix comments

* retrigger the ci

* retrigger the ci

* retrigger the ci

* retrigger the ci

* fix comments