
Add support for fast variable-length LSTM #14208

Merged
merged 36 commits into from
May 7, 2019

Conversation

stephenrawls
Contributor

@stephenrawls stephenrawls commented Feb 19, 2019

Description

This is very much a work in progress (WIP).

I am trying to add support for variable-length LSTMs in MXNet, and need someone's help to complete the PR.

This PR adds support for variable-length sequences using cuDNN's built-in support for them. It does not currently add CPU support, but probably should.

I had to make a few changes to get it working that definitely need to change before we think about merging:

  1. I changed the way Gluon handles kwargs to model blocks. Originally I did this because I thought I needed to pass in use_sequence_length=True, sequence_length=sequence_length the way the current SequenceLast/SequenceReverse/etc. operators do. Now I think I probably don't need this, since I moved the use_sequence_length parameter to the constructor. (Although perhaps we don't need this at all? It seems clunky to me, but I was trying to follow existing conventions.)

  2. Currently I hard-coded the F.RNN call in rnn_layer.py to look like `F.RNN(inputs, params, states[0], states[1], sequence_length, ...)`, because `F.RNN(inputs, params, *states, sequence_length, ...)` was giving me Python syntax errors (see the sketch after this list).

  3. optimizer.py had a bunch of imports that were failing for me (e.g. `multi_sgd_update`). Are they not failing for everyone else? I'm not sure why they failed only for me, but I had to remove them to get my test script to run.

  4. I had to modify imperative_utils.h to temporarily remove the check that all ndarrays are on the same context, because cuDNN expects the sequence length array to live on the CPU.
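As a side note on item 2, here is a minimal sketch of the syntax issue, assuming the failure came from Python 2 (which MXNet still supported at the time); the rnn() stand-in below is hypothetical:

def rnn(inputs, params, *rest):
    # Hypothetical stand-in for F.RNN, just to illustrate the call syntax.
    return inputs, params, rest

states = ['h0', 'c0']
# Python 2 rejects a positional argument after *-unpacking in a call, so
# rnn('x', 'w', *states, 'seq_len') is a SyntaxError there (Python 3.5+
# accepts it). Indexing the states explicitly works on both:
out = rnn('x', 'w', states[0], states[1], 'seq_len')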

I still need to add unit tests for this feature. Currently I am using the following test script:
https://gist.github.com/stephenrawls/8a4f4cb941f0c3f32bf5a8b3b12d32c0

Unfortunately it looks like the test I have is failing. Not sure why; I might need help from the cuDNN folks to track it down.

I have tested with cudnn v7.4.2.

Checklist

Essentials

Please feel free to remove inapplicable items for your PR.

  • The PR title starts with [MXNET-$JIRA_ID], where $JIRA_ID refers to the relevant JIRA issue created (except PRs with tiny changes)
  • Changes are complete (i.e. I finished coding on this PR)
  • All changes have test coverage:
  • Unit tests are added for small changes to verify correctness (e.g. adding a new operator)
  • Nightly tests are added for complicated/long-running ones (e.g. changing distributed kvstore)
  • Build tests will be added for build configuration changes (e.g. adding a new build option with NCCL)
  • Code is well-documented:
  • For user-facing API changes, API doc string has been updated.
  • For new C++ functions in header files, their functionalities and arguments are documented.
  • For new examples, README.md is added to explain what the example does, the source of the dataset, expected performance on the test set, and a reference to the original paper if applicable
  • Check the API doc at https://mxnet-ci-doc.s3-accelerate.dualstack.amazonaws.com/PR-$PR_ID/$BUILD_ID/index.html
  • To the best of my knowledge, examples are either not affected by this change, or have been fixed to be compatible with this change

Changes

  • Feature1, tests, (and when applicable, API doc)
  • Feature2, tests, (and when applicable, API doc)

Comments

  • If this change is a backward incompatible change, why must this change be made.
  • Interesting edge cases to note here

@ptrendx
Member

ptrendx commented Feb 19, 2019

3. optimizer.py had a bunch of imports that were failing for me (e.g. `multi_sgd_update`). Are they not failing for everyone else? Not sure why they just failed for me, but I had to remove them to get my test script to run.

This is very strange - are you sure you are importing the right version of the .so? Those functions are not present in 1.3.1 or 1.4.0, but should be present on master.

4. I had to modify imperative_utils.h to temporarily remove the check that all ndarrays are on the same context, because cuDNN expects the sequence length array to live on the CPU.

This is going to be a bigger issue, since I don't think NNVM currently allows inputs on different contexts. I think @sandeep-krishnamurthy was going to look at it at some point for handling nvJPEG.

Contributor

@drivanov drivanov left a comment


The lines are in the wrong order, which leads to compilation problems.
The lines, currently ordered as

          << inputs[i]->ctx();
    */
    }

need to be replaced by

          << inputs[i]->ctx();
       }
      */

@stephenrawls
Contributor Author

Sorry, will fix shortly. It was a snafu when trying to clean up my code before pushing to GitHub. Thanks for the catch!

@stephenrawls
Contributor Author

@szha maybe this deserves another PR, but I also noticed that cuDNN supports both sequence-major and batch-major sequences:

From: https://docs.nvidia.com/deeplearning/sdk/cudnn-developer-guide/index.html#cudnnRNNForwardInferenceEx

With unpacked layout, both sequence major (i.e. time major) and batch major are supported

Currently Gluon supports batch-major sequences by calling swapaxes:
https://github.com/apache/incubator-mxnet/blob/master/python/mxnet/gluon/rnn/rnn_layer.py#L241-L242

This ends up calling mshadow's swapaxis function, which I assume must be doing a copy (?), but to be honest I couldn't totally follow what the code is doing at a quick glance:
https://github.com/dmlc/mshadow/blob/3dc80815d965b56b9a975dc27229361955bf66fe/mshadow/extension/swapaxis.h

If it is doing a copy for swapaxes, it would be preferable for Gluon to just let cuDNN handle the batch-major layout directly without calling swapaxes.
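For reference, the round trip in question is roughly the following (shapes are illustrative):

import mxnet as mx

x = mx.nd.random.randn(5, 10, 8)     # batch-major: (batch, seq_len, channels)

# rnn_layer.py swaps to time-major before calling the RNN op, and back after.
# swapaxes returns a new, transposed NDArray (a copy), not a view.
x_tmajor = mx.nd.swapaxes(x, 0, 1)   # time-major: (10, 5, 8)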

@szha
Member

szha commented Feb 20, 2019

@stephenrawls swapaxis is doing an actual transpose

@TaoLv
Member

TaoLv commented Feb 20, 2019

@stephenrawls I still cannot get the whole picture of this PR. Can you post a proposal for this task with the information below? Then we can think about how to support it on the CPU side. Thank you.

  • What does the front end API look like? What's the typical usage of the API?
  • What's done in Gluon blocks/hybridblocks and sym.RNN to support that?
  • What's done inside cudnn?
  • From the title, is only LSTM supported?

@stephenrawls
Contributor Author

Hi @TaoLv ,

Basically my motivation is: we want to use stacked bidirectional LSTMs with variable sequence lengths. Currently this is slow in MXNet, because to do it right you have to either (1) use LSTMCell and unroll, which doesn't take advantage of cuDNN on the GPU; or (2) use cuDNN one layer at a time and do a lot of reversing of your output to pass into the backward-direction LSTM, so you can be sure that padding doesn't affect the result (see the sketch below).

This all seemed a bit silly, since cuDNN directly supports variable-length stacked bidirectional LSTMs efficiently.
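For concreteness, a sketch of workaround (2) for a single bidirectional layer; shapes, sizes, and names are illustrative:

import mxnet as mx

seq_len, batch, ch, hidden = 10, 5, 8, 4
data = mx.nd.random.randn(seq_len, batch, ch)   # time-major input
lengths = mx.nd.array([10, 9, 8, 7, 6])         # valid steps per batch element

fwd = mx.gluon.rnn.LSTM(hidden)
bwd = mx.gluon.rnn.LSTM(hidden)
fwd.initialize()
bwd.initialize()

out_f = fwd(data)
# Reverse only the valid prefix of each sequence so that padding never feeds
# into the backward-direction LSTM, then reverse its output back again.
rev_in = mx.nd.SequenceReverse(data, sequence_length=lengths,
                               use_sequence_length=True)
out_b = mx.nd.SequenceReverse(bwd(rev_in), sequence_length=lengths,
                              use_sequence_length=True)
out = mx.nd.concat(out_f, out_b, dim=2)         # (seq_len, batch, 2 * hidden)

For a stacked network this reverse/run/reverse dance repeats per layer, which is exactly the overhead cuDNN's native variable-length support avoids.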

Beyond my direct use case, for the community I guess it would probably be nice if:

  • All RNN types are supported. (This happens already via cuDNN, even though I am focused on LSTM.)
  • Both CPU and GPU are supported

I guess the API I currently have looks like this:

lstm = mx.gluon.rnn.LSTM(..normal params..., use_sequence_length=True)
lstm(input, sequence_length)
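A slightly fuller sketch of that usage, assuming the constructor flag added in this PR and time-major input (sizes are illustrative):

import mxnet as mx

seq_len, batch, ch = 10, 5, 8
lstm = mx.gluon.rnn.LSTM(20, num_layers=2, bidirectional=True,
                         use_sequence_length=True)
lstm.initialize()

data = mx.nd.random.randn(seq_len, batch, ch)
valid_lengths = mx.nd.array([10, 9, 8, 7, 6])   # one length per batch element

out = lstm(data, valid_lengths)                 # (seq_len, batch, 2 * 20)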

I am not tied to this API, however. For example, if we don't want a use_sequence_length param in the constructor, I'm happy to remove it. I only put it there in the first place to be similar to the existing Sequence* family of operators.

As far as what changes need to happen on the symbol side, I am less sure; I primarily use Gluon. Happy for folks to offer suggestions.

Thanks,
Stephen

@stephenrawls
Contributor Author

P.S. There was a GitHub issue for this last year, but it sort of morphed into other things and this particular piece didn't get accomplished:

#9543

@drivanov
Contributor

@stephenrawls: With the following simple changes in your test_varlength_lstm.py:

ctx = mx.gpu()
X = mx.nd.random.randn(timesteps, batch_size, nchannels).as_in_context(ctx)
valid_lengths = mx.nd.array([10,10,10,7,6]).as_in_context(ctx) # [10,10,10,10,6] seems to work fine

you don't need to modify imperative_utils.h

@stephenrawls
Contributor Author

@drivanov -- yes, if I move the sequence length array to the GPU, then I don't have to change the logic that verifies everything is on the same context. The reason I am keeping the sequence length array on the CPU even though the data is on the GPU is that the cuDNN API expects the sequence length array on the CPU, i.e. in host memory.

If I pass in the sequence length array in GPU memory, then I would have to do a GPU -> CPU copy inside of the LSTM operator, right?

@stephenrawls
Contributor Author

My latest push now gets variable-length LSTMs working correctly in cuDNN. It turns out that in addition to setting the layout to UNPACKED when creating input descriptors, we also need to call cudnnSetRNNPaddingMode() to enable unpacked padded LSTMs prior to calculating the operator's workspace size.

Now the forward pass works correctly and my rudimentary tests pass.

I still need help figuring out what to do about passing the sequence_length input on the CPU. Any ideas?

@roywei
Member

roywei commented Feb 21, 2019

@mxnet-label-bot add[RNN, Backend, Gluon, pr-work-in-progress]

@marcoabreu marcoabreu added Backend Issues related to the backend of MXNet Gluon pr-work-in-progress PR is still work in progress RNN labels Feb 21, 2019
@szha
Member

szha commented Feb 21, 2019

@stephenrawls for now you can do a `CHECK(!use_seq_len)` in the CPU kernel. @pengzhao-intel @TaoLv will see how to add it for CPU kernels.

@stephenrawls
Contributor Author

@szha okay, I will do that and update the PR shortly.

My question, though (sorry for not being clear), is about the GPU kernel. Currently the cuDNN API requires the sequence_length array to be in host memory, so I am passing the sequence_length array on the CPU even though the data array is on the GPU.

Currently imperative_utils.h requires that all ndarrays passed to an operator are on the same context. To get around this, I just commented out that check so I could at least test that my changes were working.

My question is: (1) What's the right way to modify the context check in imperative_utils.h? (2) Is there something similar that happens on the symbol API side that would have to change?

The alternative is to just pass the sequence_length array in on the GPU, but this would require an extra GPU -> CPU data transfer inside the operator that I would prefer to avoid.

@szha
Member

szha commented Feb 21, 2019

@stephenrawls the length array is generally small, so I think it's fine to do a copy inside the operator. @reminisce @junrushao1994 are looking into how to refurbish the backend so that such cases can be better supported.

@ptrendx
Member

ptrendx commented Feb 21, 2019

To answer your question "(2) Is there something similar that happens on the symbol API side that would have to change?": yes, the place-device graph pass will just error out, saying that it is impossible to mix devices.

I had a similar problem in the aggregate SGD case, where I wanted to pass learning rates and weight decay rates as CPU-side NDArrays. I ended up passing them as a tuple argument to the op in Python.
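A sketch of that pattern, assuming the aggregated-SGD signature that was on master at the time: the learning rates and weight decays travel as tuple attributes (plain host-side floats) rather than as NDArray inputs, so every NDArray input stays on one context.

import mxnet as mx

w1, w2 = mx.nd.ones((2, 2)), mx.nd.ones((3,))
g1, g2 = mx.nd.ones((2, 2)), mx.nd.ones((3,))

# lrs/wds are baked into the op as attributes, not passed as NDArrays,
# so the operator never sees a cross-context input.
mx.nd.multi_sgd_update(w1, g1, w2, g2,
                       lrs=(0.1, 0.05), wds=(1e-4, 1e-4),
                       num_weights=2, out=(w1, w2))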

@stephenrawls
Contributor Author

@szha Okay, I reworked the code to take sequence_length on the GPU and copy it to the CPU inside the kernel. I added a unit test for the functionality, and it appears to be working.

For some reason I am getting errors when trying to pass an int32 sequence_length array. Not sure why, because I changed infer_shape() to allow it. I can look into it later, but if anyone has suggestions please let me know.

@ptrendx -- I figured out the weird import error and reverted those changes to optimizer.py in my PR. Also as mentioned above, since I am now doing an extra GPU->CPU copy in the kernel code, I have also reverted the change to imperative_utils.h.

@szha -- I think all that remains in this PR is to add the CHECK(!use_seq_len) for the CPU case, figure out why int32 sequence_length is not working, and any cleanup/modifications you think are appropriate. Also, I added a unit test for Gluon; should I also add one for the Symbol API code, or is it sufficient to test this functionality in Gluon?

@pinaraws

@szha Could you please respond to the questions posted by @stephenrawls?

@piyushghai
Contributor

@stephenrawls Can you rebase with the latest master to resolve the merge conflicts?

@stephenrawls
Contributor Author

FYI, I am looking at this. I have my changes rebased against the latest master and updated to fit the new pattern; I just need to build and test. Since things have changed significantly, I need to re-set up my build environment. I should have an update in a day or so.

@stephenrawls
Contributor Author

@szha I finally got around to rebasing my PR and pushing the update.

The centos-gpu build is failing. The new var-length unit test is failing with this error:

ERROR: test_gluon_gpu.test_layer_bidirectional_varseqlength
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/usr/lib/python3.6/site-packages/nose/case.py", line 198, in runTest
    self.test(*self.arg)
  File "/work/mxnet/tests/python/gpu/../unittest/common.py", line 177, in test_new
    orig_test(*args, **kwargs)
  File "/work/mxnet/tests/python/gpu/../unittest/common.py", line 112, in test_new
    assert_raises((MXNetError, RuntimeError), orig_test, *args, **kwargs)
  File "/usr/lib64/python3.6/unittest/case.py", line 750, in assertRaises
    return context.handle('assertRaises', args, kwargs)
  File "/usr/lib64/python3.6/unittest/case.py", line 195, in handle
    callable_obj(*args, **kwargs)
  File "/work/mxnet/tests/python/gpu/test_gluon_gpu.py", line 290, in test_layer_bidirectional_varseqlength
    check_layer_bidirectional_varseqlen(7, 5)
  File "/work/mxnet/tests/python/gpu/test_gluon_gpu.py", line 267, in check_layer_bidirectional_varseqlen
    net_output = net(data, sequence_length).asnumpy()
  File "/work/mxnet/python/mxnet/gluon/block.py", line 540, in __call__
    out = self.forward(*args)
  File "/work/mxnet/python/mxnet/gluon/block.py", line 912, in forward
    self._deferred_infer_shape(x, *args)
  File "/work/mxnet/python/mxnet/gluon/block.py", line 793, in _deferred_infer_shape
    raise ValueError(error_msg)
ValueError: Deferred initialization failed because shape cannot be inferred. Error in operator lstm_rnn0: Shape inconsistent, Provided = [10], inferred shape=(2,10,7)
-------------------- >> begin captured stdout << ---------------------
infer_shape error. Arguments:
  data0: (11, 10, 5)
  data1: (10,)

I'm not sure why yet, because when I run the unit test code on my own machine with my own build, it works.

Also, the unix-gpu build is failing on the unit test test_operator.test_custom_op_exc, which seems unrelated to my PR.

I'll see if I can figure out tomorrow why the centos-gpu build is failing (even though it works on my own machine). But if you have any ideas or advice, please let me know.

@stephenrawls
Contributor Author

Rebased against mainline (requiring a force-push); all unit tests now pass.


# Only compare the valid sections for each batch entry
for b in range(batch_size):
    assert_allclose(net_output[:sequence_length_np[b], b], ref_net_output[:sequence_length_np[b], b])
Member

this doesn't seem to test whether the length-based masking is working properly, because the reference implementation also relies on the sequence length feature. Consider using LSTMCell as reference instead.

Contributor Author

The reference net is not using the sequence-length feature of cuDNN, because use_sequence_length defaults to false.

The reference net actually implements variable sequence-length support manually, using two LSTMs for the forward/backward directions and handling the reversing and the concatenation of the two directions by hand. That is, it does things a slower way than via cuDNN, but in a way we know should produce correct results.

Member

I see. I mistook the use_sequence_length flag to be in the RNN op. Still, whether the returned state is from the right step is not tested, which is also an important aspect of variable-length RNN support. It may be hard to test using the RNN layer as reference.

Contributor Author

Ah, good point. I am only testing the returned output, not the returned state.

I guess I could just loop over the batch elements one by one, passing each in turn to the reference LSTM. That way each input is correctly sized and I can easily grab the right returned state, roughly like the sketch below.
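Only as a rough sketch (names and sizes are illustrative):

import mxnet as mx

seq_len, batch, ch = 10, 5, 8
data = mx.nd.random.randn(seq_len, batch, ch)
lengths = [10, 9, 8, 7, 6]

ref = mx.gluon.rnn.LSTM(4)   # reference layer without use_sequence_length
ref.initialize()

for b in range(batch):
    # Feed one example at its true length; the returned state is then exactly
    # the state at that example's last valid step.
    ex = data[:lengths[b], b:b + 1, :]   # (lengths[b], 1, ch)
    out, states = ref(ex, ref.begin_state(batch_size=1))
    # ...compare states here against column b of the var-length net's states.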

Should I do that now or in a follow-on PR?

Member

I'm fine with it being in a follow-up PR.

@stephenrawls
Contributor Author

All unit tests passed except for the following:

AssertionError: Diff. between MXNet & TensorRT accuracy too high:
           MXNet = 99.150000, TensorRT = 99.140000

So I pushed a minor change to trigger the unit tests again and get this flaky test to pass.

@szha szha merged commit 527573e into apache:master May 7, 2019
access2rohit pushed a commit to access2rohit/incubator-mxnet that referenced this pull request May 14, 2019
* initial commit for variable length sequence support w/ cudnn

* removing check about all vectors on same context (need to add back in)

* fixing commented-out code to actually comment out what I wanted

* fixing cudnn layout type to be unpacked in var-length seq case

* looks like param.batch_size_ etc. weren't previously getting set in cudnn operator code. still doesn't fix cudnn error though

* must call cudnnSetRNNPaddingMode() to enable unpacked padded sequences

* cleaning up & adding unit tests

* cleaning up

* cleaning up

* removing stringstream and checking for cudnn >= 7.2

* fixing whitespace formatting errors; adding ifdef version guard for cudnn padding

* fixing a few syntax errors

* changing order of arguments in hybrid_forward for backward compatibility

* more build validation fixes

* using emplace_back to make linter happy

* adding import of mxnet.ndarray

* switching order of sequence_length in hybrid_forward again

* adding __call__ override to rnn layer to handle optional sequence_length argument

* whoops swapped order of args in one place but not the other

* changing type() to isinstance() to make linter happy

* changing lstm var seq length call to explicitly name sequence_length parameter

* fixing bad scope of if-statement checking state outputs

* resolving reviewer comments

* making linter happy by putting var definitions in appropriate ifdef

* fixing linter again

* fixing whitespace issues with linter

* fixing whitespace issues with linter

* fixing some typos that emerged fixing linter

* linter

* fixing more whitespace issues

* only access kTempSpace if on gpu

* removing tabs that slipped in

* fixing too-long line

* changing ifdef guard to be more generic

* reverting change so whitespace stays same w/ master

* adding todo comment
haohuanw pushed a commit to haohuanw/incubator-mxnet that referenced this pull request Jun 23, 2019