
split_and_load can now handle num_ctx > num_data. Github Issue #13909 #14607

Merged
merged 1 commit into apache:master on Apr 8, 2019

Conversation

mightydeveloper
Contributor

mightydeveloper commented Apr 3, 2019

Description

Handles Issue #13909.
When the last batch is smaller than the number of contexts, the previous behavior of the utility function gluon.utils.split_and_load was to throw an exception: ValueError: Too many slices for data with shape .... However, we can simply put one sample per context and ignore the remaining contexts.
This integrates nicely with the reproducing example code from the issue (the user does not need to modify their code):

losses = []
for d, l in zip(data, label):
    with autograd.record():
        out = net(d)
        losses.append(loss_fn(out, l))
for loss in losses:
    loss.backward()

Here data and label contain only the slices that were placed on the first few of the given contexts, so the forward/backward pass runs only in the contexts where it is needed (the remaining contexts do not need to do a fake forward/backward pass).
My concern is that people can make mistakes when calculating the mean of the losses.

So it would be nice to add this behavior to the documentation: split_and_load can output a list with fewer elements than the number of contexts when even_split=False.
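
For reference, here is a minimal sketch of the new behavior (the 5 CPU contexts and the shapes are made up for illustration; real GPU contexts behave the same way):

import mxnet as mx
from mxnet import gluon

ctx_list = [mx.cpu(i) for i in range(5)]          # stand-in for 5 GPUs
last_batch = mx.nd.random.uniform(shape=(3, 10))  # only 3 samples left

# With even_split=False this now returns a list of 3 arrays (one sample per
# context) instead of raising "ValueError: Too many slices for data ...".
pieces = gluon.utils.split_and_load(last_batch, ctx_list, even_split=False)
print(len(pieces))                  # 3
print([p.context for p in pieces])  # the first 3 contexts only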

Checklist

Essentials

Please feel free to remove inapplicable items for your PR.
-> I believe this PR is just a tiny change.

@mightydeveloper
Contributor Author

Just an FYI, I made a misleading comment on the code previously, so I fixed it.

@wkcn
Member

wkcn commented Apr 3, 2019

I am not sure how the weights are updated during backward for the contexts that received no input.

@mightydeveloper
Contributor Author

mightydeveloper commented Apr 3, 2019

@wkcn
In the example above, losses contains only the loss terms computed from real data samples (meaning that no unnecessary, possibly fake, losses from the unused contexts are appended in the first for loop).
So I believe that when we call loss.backward(), only the contexts whose losses were appended in that loop will run the backward pass and calculate gradients for the variables recorded there.

For example, suppose we have 3 examples left from the dataset and have 5 GPU contexts.
We would only call losses.append(loss_fn(out, l)) 3 times, marking variables in GPU contexts 0, 1, and 2.
So when we call

for loss in losses:
    loss.backward()

only the gradients on GPUs 0, 1, and 2 will be calculated, and when we eventually call trainer.step(), the weights will be updated.
Does this answer your question? (I might have misunderstood your concern.)

(+ Also, I noticed that I had an indentation error in my original PR description, so I just edited it.)
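
To make this concrete, here is a minimal, hypothetical sketch of that case (the tiny Dense net, L2Loss, and 5 CPU contexts stand in for a real model and 5 GPUs):

import mxnet as mx
from mxnet import autograd, gluon

ctx_list = [mx.cpu(i) for i in range(5)]    # stand-in for 5 GPUs
net = gluon.nn.Dense(1)
net.initialize(ctx=ctx_list)
loss_fn = gluon.loss.L2Loss()

x = mx.nd.random.uniform(shape=(3, 4))      # only 3 samples left
y = mx.nd.random.uniform(shape=(3,))
data = gluon.utils.split_and_load(x, ctx_list, even_split=False)
label = gluon.utils.split_and_load(y, ctx_list, even_split=False)

losses = []
for d, l in zip(data, label):               # iterates 3 times, contexts 0-2 only
    with autograd.record():
        losses.append(loss_fn(net(d), l))
for loss in losses:                         # backward runs only on contexts 0-2
    loss.backward()

print(len(losses))                               # 3
print(net.weight.grad(ctx_list[0]).abs().sum())  # non-zero: backward ran here
print(net.weight.grad(ctx_list[4]).abs().sum())  # zero: this context saw no data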

@wkcn
Member

wkcn commented Apr 3, 2019

@mightydeveloper
Thank you for the detailed explanation!
Yes, only gradients for GPU 0, 1, 2 will be calculated.

However, it seems that trainer.step() will update the weights using the gradients from all 5 GPU contexts.
The gradients on GPUs 3 and 4 may not be zero tensors.

The code of trainer.step()
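
To illustrate the concern, continuing the hypothetical sketch from the earlier comment (assuming the network has already been trained on at least one full 5-sample batch):

full_x = mx.nd.random.uniform(shape=(5, 4))   # a full batch, one sample per context
full_y = mx.nd.random.uniform(shape=(5,))
for d, l in zip(gluon.utils.split_and_load(full_x, ctx_list),
                gluon.utils.split_and_load(full_y, ctx_list)):
    with autograd.record():
        loss = loss_fn(net(d), l)
    loss.backward()
print(net.weight.grad(ctx_list[4]).abs().sum())  # non-zero after a full batch

# A later 3-sample batch writes gradients only on contexts 0-2; the buffer on
# context 4 keeps this stale, non-zero value unless it is zeroed or ignored.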

@wkcn
Member

wkcn left a comment


LGTM : )
Thanks for your contribution!

@mightydeveloper
Contributor Author

However, it seems that trainer.step() will update the weights using the gradients from all 5 GPU contexts.
The gradients on GPUs 3 and 4 may not be zero tensors.

Thanks for checking and providing the relevant code link!

So... If I want to zero out gradients, I guess I should either

  1. call trainer._params.zero_grad()
  2. or just net.collect_params().zero_grad()
    (Assuming that I initialized the trainer with Trainer(net.collect_params(), ...))
  3. or maybe call trainer.step(ignore_stale_grad=True) instead?

Would all three of them work? Which one would be better?

@wkcn
Member

wkcn commented Apr 3, 2019

calling trainer.step(ignore_stale_grad=True) is okay : )
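
For example, continuing the hypothetical sketch from the earlier comments (the sgd optimizer and learning rate are arbitrary choices for illustration):

trainer = gluon.Trainer(net.collect_params(), 'sgd', {'learning_rate': 0.1})

# Contexts 3 and 4 never ran forward/backward in this step, so their gradient
# buffers are stale. ignore_stale_grad=True skips the stale-gradient check that
# would otherwise raise for them; here we normalize by the 3 real samples.
trainer.step(batch_size=3, ignore_stale_grad=True)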

@piyushghai
Contributor

@mightydeveloper Thanks for your contributions.

@mxnet-label-bot Add [Gluon]

wkcn added the pr-awaiting-review label Apr 7, 2019
wkcn merged commit daabe5c into apache:master Apr 8, 2019
@wkcn
Member

wkcn commented Apr 8, 2019

Merged. Thanks for your contribution!

haohuanw pushed a commit to haohuanw/incubator-mxnet that referenced this pull request Jun 23, 2019