This repository has been archived by the owner on Nov 17, 2023. It is now read-only.

[MXNET-560] Add temperature parameter in Softmax operator #11466

Merged
merged 23 commits into from
Jul 19, 2018

Conversation

apeforest
Contributor

Description

This PR addresses the request for a native temperature parameter in the softmax functions. See the linked issue for a more detailed discussion.

I have added the temperature parameter to the softmax operator. The default value of the temperature parameter is 1.0, and with the default both functions behave exactly as if the parameter were not specified.

Verified the change using the following code in Python:

import mxnet as mx

data = mx.sym.Variable('data')
net = mx.sym.softmax(data=data, temperature=10)

x = mx.nd.array([ 1,  2,  3])

ex = net.bind(mx.cpu(), args={'data': x})
ex.forward()

Expected output:

[
[ 0.30060961  0.33222499  0.3671654 ]
 <NDArray 3 @cpu(0)>]
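
As a quick cross-check (illustrative only, not part of the PR), the same numbers can be reproduced with plain NumPy by dividing the logits by the temperature before a standard softmax:

import numpy as np

x = np.array([1.0, 2.0, 3.0])
T = 10.0

# Softmax with temperature; subtracting the max is only for numerical stability.
z = (x - x.max()) / T
p = np.exp(z) / np.exp(z).sum()
print(p)  # [0.30060961 0.33222499 0.3671654 ]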

Checklist

Essentials

Please feel free to remove inapplicable items for your PR.

  • [x] The PR title starts with [MXNET-$JIRA_ID], where $JIRA_ID refers to the relevant JIRA issue created (except PRs with tiny changes)
  • Changes are complete (i.e. I finished coding on this PR)
  • [x] All changes have test coverage:
    • Added a unit test to cover the new parameter
  • Code is well-documented:
    • For user-facing API changes, the API doc string has been updated.
    • For new C++ functions in header files, their functionalities and arguments are documented.
    • For new examples, a README.md is added to explain what the example does, the source of the dataset, expected performance on the test set, and a reference to the original paper if applicable.
  • Check the API doc at http://mxnet-ci-doc.s3-accelerate.dualstack.amazonaws.com/PR-$PR_ID/$BUILD_ID/index.html
  • To the best of my knowledge, examples are either not affected by this change or have been fixed to be compatible with this change.

Changes

  • Softmax operator with new temperature parameter (unit test at test_operator.py:test_softmax_with_temperature)

Comments

  • This change is backward compatible. The default value of temperature is 1.0f.
  • Because users will empirically use this operator with the default temperature about 90% of the time, I have added an if/else branch in the CPU computation to remove the unnecessary divide-by-one operation and improve runtime. This is not done in the GPU kernel because 1) adding a branch would have a negative runtime impact on GPU, and 2) the CUDA compiler may already optimize this edge case.
  • I used the check_speed function to verify this optimization on CPU. In my experiment, I chose a vector of size 10000 and ran the forward pass 10 times; the runtime reduction is about 25%. The experiment code is pasted below (a comparison run for a non-default temperature is sketched after it):
import mxnet as mx
from mxnet.test_utils import check_speed

data_shape = (1, 10000)
data = mx.sym.Variable(name='data', shape=data_shape)
ctx = mx.cpu(0)
x = mx.nd.random.normal(0, 1.0, shape=data_shape, ctx=ctx)

# Time forward-only softmax with the default temperature (1.0).
net = mx.sym.softmax(data=data)
softmax_time = check_speed(sym=net, location={'data': x}, ctx=ctx, N=10,
                           grad_req='null', typ='forward') * 1000
print(softmax_time)
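
For comparison (this run was not part of the original experiment), a non-default temperature, which cannot skip the division, can be timed the same way; the gap between the two timings gives a rough sense of what the branch saves:

import mxnet as mx
from mxnet.test_utils import check_speed

data_shape = (1, 10000)
data = mx.sym.Variable(name='data', shape=data_shape)
ctx = mx.cpu(0)
x = mx.nd.random.normal(0, 1.0, shape=data_shape, ctx=ctx)

# A non-default temperature forces the divide-by-temperature path.
net_temp = mx.sym.softmax(data=data, temperature=0.5)
softmax_temp_time = check_speed(sym=net_temp, location={'data': x}, ctx=ctx, N=10,
                                grad_req='null', typ='forward') * 1000
print(softmax_temp_time)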

@apeforest
Contributor Author

apeforest commented Jun 28, 2018

@rahul003 @haojin2 @eric-haibin-lin @sandeep-krishnamurthy @Roshrini @szha

Thanks all for your reviews on my previous [WIP] PR. I have created this new one for review due to the mess caused by my previous merge. This PR also addresses the comments raised in the previous one.

@@ -53,7 +53,7 @@ struct log_softmax_fwd {

template<typename OP, typename DType, int ndim>
inline void Softmax(Stream<cpu> *s, DType *in, DType *out,
-                    Shape<ndim> shape, int axis) {
+                    Shape<ndim> shape, int axis, const float temperature) {
Member

should we support double precision and half precision too?

Contributor

This is a parameter, so it should be of a specific type. I think that to ensure the highest precision we should take this parameter in as a double and cast it to DType at runtime.

Contributor Author

Thanks for the suggestion. I have changed the data type from float to the generic DType.

@azai91
Contributor

azai91 commented Jun 29, 2018

@pengzhao-intel @zheng-da do you know if MKLDNN supports the temperature parameter? If not, then we will just have to fall back to FCompute.

Member

@eric-haibin-lin left a comment

Thx for the contribution. Looks in good shape.

data = np.random.uniform(-2, 2, size=shape)
sym = mx.sym.softmax(axis=axis, temperature=temp)
expected = np_softmax(data, axis=axis, temperature=temp)
check_symbolic_forward(sym, [data], [expected])
Member

Any reason why check_symbolic_backward is not tested?

Contributor Author

Added a check_symbolic_backward test.
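
For reference, a minimal sketch of what a combined forward/backward check can look like (np_softmax here is a local reference implementation written for this sketch, and the gradient expression is derived by hand; this is not the exact merged test):

import numpy as np
import mxnet as mx
from mxnet.test_utils import check_symbolic_forward, check_symbolic_backward

def np_softmax(x, axis=-1, temperature=1.0):
    # Reference softmax with temperature, written to be numerically stable.
    x = x - np.max(x, axis=axis, keepdims=True)
    e = np.exp(x / temperature)
    return e / np.sum(e, axis=axis, keepdims=True)

shape, axis, temp = (3, 4), -1, 2.0
data = np.random.uniform(-2, 2, size=shape)
sym = mx.sym.softmax(axis=axis, temperature=temp)

# Forward: compare against the NumPy reference.
expected = np_softmax(data, axis=axis, temperature=temp)
check_symbolic_forward(sym, [data], [expected])

# Backward: for output gradient g and softmax output s,
# d(loss)/d(input) = s * (g - sum(g * s)) / temperature.
out_grad = np.random.uniform(-1, 1, size=shape)
s = expected
in_grad = s * (out_grad - np.sum(out_grad * s, axis=axis, keepdims=True)) / temp
check_symbolic_backward(sym, [data], [out_grad], [in_grad])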

@@ -77,10 +78,13 @@ MXNET_OPERATOR_REGISTER_UNARY(softmax)
The resulting array contains elements in the range (0,1) and the elements along the given axis sum up to 1.

.. math::
   softmax(\mathbf{z/t})_j = \frac{e^{z_j/t}}{\sum_{k=1}^K e^{z_k/t}}
Member

Why not remove line 82?

Contributor Author

Removed

@apeforest
Contributor Author

@eric-haibin-lin Addressed your comments. Please review again at your convenience. Thx

@sxjscience
Member

Is mx.sym.softmax(data=data, temperature=10) the same as mx.sym.softmax(data=data / 10) ?

@apeforest
Contributor Author

@slitsey Could you please check if this enhancement meets your request?

@apeforest
Contributor Author

@sxjscience You raised a good point; I should have thought this through earlier. Running the forward() function gives the same result if you simply pass data/t as the input. However, I am not sure whether the backward() function behaves the same, since the derivative of the softmax function is different when the temperature parameter sits in the exponential term. Need more expert opinion on this.

@slitsey

slitsey commented Jul 5, 2018

@apeforest @sxjscience I think backpropagation will be different for a SoftmaxOutput final layer, because the derivative of the softmax with temperature will have an extra factor of 1/temperature. This leads to an extra factor of 1/temperature in all the d error/d weight terms. (Someone is welcome to double-check my math.) So it's not the same as uniformly scaling the data. Another way to think about this is that the derivatives are still with respect to the unscaled data. In other words, d f(x/T) /d x is what is used in backpropagation, not d f(x/T) /d (x/T), where f is the usual softmax function (temperature of 1). See also the comment here by isarandi on the first answer.
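To make that factor explicit (a hand derivation consistent with the comment above, not text from the PR): writing the temperature softmax as

.. math::
   s_j = softmax(\mathbf{z}/T)_j = \frac{e^{z_j/T}}{\sum_{k=1}^K e^{z_k/T}},
   \qquad
   \frac{\partial s_j}{\partial z_i} = \frac{1}{T}\, s_j \left(\delta_{ij} - s_i\right)

so every term in the backward pass carries the extra 1/T factor relative to a plain softmax whose derivative is taken with respect to the pre-scaled input z/T.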

@haojin2
Contributor

haojin2 commented Jul 5, 2018

@slitsey It's actually doable with existing layers: do an _mul_scalar first and then feed the output to the normal softmax; the backward pass of _mul_scalar will add the 1/T factor to the gradient during backpropagation.
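
A minimal sketch of that chaining (the scalar multiply on a Symbol lowers to _mul_scalar; the variable names here are illustrative):

import mxnet as mx

T = 10.0
data = mx.sym.Variable('data')

# Multiplying a Symbol by a Python scalar lowers to _mul_scalar; its backward pass
# multiplies the incoming gradient by the same scalar (1/T), which is exactly the
# extra factor a native temperature parameter would contribute.
scaled = data * (1.0 / T)
net = mx.sym.softmax(data=scaled)

x = mx.nd.array([1, 2, 3])
ex = net.bind(mx.cpu(), args={'data': x})
print(ex.forward())  # same values as mx.sym.softmax(data=data, temperature=10)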

@slitsey

slitsey commented Jul 5, 2018

@haojin2 That's a good point. Would that still be as computationally efficient? Would adding an extra layer like this make things appreciably slower than having a native parameter, or would it be essentially the same?

@haojin2
Contributor

haojin2 commented Jul 5, 2018

@slitsey I don't think it would introduce much overhead, and we wouldn't need many changes if we use existing ops.

@slitsey

slitsey commented Jul 5, 2018

@haojin2 That's what I figured. In that case, I leave it up to the community to decide how to proceed. If we decide not to incorporate the parameter natively and go with the _mul_scalar layer solution, I would at least request a documentation change to SoftmaxOutput pointing this out as an example use case of chaining these two layers, given how prominent it is.

@haojin2
Contributor

haojin2 commented Jul 5, 2018

@slitsey Actually, going in the direction I just described would require some changes in the Python code for SoftmaxOutput. Alternatively, we can make a new API named something like SoftmaxWithTemperature, and we can then have the proper documentation there.

@apeforest
Contributor Author

@slitsey I am leaning towards @haojin2's second suggestion to make a new API named SoftmaxWithTemperature. This way it is easier to use and does not impact the performance of the regular Softmax operator. Please let me know what you think.

@slitsey

slitsey commented Jul 6, 2018

@apeforest I'm fine with that idea. Would we maintain the softmax vs SoftmaxOutput distinction too? So two new API additions?

@haojin2
Contributor

haojin2 commented Jul 6, 2018

Well, we also have the SoftmaxActivation thing... so I guess 3 new API additions then. However we do this, there will be some API changes, and since this usage is not as common as the T=1 case, I would prefer that we add this in the form of new APIs. Just to be clear: I'm not opposed to adding a new option to the existing APIs, but I want to make sure we have minimal performance regression and introduce the least confusion for our users.

@slitsey

slitsey commented Jul 6, 2018

@haojin2 That makes sense. To play devil's advocate, I would argue that having six different variations of softmax is a bit ridiculous, and more confusing than adding a single parameter to existing functions. Since the default behavior is unchanged, most users won't even know the difference from a UX perspective. You can't create a new function every time someone wants to add a new feature to an existing function; that kind of function proliferation generates far more user confusion than adding a parameter they can safely ignore.

The performance regression argument is much more significant. However, while I'm not an expert on this, I suspect that adding an if statement to a function's execution is about as minimal a performance drop as you can get. (Especially given that it's just deciding whether or not to perform a floating-point division!) Most other modifications would have a more substantial impact. Feel free to correct me if I'm mistaken on this point.

Finally, making an enhancement to an important function like softmax can itself induce use of that feature. I wouldn't be surprised if the fraction of softmax users that employ this feature ends up being higher than one may think, given that it's a hyperparameter applicable to most use cases that tends to be ignored only due to ignorance or to maintain simplicity.

Anyway, I'll let everyone else weigh in. For my own use case, having new SoftmaxWithTemperature functions would be fine, but I suspect for the library as a whole, it may be better to incorporate the change within the existing functions.

@haojin2
Contributor

haojin2 commented Jul 6, 2018

Actually, something I did not mention is compatibility: modified APIs affect the exported JSON file for models and may cause problems for earlier versions of MXNet. As for the performance issue, it's more than an additional if statement; you can see that it also affects MKLDNN's version.

@apeforest
Contributor Author

@haojin2 If we decide to just call _mul_scalar under the hood without actually modifying the backend implementation, then we don't have the compatibility issue, right? If that's the case, adding an option to the existing Softmax function seems to be the more desirable approach.

@slitsey

slitsey commented Jul 10, 2018

That sounds like a reasonable approach. How would that affect performance?

@apeforest
Contributor Author

If we add an if statement up front instead of inside every operation, the performance impact should be minimal.

@slitsey

slitsey commented Jul 11, 2018

@haojin2 Does this approach sound like something you would be comfortable with? It seems like it resolves the issues we've discussed in an elegant and efficient way.

@apeforest requested a review from nswamy as a code owner on July 12, 2018
@apeforest
Contributor Author

apeforest commented Jul 12, 2018

@haojin2 Changed the temperature parameter to optional as we discussed. Added the unit test test_softmax_with_temperature() to test temperature values ranging from 1 to 10. Please help review again. Thanks!

@haojin2
Contributor

haojin2 commented Jul 13, 2018

@piiswrong @reminisce @anirudh2290 @eric-haibin-lin Please give a review on this when you have time, thanks!

@apeforest
Contributor Author

@eric-haibin-lin Whom should I ask to merge this PR if it's approved?

    for (index_t j = 0; j < M; ++j) {
      sum += std::exp(in[base + j*sa] - mmax);
    }
    if (temperature == 1.0) {
Member

@nswamy Jul 19, 2018

Could you add a comment on why you are branching for 1.0, and also note that this is not useful for GPU?
Generally it's good practice to add comments around code that needs special handling, or code that you figured out only after scratching your head :)

Contributor Author

By default the value of temperature is 1.0; users will use other values only in reinforcement learning use cases. For CPU, the compiler cannot optimize away this divide-by-1.0 computation at runtime, so I added a branch here. The performance difference was measured using the example shown in the Description of this PR. This branch is not added in the GPU kernel because branching would add extra overhead on GPU.

Member

I meant in the code :)

Contributor Author

Thanks for the suggestion; I have added two comments in the code to make it clear for other developers in the future.

@nswamy merged commit b1f2f44 into apache:master on Jul 19, 2018
KellenSunderland pushed a commit to KellenSunderland/incubator-mxnet that referenced this pull request Jul 21, 2018
* Add temperature parameter in softmax operator and add a unit test

* Optimize runtime when temperature is set to default 1.0

* Add temperature parameter in softmax operator and add a unit test
XinYao1994 pushed a commit to XinYao1994/incubator-mxnet that referenced this pull request Aug 29, 2018
* Add temperature parameter in softmax operator and add a unit test

* Optimize runtime when temperature is set to default 1.0

* Add temperature parameter in softmax operator and add a unit test
@apeforest deleted the feature/operator_enhancement branch on August 23, 2019