Implementation of Instance Normalization #808
Conversation
Hey, thank you for the PR! This technique has also been termed "Layer normalization" before, see #736, #738, #762. In that case, it also contains the learnable scale and shift as in batch normalization.
Yes, that's a bummer. Another disadvantage of using the batch normalization layer for this is that you'll need to pass extra arguments to [...]. Would you like to update your PR such that [...]?
I'd be happy to. Wouldn't mind some guidance in the process, though. You proposed a base class [...]
Sure! :)
Ah. There was a catch. We want [...]

Okay, I think we can forget about the common base class and go with an [...]

Now the question is what to do about layer normalization. Checking their paper again (https://arxiv.org/abs/1607.06450), they normalize across all but the batch axis (i.e., across all features) and learn a scalar shift and scale. I'm leaning towards omitting this. From the paper, layer normalization actually gives each neuron its own scale and shift, so it's not tied to the normalization axes as in batch normalization. It's probably easier to just use a sequence of instance normalization, bias and scale layers for this.

Great, I'll just review your current PR now!
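To illustrate the sequence-of-layers idea mentioned above, here is a minimal sketch (assuming the InstanceNormLayer proposed in this PR is importable from lasagne.layers; this is not code from the PR itself):

```python
# assumes the InstanceNormLayer proposed in this PR is exposed in lasagne.layers
from lasagne.layers import (BiasLayer, DenseLayer, InputLayer, InstanceNormLayer,
                            NonlinearityLayer, ScaleLayer)
from lasagne.nonlinearities import rectify

# 'layer normalization' composed from simpler pieces: standardize each example
# over its features, then apply a learnable per-feature gain and shift
l = InputLayer((None, 100))
l = DenseLayer(l, num_units=200, b=None, nonlinearity=None)
l = InstanceNormLayer(l, axes=1)   # normalize over the feature axis, per example
l = ScaleLayer(l)                  # learnable gain (gamma)
l = BiasLayer(l)                   # learnable shift (beta)
l = NonlinearityLayer(l, rectify)
```

With their defaults, ScaleLayer and BiasLayer learn one parameter per feature, which matches giving each neuron its own scale and shift.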
A few comments, and a question about the bias.
In addition, this needs to be added to the docs (layers.rst and layers/normalization.rst), and it needs tests.
lasagne/layers/normalization.py
@@ -49,6 +49,8 @@
"LocalResponseNormalization2DLayer",
"BatchNormLayer",
"batch_norm",
"InstanceNormLayer",
"instance_norm"
Please add a comma at the end. This ensures we can add lines at the end without changing the previous ones. (Just like you did here.)
lasagne/layers/normalization.py
lasagne.layers.InstanceNormLayer(incoming, axes='auto', epsilon=1e-4,
**kwargs)
Instance Normalization
This layer implements instance normalization of its inputs, following [1]_:
You seem to be missing a newline here. And I'm not sure if [1] needs to be so prominent (but we can leave it in the references). Is there a way to find out who used this first, for proper credit assignment?
lasagne/layers/normalization.py
class InstanceNormLayer(Layer):
"""
lasagne.layers.InstanceNormLayer(incoming, axes='auto', epsilon=1e-4,
**kwargs)
You can omit this here. It's only needed if the default arguments include object instances that don't have a nice string representation.
lasagne/layers/normalization.py
That is, each example (instance) in the input is normalized to zero mean
and unit variance. The crucial part is that the mean and variance are
NOT computed across the batch dimension, i.e., not over examples,
but per example. During both training and testing, :math:`\\mu_i` and
Not sure if we should call this "crucial", since that's what you'd expect (if you don't know batch normalization). Maybe weaken this a bit like:
"That is, each input example (instance) is normalized to zero mean and unit variance. In contrast to batch normalization, the mean and variance is usually taken per example, and not across examples, so the same operation can be applied during training and testing."
lasagne/layers/normalization.py
:func:`lasagne.layers.get_output()` when building the output expression
of any network containing this layer.
During training, [1]_ normalize each instance in an input mini-batch
by its statistics.
The last two sentences can be omitted. There is no behaviour to be controlled, and I guess there would be earlier references than [1].
self.epsilon = epsilon

def get_output_for(self, input, **kwargs):
All this code can be simplified to:
mean = input.mean(self.axes, keepdims=True)
std = T.sqrt(input.var(self.axes, keepdims=True) + self.epsilon)
return (input - mean) / std
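For context, a minimal sketch of how that simplification could sit inside a complete layer (the argument names and the 'auto' default are assumptions, not the PR's exact code):

```python
import theano.tensor as T
from lasagne.layers import Layer


class InstanceNormLayer(Layer):
    """Minimal sketch: normalize each example over the given axes."""

    def __init__(self, incoming, axes='auto', epsilon=1e-4, **kwargs):
        super(InstanceNormLayer, self).__init__(incoming, **kwargs)
        if axes == 'auto':
            # normalize over all spatial axes of a convolutional input
            axes = tuple(range(2, len(self.input_shape)))
        self.axes = axes
        self.epsilon = epsilon

    def get_output_for(self, input, **kwargs):
        mean = input.mean(self.axes, keepdims=True)
        std = T.sqrt(input.var(self.axes, keepdims=True) + self.epsilon)
        return (input - mean) / std
```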
lasagne/layers/normalization.py
will steal the layer's nonlinearity if there is one (effectively
introducing the normalization right before the nonlinearity), remove
the layer's bias if there is one (because it would be redundant), and add
a :class:`InstanceNormLayer` and :class:`NonlinearityLayer` on top.
Hmm, we need to have a word on that bias. Currently you remove it without giving the network a chance to introduce an offset afterwards. What do the instance normalization authors do in their paper?
You're right, this is an important issue to converge on. The instance normalization paper does not mention the bias whatsoever, suggesting that they don't remove it and plainly apply instance norm on top of the layer's output. As you mentioned earlier, 'layer normalization' (e.g. https://arxiv.org/pdf/1607.06450.pdf) is conceptually slightly different from instance norm in that it retains a gain and bias (gamma and beta in BatchNormLayer). We could delete the bias of the preceding (e.g. conv) layer and retain gain and bias as learnable (basically scalar) parameters in InstanceNormLayer?
Ok, so my latest commit is a suggestion to cover both ideas, 'layer normalization' and 'instance normalization'. I added beta and gamma as learnable parameters, which are broadcast across the batch size as well as the trailing dimensions beyond the 2nd dimension. This way 'layer normalization' as in https://arxiv.org/pdf/1607.06450.pdf should be available, as well as 'instance normalization' as in https://arxiv.org/pdf/1607.08022.pdf, when setting beta and gamma to None.
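For illustration, the broadcasting described here could look roughly like this in plain Theano (the per-channel parameter shape and the names are assumptions, not the commit itself):

```python
import numpy as np
import theano
import theano.tensor as T

# hypothetical per-channel parameters for input of shape (batch, channels, h, w)
gamma = theano.shared(np.ones(64, dtype='float32'), name='gamma')
beta = theano.shared(np.zeros(64, dtype='float32'), name='beta')

x = T.tensor4('x')             # the standardized activations
pattern = ('x', 0, 'x', 'x')   # broadcast over the batch axis and trailing axes
y = x * gamma.dimshuffle(*pattern) + beta.dimshuffle(*pattern)
```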
Ah, there's one issue remaining. If beta is set to None, we shouldn't delete the incoming layer's bias. Fixed this in the instance_norm() convenience function.
The instance normalization paper does not mention the bias whatsoever, suggesting that they don't remove it and plainly apply instance norm on top of the layer's output.
But that would subtract the bias again. I doubt that's a good idea. Did they provide any code?
I added beta and gamma as learnable parameters, which are broadcast across the batch-size as well as the trailing dimensions beyond the 2nd dimension.
This fits the layer normalization paper, but it's a bit sad we can't control the axes then. I'm torn between leaving it as it is, or not providing beta/gamma at all and offloading this to a ScaleLayer and BiasLayer on top (which could be added automatically by the convenience function if asked for). This would simplify the layer and tests. For the batch normalization layer it was good to have everything in a single layer, because this makes it possible to use cuDNN's batch normalization primitives. For instance normalization, I doubt this will play a role.
Hi Jan, sorry I was a bit absorbed with other things these past days. Concerning your first comment, that would mean we'd always want to have a learnable bias, and just the gamma factor would optionally be learnable, right? With respect to whether or not to include gamma/beta in the way it is currently done: did you converge on a preferred implementation? If so, I could go ahead with the implementation and the corresponding tests.
lasagne/layers/normalization.py
introducing the normalization right before the nonlinearity), remove
the layer's bias if there is one (because it would be redundant), and add
a :class:`InstanceNormLayer` and :class:`NonlinearityLayer` on top.
Parameters
You're missing the newline before.
lasagne/layers/normalization.py
**kwargs
Any additional keyword arguments are passed on to the
:class:`InstanceNormLayer` constructor.
Returns
Missing newline before.
lasagne/layers/normalization.py
>>> from lasagne.nonlinearities import rectify
>>> l1 = InputLayer((10, 3, 28, 28))
>>> l2 = instance_norm(Conv2DLayer(l1, num_filters=64, filter_size=3,
nonlinearity=rectify))
Needs continuation characters "..." at the beginning of the line.
Some more comments. I'm not sure what's the best way for beta and gamma... it would simplify the code, documentation and tests to omit them from the layer and add learn_bias and learn_scale arguments to the instance_norm convenience function that would add a ScaleLayer and BiasLayer if needed.
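A rough sketch of that variant, modelled on the existing batch_norm convenience function (the InstanceNormLayer name and the exact bias handling are assumptions, not the PR's code):

```python
from lasagne import nonlinearities
from lasagne.layers import BiasLayer, NonlinearityLayer, ScaleLayer


def instance_norm(layer, learn_scale=True, learn_bias=True, **kwargs):
    # steal the nonlinearity, so normalization is inserted right before it
    nonlinearity = getattr(layer, 'nonlinearity', None)
    if nonlinearity is not None:
        layer.nonlinearity = nonlinearities.identity
    # drop the preceding layer's bias only if a learnable shift replaces it
    if learn_bias and hasattr(layer, 'b') and layer.b is not None:
        del layer.params[layer.b]
        layer.b = None
    layer = InstanceNormLayer(layer, **kwargs)  # the layer proposed in this PR
    if learn_scale:
        layer = ScaleLayer(layer)               # learnable gain
    if learn_bias:
        layer = BiasLayer(layer)                # learnable shift
    if nonlinearity is not None:
        layer = NonlinearityLayer(layer, nonlinearity)
    return layer
```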
lasagne/layers/normalization.py
**kwargs
Any additional keyword arguments are passed to the :class:`Layer`
superclass.
Notes
There needs to be an empty line in front of every section header (Parameters, Notes, See also, References), otherwise the documentation will not render correctly.
lasagne/layers/normalization.py
The axis or axes to normalize over. If ``'auto'`` (the default),
normalize over all axes except for the first two:
this will normalize over all spatial dimensions for
convolutional layers.
We might want to be explicit that this does nothing for a dense layer: "over all spatial dimensions for convolutional layers, and do nothing for dense layers."
And then we might want to add another shortcut string to do layer normalization, i.e., normalize over all axes except the first. I'm not sure about the naming, though. Maybe we shouldn't have auto, but spatial for the current default and features for including the first dimension. What do you think?
With the new StandardizationLayer, it might make sense to have 'auto' normalize over the second axis for 2D input, and over all but the first two axes for 3D+ input. This would do something useful for dense layers, recurrent layers and convolutional layers.
Sounds good. It's implemented in the latest commit.
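A sketch of how that 'auto' shortcut could be resolved from the input shape (function name and structure assumed, not the commit itself):

```python
def resolve_axes(axes, input_shape):
    """Resolve the 'auto' shortcut for a StandardizationLayer-style layer."""
    if axes == 'auto':
        if len(input_shape) == 2:
            # dense layers: normalize each example over its features
            return (1,)
        # 3D+ input (recurrent, convolutional): normalize over all axes
        # except the batch axis and the second (feature/channel) axis
        return tuple(range(2, len(input_shape)))
    if isinstance(axes, int):
        return (axes,)
    return tuple(axes)
```

For a (batch, channels, rows, cols) input this yields (2, 3); for a (batch, features) input it yields (1,).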
lasagne/layers/normalization.py
convolutional layers.
epsilon : scalar
Small constant :math:`\\epsilon` added to the variance before taking
the square root and dividing by it, to avoid numerical problems
beta and gamma need to be described here if we want them in the layer.
lasagne/layers/normalization.py
class InstanceNormLayer(Layer):
"""
With beta and gamma, you'll need to give the signature here (as in BatchNormLayer). Only without beta and gamma can we just rely on the automatically generated signature for the rendered documentation.
lasagne/layers/normalization.py
:class:`DenseLayer`, or :class:`Conv2DLayer`) and its nonlinearity. The
convenience functions :func:`instance_norm` and :func:`layer_norm`
modify an existing layer to insert instance normalization or
layer normalization in front of its nonlinearity.
I think this doesn't apply here. For a generic standardization layer, we shouldn't state any rules.
lasagne/layers/normalization.py
layer_norm : Convenience function to apply layer normalization to a layer
References

----------
You messed up something with the References title here :)
lasagne/layers/normalization.py
https://arxiv.org/pdf/1607.08022.pdf.

.. [2] Ba, J., Kiros, J., & Hinton, G. (2016):
Layer normalization. arXiv preprint arXiv:1607.06450.
Let's unify this and give the arxiv link in both cases, but using abs instead of pdf.
lasagne/layers/normalization.py
def instance_norm(layer, learn_scale=True, learn_bias=True, **kwargs):
"""
Apply instance normalization to an existing layer. This is a convenience
The indentation is wrong (4 characters too much).
lasagne/layers/normalization.py
irreversibly modified as specified above
**kwargs
Any additional keyword arguments are passed on to the
:class:`StandardizationLayer` constructor.
learn_scale and learn_bias need to be described.
Looks good and tests pass, thanks and sorry for the long delay! There are still a few issues with the docstrings, but I'll just polish that myself before merging.
docs/modules/layers.rst
@@ -163,6 +163,8 @@
LocalResponseNormalization2DLayer
BatchNormLayer
batch_norm
InstanceNormLayer
instance_norm
This doesn't match the code any more. Should have StandardizationLayer, instance_norm and layer_norm instead.
:members:

.. autofunction:: instance_norm
This doesn't match the code any more. Should have StandardizationLayer, instance_norm and layer_norm instead.
Finally finished and merged -- thanks again and sorry for the long pause.
Added a layer called 'InstanceNormLayer' that implements instance normalization of its inputs according to 'Instance Normalization: The Missing Ingredient for Fast Stylization' (https://arxiv.org/pdf/1607.08022.pdf).
While the existing BatchNormLayer is able to implement instance normalization by setting the axes, beta and gamma arguments accordingly, it still requires a fixed batch size as learned parameters need to be initialized.
The proposed InstanceNormLayer does away with that requirement, allowing for more flexibility.
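As a usage sketch of the merged convenience function (mirroring the docstring example discussed above, and assuming instance_norm is exposed from lasagne.layers as listed in the docs; the unspecified batch size illustrates that no fixed batch size is required):

```python
from lasagne.layers import Conv2DLayer, InputLayer, instance_norm
from lasagne.nonlinearities import rectify

# no fixed batch size needed, since no learned parameters depend on it
l1 = InputLayer((None, 3, 28, 28))
l2 = instance_norm(Conv2DLayer(l1, num_filters=64, filter_size=3,
                               nonlinearity=rectify))
```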