
Implementation of Instance Normalization #808

Merged
merged 2 commits into from
Jun 10, 2018

Conversation

SimonKohl

Added a layer called 'InstanceNormLayer' that implements instance normalization of its inputs, following 'Instance Normalization: The Missing Ingredient for Fast Stylization' (https://arxiv.org/pdf/1607.08022.pdf).

While the existing BatchNormLayer is able to implement instance normalization by setting the axes, beta and gamma arguments accordingly, it still requires a fixed batch size as learned parameters need to be initialized.

The proposed InstanceNormLayer does without that requirement, allowing for more flexibility.
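For reference, the operation the proposed layer performs can be sketched in a few lines of NumPy (the actual layer is written against Theano; the function name and 4D shape here are illustrative):

```python
import numpy as np

def instance_norm(x, epsilon=1e-4):
    # x: (batch, channels, rows, cols); normalize each (example, channel)
    # slice over the spatial axes only -- no statistics cross the batch axis,
    # so no fixed batch size is needed and no parameters are learned.
    mean = x.mean(axis=(2, 3), keepdims=True)
    var = x.var(axis=(2, 3), keepdims=True)
    return (x - mean) / np.sqrt(var + epsilon)

x = np.random.randn(5, 3, 8, 8)
y = instance_norm(x)
# each (example, channel) slice now has roughly zero mean and unit variance
```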


f0k commented Mar 2, 2017

Hey, thank you for the PR! This technique has also been termed "Layer normalization" before, see #736, #738, #762. In that case, it also contains the learnable scale and shift as in batch normalization.

While the existing BatchNormLayer is able to implement instance normalization by setting the axes, beta and gamma arguments accordingly, it still requires a fixed batch size as learned parameters need to be initialized.

Yes, that's a bummer. Another disadvantage of using the batch normalization layer for this is that you'll need to pass extra arguments to lasagne.layers.get_output to disable learning and using the exponential running average statistics.
As argued in #736 (comment), I'd prefer to have a NormLayer as a new base class for BatchNormLayer, such that the BatchNormLayer only adds the running average statistics on top of the NormLayer implementation and changes the default axes. This would reduce some code duplication. I like the term "instance normalization", though, we could still use instance_norm for the convenience function. (I think I wouldn't use InstanceNormLayer for the class, because depending on the axes, it could do something else, and it would sound a bit weird to derive batch normalization from instance normalization. On the other hand, depending on the axes, BatchNormLayer also doesn't necessarily normalize a batch. I'm open for other arguments.)

Would you like to update your PR such that NormLayer includes mean and gamma and becomes a base class for BatchNormLayer?


SimonKohl commented Mar 2, 2017

I'd be happy to. Wouldn't mind some guidance in the process though.

You proposed a base class NormLayer that the deriving class BatchNormLayer would alter w.r.t. default axes and the running average statistics.

  1. What about the initialization of gamma and beta? It would have to differ between BatchNormLayer and a prospective InstanceNormLayer (in order to drop the requirements on batch size, see above). With InstanceNormLayer becoming a base NormLayer, would we just overwrite the initializations of gamma and beta in BatchNormLayer?
  2. The get_output_for() method of NormLayer would also be basically overwritten by BatchNormLayer?


f0k commented Mar 2, 2017

Wouldn't mind some guidance in the process though.

Sure! :)

What about the initialization of gamma and beta?

Ah. There was a catch. We want gamma and beta to be shared across the batch, but not normalize data across the batch. This means the interface is actually very different.

Okay, I think we can forget about the common base class and go with an InstanceNormLayer.

Now the question is what to do about layer normalization. Checking their paper again (https://arxiv.org/abs/1607.06450), they normalize across all but the batch axis (i.e., across all features) and learn a scalar shift and scale. I'm leaning towards omitting this. From the paper, layer normalization actually gives each neuron its own scale and shift, so it's not tied to the normalization axes as in batch normalization. It's probably easier to just use a sequence of instance normalization, bias and scale layers for this.
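The distinction between the two schemes comes down to which axes the statistics are taken over; a hedged NumPy sketch (axis conventions assumed from the discussion, not taken from the Lasagne code):

```python
import numpy as np

x = np.random.randn(4, 3, 8, 8)  # (batch, channels, rows, cols)

# Instance normalization: statistics per example *and* per channel
inst = (x - x.mean(axis=(2, 3), keepdims=True)) / x.std(axis=(2, 3), keepdims=True)

# Layer normalization: statistics per example, across all feature axes
layer = (x - x.mean(axis=(1, 2, 3), keepdims=True)) / x.std(axis=(1, 2, 3), keepdims=True)

# Neither variant uses the batch axis, so the same expression can be
# applied unchanged during training and testing.
```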

Great, I'll just review your current PR now!

@f0k f0k left a comment

A few comments, and a question about the bias.

In addition, this needs to be added to the docs (layers.rst and layers/normalization.rst), and it needs tests.

@@ -49,6 +49,8 @@
"LocalResponseNormalization2DLayer",
"BatchNormLayer",
"batch_norm",
"InstanceNormLayer",
"instance_norm"
Member

Please add a comma at the end. This ensures we can add lines at the end without changing the previous ones. (Just like you did here.)

lasagne.layers.InstanceNormLayer(incoming, axes='auto', epsilon=1e-4,
**kwargs)
Instance Normalization
This layer implements instance normalization of its inputs, following [1]_:
Member

You seem to be missing a newline here. And I'm not sure if [1] needs to be so prominent (but we can leave it in the references). Is there a way to find out who used this first, for proper credit assignment?

class InstanceNormLayer(Layer):
"""
lasagne.layers.InstanceNormLayer(incoming, axes='auto', epsilon=1e-4,
**kwargs)
Member

You can omit this here. It's only needed if the default arguments include object instances that don't have a nice string representation.

That is, each example (instance) in the input is normalized to zero mean
and unit variance. The crucial part is that the mean and variance are
NOT computed across the batch dimension, i.e., not over examples,
but per example. During both training and testing, :math:`\\mu_i` and
Member

Not sure if we should call this "crucial", since that's what you'd expect (if you don't know batch normalization). Maybe weaken this a bit like:
"That is, each input example (instance) is normalized to zero mean and unit variance. In contrast to batch normalization, the mean and variance is usually taken per example, and not across examples, so the same operation can be applied during training and testing."

:func:`lasagne.layers.get_output()` when building the output expression
of any network containing this layer.
During training, [1]_ normalize each instance in an input mini-batch
by its statistics.
Member

The last two sentences can be omitted. There is no behaviour to be controlled, and I guess there would be earlier references than [1].


self.epsilon = epsilon

def get_output_for(self, input, **kwargs):
Member

All this code can be simplified to:

mean = input.mean(self.axes, keepdims=True)
std = T.sqrt(input.var(self.axes, keepdims=True) + self.epsilon)
return (input - mean) / std
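As a sanity check, the suggested three-line computation can be transcribed to NumPy (with self.axes and self.epsilon replaced by concrete illustrative values) and shown to produce zero mean and unit variance over the chosen axes:

```python
import numpy as np

def normalize(input, axes=(2, 3), epsilon=1e-4):
    # NumPy transcription of the suggested Theano snippet
    mean = input.mean(axes, keepdims=True)
    std = np.sqrt(input.var(axes, keepdims=True) + epsilon)
    return (input - mean) / std

out = normalize(np.random.randn(2, 3, 16, 16))
```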

will steal the layer's nonlinearity if there is one (effectively
introducing the normalization right before the nonlinearity), remove
the layer's bias if there is one (because it would be redundant), and add
a :class:`InstanceNormLayer` and :class:`NonlinearityLayer` on top.
Member

Hmm, we need to have a word on that bias. Currently you remove it without giving the network a chance to introduce an offset afterwards. What do the instance normalization authors do in their paper?

Author

You're right, this is an important issue to converge on. The instance normalization paper does not mention the bias whatsoever, suggesting that they don't remove it and plainly apply instance norm on top of the layer's output. As you mentioned earlier, 'layer normalization' (e.g. https://arxiv.org/pdf/1607.06450.pdf) is conceptually slightly different from instance norm in that it retains a gain and bias (gamma and beta in BatchNormLayer). We could delete the bias of the preceding (e.g. conv) layer and retain gain and bias as learnable (basically scalar) parameters in InstanceNormLayer?

@SimonKohl SimonKohl Mar 3, 2017

Ok, so my latest commit is a suggestion to cover both ideas, 'layer normalization' and 'instance normalization'. I added beta and gamma as learnable parameters, which are broadcast across the batch size as well as the trailing dimensions beyond the 2nd dimension. This way 'layer normalization' as in https://arxiv.org/pdf/1607.06450.pdf should be available, as well as 'instance normalization' as in https://arxiv.org/pdf/1607.08022.pdf, when setting beta and gamma to None.
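Under this scheme, gamma and beta would have one entry per channel and be broadcast over the batch axis and all trailing axes; a NumPy sketch of the shapes involved (the parameter handling is assumed from the description above, not taken from the commit):

```python
import numpy as np

x = np.random.randn(4, 3, 8, 8)   # (batch, channels, rows, cols)
gamma = np.ones(3)                # one learnable scale per channel (2nd dim)
beta = np.zeros(3)                # one learnable shift per channel

normed = (x - x.mean(axis=(2, 3), keepdims=True)) / x.std(axis=(2, 3), keepdims=True)
# reshape to (1, C, 1, 1) so the parameters broadcast over the batch
# axis and the trailing spatial axes
out = gamma.reshape(1, -1, 1, 1) * normed + beta.reshape(1, -1, 1, 1)
# setting gamma/beta to None (i.e. skipping this step) recovers plain instance norm
```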

Author

Ah, there's one issue remaining. If beta is set to None, we shouldn't delete the incoming layer's bias. Fixed this in instance_norm() convenience function.

Member

The instance normalization paper does not mention the bias whatsoever, suggesting that they don't remove it and plainly apply instance norm on top of the layers output.

But that would subtract the bias again. I doubt that's a good idea. Did they provide any code?

I added beta and gamma as learnable parameters, which are broadcast across the batch-size as well as the trailing dimensions beyond the 2nd dimension.

This fits the layer normalization paper, but it's a bit sad we can't control the axes then. I'm torn between leaving it as it is, or not providing beta/gamma at all and offloading this to a ScaleLayer and BiasLayer on top (which could be added automatically by the convenience function if asked for). This would simplify the layer and tests. For the batch normalization layer it was good to have everything in a single layer, because this allows to use cuDNN's batch normalization primitives. For instance normalization, I doubt this will play a role.

Author

Hi Jan, sorry I was a bit absorbed with other things these past days. Concerning your first comment, that would mean we'd always want to have a learnable bias, just the gamma factor would optionally be learnable, right? With respect to whether or not to include gamma/beta in the way it is currently done: Did you converge on a preferred implementation? If so, I could go ahead with the implementation and the corresponding tests.

introducing the normalization right before the nonlinearity), remove
the layer's bias if there is one (because it would be redundant), and add
a :class:`InstanceNormLayer` and :class:`NonlinearityLayer` on top.
Parameters
Member

You're missing a newline before this.

**kwargs
Any additional keyword arguments are passed on to the
:class:`InstanceNormLayer` constructor.
Returns
Member

Missing newline before.

>>> from lasagne.nonlinearities import rectify
>>> l1 = InputLayer((10, 3, 28, 28))
>>> l2 = instance_norm(Conv2DLayer(l1, num_filters=64, filter_size=3,
nonlinearity=rectify))
Member

Needs continuation characters ... at the beginning of the line.

@f0k f0k left a comment

Some more comments. I'm not sure what's the best way for beta and gamma... it would simplify the code, documentation and tests to omit them from the layer and add learn_bias and learn_scale arguments to the instance_norm convenience function that would add a ScaleLayer and BiasLayer if needed.

**kwargs
Any additional keyword arguments are passed to the :class:`Layer`
superclass.
Notes
Member

There needs to be an empty line in front of every section header (Parameters, Notes, See also, References), otherwise the documentation will not render correctly.

The axis or axes to normalize over. If ``'auto'`` (the default),
normalize over all axes except for the first two:
this will normalize over all spatial dimensions for
convolutional layers.
Member

We might want to be explicit that this does nothing for a dense layer: "over all spatial dimensions for convolutional layers, and do nothing for dense layers."
And then we might want to add another shortcut string to do layer normalization, i.e., normalize over all axes except the first. I'm not sure about the naming, though. Maybe we shouldn't have auto, but spatial for the current default and features for including the first dimension. What do you think?

Member

With the new StandardizationLayer, it might make sense to have 'auto' normalize over the second axis for 2D input, and over all but the first two axes for 3D+ input. This would do something useful for dense layers, recurrent layers and convolutional layers.
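The proposed 'auto' rule can be sketched as a small helper (an illustration of the rule as described, not the Lasagne code itself):

```python
def auto_axes(ndim):
    # 'auto': the second axis for 2D input (dense layers), and all but the
    # first two axes for 3D+ input (recurrent and convolutional layers)
    if ndim == 2:
        return (1,)
    return tuple(range(2, ndim))

print(auto_axes(2))  # (1,)
print(auto_axes(4))  # (2, 3)
```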

Author

Sounds good. It's implemented in the latest commit.

convolutional layers.
epsilon : scalar
Small constant :math:`\\epsilon` added to the variance before taking
the square root and dividing by it, to avoid numerical problems
Member

beta and gamma need to be described here if we want them in the layer.



class InstanceNormLayer(Layer):
"""
Member

With beta and gamma, you'll need to give the signature here (as in BatchNormLayer). Only without beta and gamma we can just rely on the automatically generated signature for the rendered documentation.


:class:`DenseLayer`, or :class:`Conv2DLayer`) and its nonlinearity. The
convenience functions :func:`instance_norm` :func:`layer_norm`
modify an existing layer to insert instance normalization or
layer normalization in front of its nonlinearity.
Member

I think this doesn't apply here. For a generic standardization layer, we shouldn't state any rules.

layer_norm : Convenience function to apply layer normalization to a layer
References

----------
Member

You messed up something with the References title here :)

https://arxiv.org/pdf/1607.08022.pdf.

.. [2] Ba, J., Kiros, J., & Hinton, G. (2016):
Layer normalization. arXiv preprint arXiv:1607.06450.
Member

Let's unify this and give the arxiv link in both cases, but using abs instead of pdf.


def instance_norm(layer, learn_scale=True, learn_bias=True, **kwargs):
"""
Apply instance normalization to an existing layer. This is a convenience
Member

The indentation is wrong (4 characters too much).

irreversibly modified as specified above
**kwargs
Any additional keyword arguments are passed on to the
:class:`StandardizationLayer` constructor.
Member

learn_scale and learn_bias need to be described.

@f0k f0k left a comment

Looks good and tests pass, thanks and sorry for the long delay! There are still a few issues with the docstrings, but I'll just polish that myself before merging.

@@ -163,6 +163,8 @@
LocalResponseNormalization2DLayer
BatchNormLayer
batch_norm
InstanceNormLayer
instance_norm
Member

This doesn't match the code any more. Should have StandardizationLayer, instance_norm and layer_norm instead.

:members:

.. autofunction:: instance_norm

Member

This doesn't match the code any more. Should have StandardizationLayer, instance_norm and layer_norm instead.

@f0k f0k self-assigned this Jul 31, 2017
@f0k f0k mentioned this pull request Feb 19, 2018
@f0k f0k merged commit a61b76f into Lasagne:master Jun 10, 2018

f0k commented Jun 10, 2018

Finally finished and merged -- thanks again and sorry for the long pause.
