
Implementation of Instance Normalization #808

Merged
merged 2 commits into from
Jun 10, 2018

Conversation

SimonKohl

Added a layer called 'InstanceNormLayer' that implements instance normalization of its inputs, following 'Instance Normalization: The Missing Ingredient for Fast Stylization' (https://arxiv.org/pdf/1607.08022.pdf).

While the existing BatchNormLayer is able to implement instance normalization by setting the axes, beta and gamma arguments accordingly, it still requires a fixed batch size as learned parameters need to be initialized.

The proposed InstanceNormLayer does without that requirement, allowing for more flexibility.
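For reference, the operation the proposed layer performs can be sketched in a few lines of NumPy (the actual layer is written against Theano; the function name and 4D shape here are illustrative):

```python
import numpy as np

def instance_norm(x, epsilon=1e-4):
    # x: (batch, channels, rows, cols); normalize each (example, channel)
    # slice over the spatial axes only -- no statistics cross the batch axis,
    # so no fixed batch size is needed and no parameters are learned.
    mean = x.mean(axis=(2, 3), keepdims=True)
    var = x.var(axis=(2, 3), keepdims=True)
    return (x - mean) / np.sqrt(var + epsilon)

x = np.random.randn(5, 3, 8, 8)
y = instance_norm(x)
# each (example, channel) slice now has roughly zero mean and unit variance
```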


f0k commented Mar 2, 2017

Hey, thank you for the PR! This technique has also been termed "Layer normalization" before, see #736, #738, #762. In that case, it also contains the learnable scale and shift as in batch normalization.

While the existing BatchNormLayer is able to implement instance normalization by setting the axes, beta and gamma arguments accordingly, it still requires a fixed batch size as learned parameters need to be initialized.

Yes, that's a bummer. Another disadvantage of using the batch normalization layer for this is that you'll need to pass extra arguments to lasagne.layers.get_output to disable learning and using the exponential running average statistics.
As argued in #736 (comment), I'd prefer to have a NormLayer as a new base class for BatchNormLayer, such that the BatchNormLayer only adds the running average statistics on top of the NormLayer implementation and changes the default axes. This would reduce some code duplication. I like the term "instance normalization", though, we could still use instance_norm for the convenience function. (I think I wouldn't use InstanceNormLayer for the class, because depending on the axes, it could do something else, and it would sound a bit weird to derive batch normalization from instance normalization. On the other hand, depending on the axes, BatchNormLayer also doesn't necessarily normalize a batch. I'm open for other arguments.)

Would you like to update your PR such that NormLayer includes mean and gamma and becomes a base class for BatchNormLayer?


SimonKohl commented Mar 2, 2017

I'd be happy to. Wouldn't mind some guidance in the process though.

You proposed a base class NormLayer that the deriving class BatchNormLayer would alter w.r.t. default axes and the running average statistics.

  1. What about the initialization of gamma and beta? It would have to differ between BatchNormLayer and a prospective InstanceNormLayer (in order to drop the requirements on batch size, see above). With InstanceNormLayer becoming a base NormLayer, would we just overwrite the initializations of gamma and beta in BatchNormLayer?
  2. The get_output_for() method of NormLayer would also be basically overwritten by BatchNormLayer?


f0k commented Mar 2, 2017

Wouldn't mind some guidance in the process though.

Sure! :)

What about the initialization of gamma and beta?

Ah. There was a catch. We want gamma and beta to be shared across the batch, but not normalize data across the batch. This means the interface is actually very different.

Okay, I think we can forget about the common base class and go with an InstanceNormLayer.

Now the question is what to do about layer normalization. Checking their paper again (https://arxiv.org/abs/1607.06450), they normalize across all but the batch axis (i.e., across all features) and learn a scalar shift and scale. I'm leaning towards omitting this. From the paper, layer normalization actually gives each neuron its own scale and shift, so it's not tied to the normalization axes as in batch normalization. It's probably easier to just use a sequence of instance normalization, bias and scale layers for this.
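The distinction between the two schemes comes down to which axes the statistics are taken over; a hedged NumPy sketch (axis conventions assumed from the discussion, not taken from the Lasagne code):

```python
import numpy as np

x = np.random.randn(4, 3, 8, 8)  # (batch, channels, rows, cols)

# Instance normalization: statistics per example *and* per channel
inst = (x - x.mean(axis=(2, 3), keepdims=True)) / x.std(axis=(2, 3), keepdims=True)

# Layer normalization: statistics per example, across all feature axes
layer = (x - x.mean(axis=(1, 2, 3), keepdims=True)) / x.std(axis=(1, 2, 3), keepdims=True)

# Neither variant uses the batch axis, so the same expression can be
# applied unchanged during training and testing.
```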

Great, I'll just review your current PR now!

@f0k f0k left a comment

A few comments, and a question about the bias.

In addition, this needs to be added to the docs (layers.rst and layers/normalization.rst), and it needs tests.

@@ -49,6 +49,8 @@
"LocalResponseNormalization2DLayer",
"BatchNormLayer",
"batch_norm",
"InstanceNormLayer",
"instance_norm"
Member

Please add a comma at the end. This ensures we can add lines at the end without changing the previous ones. (Just like you did here.)

lasagne.layers.InstanceNormLayer(incoming, axes='auto', epsilon=1e-4,
**kwargs)
Instance Normalization
This layer implements instance normalization of its inputs, following [1]_:
Member

You seem to be missing a newline here. And I'm not sure if [1] needs to be so prominent (but we can leave it in the references). Is there a way to find out who used this first, for proper credit assignment?

class InstanceNormLayer(Layer):
"""
lasagne.layers.InstanceNormLayer(incoming, axes='auto', epsilon=1e-4,
**kwargs)
Member

You can omit this here. It's only needed if the default arguments include object instances that don't have a nice string representation.

That is, each example (instance) in the input is normalized to zero mean
and unit variance. The crucial part is that the mean and variance are
NOT computed across the batch dimension, i.e., not over examples,
but per example. During both training and testing, :math:`\\mu_i` and
Member

Not sure if we should call this "crucial", since that's what you'd expect (if you don't know batch normalization). Maybe weaken this a bit like:
"That is, each input example (instance) is normalized to zero mean and unit variance. In contrast to batch normalization, the mean and variance is usually taken per example, and not across examples, so the same operation can be applied during training and testing."

:func:`lasagne.layers.get_output()` when building the output expression
of any network containing this layer.
During training, [1]_ normalize each instance in an input mini-batch
by its statistics.
Member

The last two sentences can be omitted. There is no behaviour to be controlled, and I guess there would be earlier references than [1].


self.epsilon = epsilon

def get_output_for(self, input, **kwargs):
Member

All this code can be simplified to:

mean = input.mean(self.axes, keepdims=True)
std = T.sqrt(input.var(self.axes, keepdims=True) + self.epsilon)
return (input - mean) / std
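As a sanity check, the suggested three-line computation can be transcribed to NumPy (with self.axes and self.epsilon replaced by concrete illustrative values) and shown to produce zero mean and unit variance over the chosen axes:

```python
import numpy as np

def normalize(input, axes=(2, 3), epsilon=1e-4):
    # NumPy transcription of the suggested Theano snippet
    mean = input.mean(axes, keepdims=True)
    std = np.sqrt(input.var(axes, keepdims=True) + epsilon)
    return (input - mean) / std

out = normalize(np.random.randn(2, 3, 16, 16))
```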

will steal the layer's nonlinearity if there is one (effectively
introducing the normalization right before the nonlinearity), remove
the layer's bias if there is one (because it would be redundant), and add
a :class:`InstanceNormLayer` and :class:`NonlinearityLayer` on top.
Member

Hmm, we need to have a word on that bias. Currently you remove it without giving the network a chance to introduce an offset afterwards. What do the instance normalization authors do in their paper?

Author

You're right, this is an important issue to converge on. The instance normalization paper does not mention the bias whatsoever, suggesting that they don't remove it and plainly apply instance norm on top of the layer's output. As you mentioned earlier, 'layer normalization' (e.g. https://arxiv.org/pdf/1607.06450.pdf) is conceptually slightly different from instance norm in that it retains a gain and bias (gamma and beta in BatchNormLayer). We could delete the bias of the preceding (e.g. conv) layer and retain gain and bias as learnable (basically scalar) parameters in InstanceNormLayer?

@SimonKohl SimonKohl Mar 3, 2017

Ok, so my latest commit is a suggestion to cover both ideas, 'layer normalization' and 'instance normalization'. I added beta and gamma as learnable parameters, which are broadcast across the batch size as well as the trailing dimensions beyond the 2nd dimension. This way 'layer normalization' as in https://arxiv.org/pdf/1607.06450.pdf should be available, as well as 'instance normalization' as in https://arxiv.org/pdf/1607.08022.pdf, when setting beta and gamma to None.
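Under this scheme, gamma and beta would have one entry per channel and be broadcast over the batch axis and all trailing axes; a NumPy sketch of the shapes involved (the parameter handling is assumed from the description above, not taken from the commit):

```python
import numpy as np

x = np.random.randn(4, 3, 8, 8)   # (batch, channels, rows, cols)
gamma = np.ones(3)                # one learnable scale per channel (2nd dim)
beta = np.zeros(3)                # one learnable shift per channel

normed = (x - x.mean(axis=(2, 3), keepdims=True)) / x.std(axis=(2, 3), keepdims=True)
# reshape to (1, C, 1, 1) so the parameters broadcast over the batch
# axis and the trailing spatial axes
out = gamma.reshape(1, -1, 1, 1) * normed + beta.reshape(1, -1, 1, 1)
# setting gamma/beta to None (i.e. skipping this step) recovers plain instance norm
```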

Author

Ah, there's one issue remaining. If beta is set to None, we shouldn't delete the incoming layer's bias. Fixed this in instance_norm() convenience function.

Member

The instance normalization paper does not mention the bias whatsoever, suggesting that they don't remove it and plainly apply instance norm on top of the layers output.

But that would subtract the bias again. I doubt that's a good idea. Did they provide any code?

I added beta and gamma as learnable parameters, which are broadcast across the batch-size as well as the trailing dimensions beyond the 2nd dimension.

This fits the layer normalization paper, but it's a bit sad we can't control the axes then. I'm torn between leaving it as it is, or not providing beta/gamma at all and offloading this to a ScaleLayer and BiasLayer on top (which could be added automatically by the convenience function if asked for). This would simplify the layer and tests. For the batch normalization layer it was good to have everything in a single layer, because this allows to use cuDNN's batch normalization primitives. For instance normalization, I doubt this will play a role.

Author

Hi Jan, sorry I was a bit absorbed with other things these past days. Concerning your first comment, that would mean we'd always want to have a learnable bias, just the gamma factor would optionally be learnable, right? With respect to whether or not to include gamma/beta in the way it is currently done: Did you converge on a preferred implementation? If so, I could go ahead with the implementation and the corresponding tests.

introducing the normalization right before the nonlinearity), remove
the layer's bias if there is one (because it would be redundant), and add
a :class:`InstanceNormLayer` and :class:`NonlinearityLayer` on top.
Parameters
Member

You're missing a newline before this.

**kwargs
Any additional keyword arguments are passed on to the
:class:`InstanceNormLayer` constructor.
Returns
Member

Missing newline before.

>>> from lasagne.nonlinearities import rectify
>>> l1 = InputLayer((10, 3, 28, 28))
>>> l2 = instance_norm(Conv2DLayer(l1, num_filters=64, filter_size=3,
nonlinearity=rectify))
Member

Needs continuation characters ... at the beginning of the line.

@f0k f0k left a comment

Some more comments. I'm not sure what's the best way for beta and gamma... it would simplify the code, documentation and tests to omit them from the layer and add learn_bias and learn_scale arguments to the instance_norm convenience function that would add a ScaleLayer and BiasLayer if needed.

**kwargs
Any additional keyword arguments are passed to the :class:`Layer`
superclass.
Notes
Member

There needs to be an empty line in front of every section header (Parameters, Notes, See also, References), otherwise the documentation will not render correctly.

The axis or axes to normalize over. If ``'auto'`` (the default),
normalize over all axes except for the first two:
this will normalize over all spatial dimensions for
convolutional layers.
Member

We might want to be explicit that this does nothing for a dense layer: "over all spatial dimensions for convolutional layers, and do nothing for dense layers."
And then we might want to add another shortcut string to do layer normalization, i.e., normalize over all axes except the first. I'm not sure about the naming, though. Maybe we shouldn't have auto, but spatial for the current default and features for including the first dimension. What do you think?

Member

With the new StandardizationLayer, it might make sense to have 'auto' normalize over the second axis for 2D input, and over all but the first two axes for 3D+ input. This would do something useful for dense layers, recurrent layers and convolutional layers.
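The proposed 'auto' rule can be sketched as a small helper (an illustration of the rule as described, not the Lasagne code itself):

```python
def auto_axes(ndim):
    # 'auto': the second axis for 2D input (dense layers), and all but the
    # first two axes for 3D+ input (recurrent and convolutional layers)
    if ndim == 2:
        return (1,)
    return tuple(range(2, ndim))

print(auto_axes(2))  # (1,)
print(auto_axes(4))  # (2, 3)
```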

Author

Sounds good. It's implemented in the latest commit.

convolutional layers.
epsilon : scalar
Small constant :math:`\\epsilon` added to the variance before taking
the square root and dividing by it, to avoid numerical problems
Member

beta and gamma need to be described here if we want them in the layer.



class InstanceNormLayer(Layer):
"""
Member

With beta and gamma, you'll need to give the signature here (as in BatchNormLayer). Only without beta and gamma we can just rely on the automatically generated signature for the rendered documentation.


:class:`DenseLayer`, or :class:`Conv2DLayer`) and its nonlinearity. The
convenience functions :func:`instance_norm` :func:`layer_norm`
modify an existing layer to insert instance normalization or
layer normalization in front of its nonlinearity.
Member

I think this doesn't apply here. For a generic standardization layer, we shouldn't state any rules.

layer_norm : Convenience function to apply layer normalization to a layer
References

----------
Member

You messed up something with the References title here :)

https://arxiv.org/pdf/1607.08022.pdf.

.. [2] Ba, J., Kiros, J., & Hinton, G. (2016):
Layer normalization. arXiv preprint arXiv:1607.06450.
Member

Let's unify this and give the arxiv link in both cases, but using abs instead of pdf.


def instance_norm(layer, learn_scale=True, learn_bias=True, **kwargs):
"""
Apply instance normalization to an existing layer. This is a convenience
Member

The indentation is wrong (4 characters too much).

irreversibly modified as specified above
**kwargs
Any additional keyword arguments are passed on to the
:class:`StandardizationLayer` constructor.
Member

learn_scale and learn_bias need to be described.

@f0k f0k left a comment

Looks good and tests pass, thanks and sorry for the long delay! There are still a few issues with the docstrings, but I'll just polish that myself before merging.

@@ -163,6 +163,8 @@
LocalResponseNormalization2DLayer
BatchNormLayer
batch_norm
InstanceNormLayer
instance_norm
Member

This doesn't match the code any more. Should have StandardizationLayer, instance_norm and layer_norm instead.

:members:

.. autofunction:: instance_norm

Member

This doesn't match the code any more. Should have StandardizationLayer, instance_norm and layer_norm instead.

@f0k f0k self-assigned this Jul 31, 2017
@f0k f0k mentioned this pull request Feb 19, 2018
@f0k f0k merged commit a61b76f into Lasagne:master Jun 10, 2018

f0k commented Jun 10, 2018

Finally finished and merged -- thanks again and sorry for the long pause.
