This repository has been archived by the owner on Nov 17, 2023. It is now read-only.

[MXNET-139] Tutorial for mixed precision training with float16 #10391

Merged
merged 49 commits into from
Jun 27, 2018

Conversation

@rahul003 (Member) commented Apr 4, 2018

Description

Adds a FAQ page on mixed precision training with float16. Explains usage for both the Gluon and Symbolic APIs, and discusses tips to improve performance and accuracy when using mixed precision.

https://issues.apache.org/jira/browse/MXNET-139

Checklist

Essentials

Please feel free to remove inapplicable items for your PR.

  • The PR title starts with [MXNET-$JIRA_ID], where $JIRA_ID refers to the relevant JIRA issue created (except PRs with tiny changes)
  • Changes are complete (i.e. I finished coding on this PR)
  • All changes have test coverage:
  • Unit tests are added for small changes to verify correctness (e.g. adding a new operator)
  • Nightly tests are added for complicated/long-running ones (e.g. changing distributed kvstore)
  • Build tests will be added for build configuration changes (e.g. adding a new build option with NCCL)
  • Code is well-documented:
  • For user-facing API changes, API doc string has been updated.
  • For new C++ functions in header files, their functionalities and arguments are documented.
  • For new examples, README.md is added to explain what the example does, the source of the dataset, expected performance on the test set, and a reference to the original paper if applicable
  • Check the API doc at http://mxnet-ci-doc.s3-accelerate.dualstack.amazonaws.com/PR-$PR_ID/$BUILD_ID/index.html
  • To the best of my knowledge, examples are either not affected by this change, or have been fixed to be compatible with this change

Changes

  • Adds tutorial
  • Updates symbolic examples for training and fine-tuning
  • Updates Gluon example

@rahul003 rahul003 requested a review from szha as a code owner April 4, 2018 06:10
@rahul003 rahul003 changed the title [MXNET-139] Tutorial for using float16 [MXNET-139] Tutorial for mixed precision training with float16 Apr 4, 2018
Note the accuracy you observe above. You can change DTYPE above to float32 and rerun to see the speedup gained by using float16.


### Finetuning
Member

Should be "Fine-tuning" ?
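
(Regarding the DTYPE switch mentioned in the excerpt above: the following is a minimal, illustrative sketch of timing a forward pass in either precision. The model choice, batch shape, and timing harness are assumptions for illustration, not the tutorial's exact code.)

```python
import time
import mxnet as mx
from mxnet.gluon.model_zoo import vision

DTYPE = 'float16'  # change to 'float32' to compare against full precision

ctx = mx.gpu(0)
net = vision.resnet50_v1(classes=101)
net.initialize(mx.init.Xavier(), ctx=ctx)
net.cast(DTYPE)        # cast all parameters to the chosen precision
net.hybridize()

x = mx.nd.random.uniform(shape=(64, 3, 224, 224), ctx=ctx).astype(DTYPE)
net(x).wait_to_read()  # warm-up pass
start = time.time()
net(x).wait_to_read()
print('forward pass in %s took %.3f s' % (DTYPE, time.time() - start))
```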

@@ -0,0 +1,280 @@
# Mixed precision training using float16

The computational resources required for training deep neural networks has been increasing of late because of complexity of the architectures and size of models. Mixed precision training allows us to reduces the resources required by using lower precision arithmetic. In this approach we train using 16 bit floating points (half precision) while using 32 bit floating points (single precision) for output buffers of float16 computation. This combination of single and half precision gives rise to the name Mixed precision. It allows us to achieve the same accuracy as training with single precision, while decreasing the required memory and training or inference time.
Member

resources required for training deep neural networks has ->
resources required for training deep neural networks have

Member

gives rise to the name Mixed precision: why capital M?


The float16 data type, is a 16 bit floating point representation according to the IEEE 754 standard. It has a dynamic range where the precision can go from 0.0000000596046 (highest, for values closest to 0) to 32 (lowest, for values in the range 32768-65536). Despite the decreased precision when compared to single precision (float32), float16 computation can be much faster on supported hardware. The motivation for using float16 for deep learning comes from the idea that deep neural network architectures have natural resilience to errors due to backpropagation. Half precision is typically sufficient for training neural networks. This means that on hardware with specialized support for float16 computation we can greatly improve the speed of training and inference. This speedup results from faster matrix multiplication, saving on memory bandwidth and reduced communication costs. It also reduces the size of the model, allowing us to train larger models and use larger batch sizes.

The Volta range of Graphics Processing Units (GPUs) from Nvidia has Tensor Cores which perform efficient float16 computation. A tensor core allows accumulation of half precision products into single or half precision outputs. For the rest of this tutorial we assume that we are working with Nvidia's Tensor Cores on a Volta GPU.
Member

Put a reference link to Tensor Cores?

2. Cast the data to float16 to match the input type expected by the blocks if necessary.

### Training
Let us look at an example of training a Resnet50 model on the Caltech101 dataset with float16.
Member

Add a reference link to the dataset description?
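
(To make the two steps in the excerpt concrete, cast the network and cast the data, here is a rough Gluon sketch under the assumption of a model-zoo ResNet50; the variable and function names are illustrative.)

```python
import mxnet as mx
from mxnet.gluon.model_zoo import vision

ctx = mx.gpu(0)

# Step 1: build the network and cast all of its parameters to float16
net = vision.resnet50_v1(classes=101)
net.initialize(mx.init.Xavier(magnitude=2.24), ctx=ctx)
net.cast('float16')

# Step 2: cast each batch of data to float16 before the forward pass
def forward_backward(net, data, label, loss_fn):
    data = data.astype('float16', copy=False).as_in_context(ctx)
    label = label.as_in_context(ctx)
    with mx.autograd.record():
        output = net(data)
        loss = loss_fn(output, label)
    loss.backward()
    return loss
```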

```python
from mxnet.gluon.data.vision.datasets import ImageFolderDataset
```

Let us start by fetching the Caltech101 dataset and extracting it.
Member

Could you add a reminder of how big the dataset is (num images, number of GBs)
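
(For readers following along, a rough sketch of the download-and-load step; the URL and folder names come from the snippet quoted later in this thread, while the transform and DataLoader settings are illustrative choices.)

```python
import os
import tarfile
import mxnet as mx
from mxnet import gluon
from mxnet.gluon.data.vision.datasets import ImageFolderDataset

url = "https://s3.us-east-2.amazonaws.com/mxnet-public/101_ObjectCategories.tar.gz"
data_folder = "data"
archive_path = os.path.join(data_folder, "101_ObjectCategories.tar.gz")

# download the archive (the helper creates data_folder if needed) and extract it
mx.gluon.utils.download(url, path=archive_path)
with tarfile.open(archive_path) as tar:
    tar.extractall(path=data_folder)

def transform(image, label):
    # resize to the 224x224 input ResNet expects, move channels first, scale to [0, 1]
    image = mx.image.imresize(image, 224, 224)
    image = mx.nd.transpose(image, (2, 0, 1)).astype('float32') / 255
    return image, label

dataset = ImageFolderDataset(os.path.join(data_folder, "101_ObjectCategories"),
                             transform=transform)
train_data = gluon.data.DataLoader(dataset, batch_size=64, shuffle=True, num_workers=4)
```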

- Volta range of Nvidia GPUs
- CUDA 9 or higher
- cuDNN v7 or higher

Member

Could you start with an overview that the tutorial covers both Gluon and Symbolic APIs?

```python
return net
```

It is preferable to use **multi_precision mode of optimizer** when training in float16. This mode of optimizer maintains the weights in float32 even when the training is in float16. This helps increase precision of the weights and leads to faster convergence for some networks. (Further discussion on this towards the end.)
Member

Do all optimizers support this mode?

Member Author

SGD supports this natively, as in there's a special kernel for that. Other optimizers support this by making a copy in Python, which can be slightly slower.
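
(For reference, a minimal sketch of enabling this through a Gluon Trainer; the tiny stand-in network and hyperparameter values are placeholders.)

```python
import mxnet as mx
from mxnet import gluon

ctx = mx.gpu(0)
net = gluon.nn.Dense(10)      # stand-in for the real network
net.initialize(ctx=ctx)
net.cast('float16')

# multi_precision asks the optimizer to keep a float32 master copy of the float16 weights
trainer = gluon.Trainer(net.collect_params(), 'sgd',
                        {'learning_rate': 0.1, 'momentum': 0.9,
                         'multi_precision': True})
```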

@mli (Member) commented Apr 5, 2018

I feel this tutorial is a little bit too complex for readers to follow.

  1. Show how to write a simple dense and conv layer to use fp16, and show performance numbers.
  2. Show an end-to-end example based on another tutorial, such as http://gluon-crash-course.mxnet.io/ (it's not a good example because it's too simplified); mention to users that they should read it first, and this tutorial will only highlight the differences for using fp16.
  3. Show performance numbers and training curves.

@aaronmarkham (Contributor) left a comment

Style:
Use the pattern of 'you' instead of 'we'. It's ok to say we prepared this tutorial, but the steps and the prerequisites are for 'you'. Please make this update throughout.

Otherwise, a few other suggestions inline.


The computational resources required for training deep neural networks has been increasing of late because of complexity of the architectures and size of models. Mixed precision training allows us to reduces the resources required by using lower precision arithmetic. In this approach we train using 16 bit floating points (half precision) while using 32 bit floating points (single precision) for output buffers of float16 computation. This combination of single and half precision gives rise to the name Mixed precision. It allows us to achieve the same accuracy as training with single precision, while decreasing the required memory and training or inference time.

The float16 data type, is a 16 bit floating point representation according to the IEEE 754 standard. It has a dynamic range where the precision can go from 0.0000000596046 (highest, for values closest to 0) to 32 (lowest, for values in the range 32768-65536). Despite the decreased precision when compared to single precision (float32), float16 computation can be much faster on supported hardware. The motivation for using float16 for deep learning comes from the idea that deep neural network architectures have natural resilience to errors due to backpropagation. Half precision is typically sufficient for training neural networks. This means that on hardware with specialized support for float16 computation we can greatly improve the speed of training and inference. This speedup results from faster matrix multiplication, saving on memory bandwidth and reduced communication costs. It also reduces the size of the model, allowing us to train larger models and use larger batch sizes.
Contributor

no comma after type


The Volta range of Graphics Processing Units (GPUs) from Nvidia has Tensor Cores which perform efficient float16 computation. A tensor core allows accumulation of half precision products into single or half precision outputs. For the rest of this tutorial we assume that we are working with Nvidia's Tensor Cores on a Volta GPU.

In this tutorial we will walk through how one can train deep learning neural networks with mixed precision on supported hardware. We will first see how to use float16 and then some techniques on achieving good performance and accuracy.
Contributor

I'd move this to the top as the main intro, then use ## Background for the rest.


The Volta range of Graphics Processing Units (GPUs) from Nvidia has Tensor Cores which perform efficient float16 computation. A tensor core allows accumulation of half precision products into single or half precision outputs. For the rest of this tutorial we assume that we are working with Nvidia's Tensor Cores on a Volta GPU.

In this tutorial we will walk through how one can train deep learning neural networks with mixed precision on supported hardware. We will first see how to use float16 and then some techniques on achieving good performance and accuracy.
Contributor

In this tutorial you will learn how you can train...
You will first see how...

Please continue on in this pattern.

```python
test_data = gluon.data.DataLoader(dataset_test, BATCH_SIZE, shuffle=False, num_workers=NUM_WORKERS)
```

Next, we'll define softmax cross entropy as our loss, accuracy as our metric, and the context on which to run our training jobs. It is set by default to gpu. Please note that not all operators support float16 on CPU, and float16 on CPU is slower than float32.
Contributor

GPU or gpu?
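
(For reference, a minimal sketch of those three definitions; the variable names are illustrative.)

```python
import mxnet as mx
from mxnet import gluon

# run on the GPU by default; float16 support on CPU is limited and slow
ctx = mx.gpu(0)

loss_fn = gluon.loss.SoftmaxCrossEntropyLoss()
metric = mx.metric.Accuracy()
```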


### Finetuning

You can also finetune in float16 a model which was originally trained in float32. The section of the code which builds the network would now look as follows. We first fetch the pretrained resnet50_v2 model from model zoo. This was trained using Imagenet data, so we need to pass classes as 1000 when fetching the pretrained model. Then we create our new network for Caltech101 by passing the number of classes as 101. We then cast it to `float16` so that all of its parameters are cast to `float16`.
Contributor

the model zoo.
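
(A rough sketch of the fine-tuning setup described in that paragraph; the line that reuses the pretrained feature extractor, `net.features = pretrained_net.features`, is an assumption about how the example wires the two networks together.)

```python
import mxnet as mx
from mxnet.gluon.model_zoo import vision as models

ctx = mx.gpu(0)
dtype = 'float16'

# pretrained Imagenet model, so it must be fetched with classes=1000
pretrained_net = models.get_model('resnet50_v2', ctx=ctx, pretrained=True, classes=1000)

# new network for Caltech101 with 101 output classes
net = models.get_model('resnet50_v2', ctx=ctx, classes=101)
net.initialize(mx.init.Xavier(), ctx=ctx)
net.features = pretrained_net.features   # reuse pretrained features (assumed wiring)
net.cast(dtype)                          # cast all parameters to float16
```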


There are a few examples of building such networks which can handle float16 input in [examples/image-classification/symbols/](https://github.com/apache/incubator-mxnet/tree/master/example/image-classification/symbols). Specifically you could look at the [resnet](https://github.com/apache/incubator-mxnet/blob/master/example/image-classification/symbols/resnet.py) example.

An illustration of the relevant section of the code is below.
Contributor

Try to avoid above below left and right. Use follows or as follows. Or previously. This supports different reading modes.
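
(The common pattern in those symbolic examples is to cast the input to float16 near the top of the network and cast back to float32 just before the softmax. A condensed, illustrative sketch follows; the tiny network body is not the actual Resnet code.)

```python
import numpy as np
import mxnet as mx

data = mx.sym.Variable('data')
# cast the input so that convolutions and fully connected layers run in float16
data = mx.sym.Cast(data=data, dtype=np.float16)

body = mx.sym.Convolution(data=data, num_filter=64, kernel=(7, 7),
                          stride=(2, 2), pad=(3, 3))
body = mx.sym.Activation(data=body, act_type='relu')
body = mx.sym.Pooling(data=body, global_pool=True, pool_type='avg', kernel=(7, 7))
fc = mx.sym.FullyConnected(data=mx.sym.Flatten(data=body), num_hidden=101)

# cast back to float32 before the softmax for numerical stability
fc = mx.sym.Cast(data=fc, dtype=np.float32)
out = mx.sym.SoftmaxOutput(data=fc, name='softmax')
```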

## Things to keep in mind

### For performance
1. Nvidia Tensor core essentially perform the computation D = A * B + C, where A and B are half precision matrices, while C and D could be either half precision or full precision. The tensor cores are most efficient when dimensions of these matrices are multiples of 8. This means that Tensor Cores can not be used in all cases for fast float16 computation. When training models like Resnet50 on the Cifar10 dataset, the tensors involved are sometimes smaller, and tensor cores can not always be used. The computation in that case falls back to slower algorithms and using float16 turns out to be slower than float32 on a single GPU. Note that when using multiple GPUs, using float16 can still be faster than float32 because of reduction in communication costs.
Contributor

cores perform or core performs

Use consistent CapitaliZation throughout.
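
(As a quick illustration of that rule of thumb, a hypothetical helper, not part of MXNet, for rounding layer dimensions up to a multiple of 8.)

```python
def round_up_to_multiple_of_8(n):
    """Round a layer dimension up so Tensor Cores can be used efficiently."""
    return ((n + 7) // 8) * 8

print(round_up_to_multiple_of_8(1000))  # 1000, already a multiple of 8
print(round_up_to_multiple_of_8(101))   # 104
```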

@anirudh2290 (Member)

ping @rahul003

@rahul003 (Member Author)

Thanks guys for your comments. I'll address them soon and update the PR

1. [Training with Mixed Precision User Guide](https://docs.nvidia.com/deeplearning/sdk/mixed-precision-training/index.html)
2. [Mixed Precision Training at ICLR 2018](https://arxiv.org/pdf/1710.03740.pdf)
3. [Mixed-Precision Training of Deep Neural Networks](https://devblogs.nvidia.com/mixed-precision-training-deep-neural-networks/)

Contributor

Can you add this <!-- INSERT SOURCE DOWNLOAD BUTTONS --> at the end of your .md file to enable the notebook download of your tutorial? Thanks!

```python
url = "https://s3.us-east-2.amazonaws.com/mxnet-public/101_ObjectCategories.tar.gz"
dataset_name = "101_ObjectCategories"
data_folder = "data"
if not os.path.isdir(data_folder):
```
@ThomasDelteil (Contributor) Apr 12, 2018

these lines are unnecessary, as mx.gluon.utils.download will create the directory if it does not exist 😃

```python
pretrained_net = models.get_model(name='resnet50_v2', ctx=ctx, pretrained=True, classes=1000)
pretrained_net.hybridize()
pretrained_net.cast(dtype)
```

@ThomasDelteil (Contributor) Apr 12, 2018

a simpler way of fine-tuning a model from the model zoo is to use `pretrained_net.output = gluon.nn.Dense(101)` and then initialize it.

see https://github.com/piiswrong/mxnet/blob/eacac76eff3e5e3d200b771422eadf6f1a2af08e/docs/tutorials/gluon/naming.md#replacing-blocks-from-networks-and-fine-tuning
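
(A sketch of that suggestion, for context; the initializer and context here are illustrative choices.)

```python
import mxnet as mx
from mxnet import gluon
from mxnet.gluon.model_zoo import vision

ctx = mx.gpu(0)

net = vision.resnet50_v2(pretrained=True, ctx=ctx)
# replace only the classifier head with a fresh 101-class Dense layer
net.output = gluon.nn.Dense(101)
net.output.initialize(mx.init.Xavier(), ctx=ctx)
net.cast('float16')
```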

```python
train(net, dtype=DTYPE, num_epochs=25)
```

We can confirm above that the pretrained model helps achieve much higher accuracy of about 0.97 in the same number of epochs.
Contributor

Sorry, much higher accuracy than what?
I think float16 helps you train much faster than float32, but I didn't know it would give you a higher accuracy for a given number of epochs?

@rahul003 (Member Author) commented Jun 8, 2018

On second thought, as per the feedback above, I changed this from a runnable tutorial style to a document focusing on the changes needed to switch to mixed precision. I updated an example in the source to include the example I had in this tutorial, and put a command to run it in this document.

I've set up two runs of Resnet50 on Imagenet with float32 and float16, whose plots I will add tomorrow.

Also added a link to the video tutorial we have from the MXNet Meetup.

I'm hesitant to share raw performance numbers as those would soon become outdated as we improve. I could mention a rough speedup factor instead. What do you guys think?

@rahul003 (Member Author) commented Jun 13, 2018

@Reviewers, please check the tutorial now. I think we can merge it and keep updating it if you have suggestions for other things to add. Even as-is, it would be very useful.

@eric-haibin-lin (Member) left a comment

Great work! Pls resolve conflicts

@rahul003 (Member Author)

can we merge this?

@eric-haibin-lin eric-haibin-lin merged commit 99af59b into apache:master Jun 27, 2018
```python
if d == mx.cpu() and dtype == 'float16':
    #float16 is not supported on CPU
    continue
elif net in ['inception-bn', 'alexnet'] and dt == 'float16':
```
@xinyu-intel (Contributor) Jun 28, 2018

The benchmark crashes here since dt is not defined.

Member Author

Thanks for letting me know, fixing it here #11533

XinYao1994 pushed a commit to XinYao1994/incubator-mxnet that referenced this pull request Aug 29, 2018
…e#10391)

* dtype for data, working fp16

* test dtype fp16 gluon

* add gluon fine tuning code

* data iter caltech

* caltech iter

* working finetuning for fp16, but is it using  pretrained params

* benchmark fp16

* add wip tutorials

* working notebook fp16

* changes to symbolic examples

* changes to symbolic examples

* add fp16 notebook

* remove extra files

* remove output of notebook

* update md file

* remove from faq

* WIP address feedback

* gluon example

* add top5 back

* clean up gluon example

* address feedback

* address comments

* move tutorial to faq

* Add training curves

* formatting

* update image

* trigger ci

8 participants