This repository has been archived by the owner on Nov 17, 2023. It is now read-only.

Add bfloat16 floating-point format support based on AMP #17265

Merged
merged 61 commits into apache:master on Feb 16, 2020

Conversation

rongzha1
Contributor

Description

Bfloat16 is widely used in deep learning, especially for training, to achieve better performance.

This PR adds bf16 support based on the MXNet AMP (Automatic Mixed Precision) module.
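For context, here is a minimal sketch of the intended user-facing workflow. The entry point and arguments below (amp.convert_hybrid_block, target_dtype="bfloat16", ctx, cast_optional_params) are assumptions modeled on the existing fp16 AMP API and may differ from the merged interface.

# Hypothetical usage sketch: convert a pretrained Gluon model to bfloat16
# for CPU inference via AMP (API names assumed, not confirmed by this PR).
import mxnet as mx
from mxnet.contrib import amp
from mxnet.gluon.model_zoo import vision

net = vision.resnet50_v1(pretrained=True)
net.hybridize()

# Assumed: convert_hybrid_block accepts target_dtype="bfloat16" after this change.
bf16_net = amp.convert_hybrid_block(net, target_dtype="bfloat16",
                                    ctx=mx.cpu(), cast_optional_params=True)

out = bf16_net(mx.nd.random.uniform(shape=(1, 3, 224, 224)))
print(out.shape)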

Checklist

Essentials

Please feel free to remove inapplicable items for your PR.

  • [done] Changes are complete (i.e. I finished coding on this PR)
  • [done] All changes have test coverage:
  • Unit tests are added for small changes to verify correctness (e.g. adding a new operator)
  • Nightly tests are added for complicated/long-running ones (e.g. changing distributed kvstore)
  • Build tests will be added for build configuration changes (e.g. adding a new build option with NCCL)
  • For user-facing API changes, API doc string has been updated.
  • For new C++ functions in header files, their functionalities and arguments are documented.
  • For new examples, README.md is added to explain what the example does, the source of the dataset, expected performance on the test set, and a reference to the original paper if applicable
  • Check the API doc at https://mxnet-ci-doc.s3-accelerate.dualstack.amazonaws.com/PR-$PR_ID/$BUILD_ID/index.html
  • [done] To the best of my knowledge, examples are either not affected by this change, or have been fixed to be compatible with this change

Comments

  • If this change is a backward incompatible change, why must this change be made.
  • Interesting edge cases to note here

This PR has passed the unit tests and pre-CI checks on a local machine.
Unit tests are added for this PR.

@ZhennanQin @ElaineBao @xinyu-intel @TaoLv @PatricZhao

.gitmodules Outdated
@@ -6,7 +6,7 @@
url = https://github.com/dmlc/ps-lite
[submodule "3rdparty/dlpack"]
path = 3rdparty/dlpack
-url = https://github.com/dmlc/dlpack
+url = https://github.com/ElaineBao/dlpack.git
Member

This will need to be changed once the changes are merged into upstream dlpack. Leaving this comment as a reminder ;-).

Contributor

Definitely :) We're working on a PR for the related code in dlpack: dmlc/dlpack#45

Comment on lines +341 to +344
kInt16 = 8,
kUint16 = 9,
kUint32 = 10,
kUint64 = 11,
Member

Why add those additional types here? No operator supports them anyway, right?

Contributor

This is to align the definition with DLPack; otherwise we would have to reserve those numbers. Even though we don't use them currently, there's no harm in adding them.

Comment on lines +29 to +55
MSHADOW_BF16_OPERATOR_TYPE(float, float, OP) \
MSHADOW_BF16_OPERATOR_TYPE(double, double, OP) \
MSHADOW_BF16_OPERATOR_TYPE(float, int8_t, OP) \
MSHADOW_BF16_OPERATOR_TYPE(float, uint8_t, OP) \
MSHADOW_BF16_OPERATOR_TYPE(float, int32_t, OP) \
MSHADOW_BF16_OPERATOR_TYPE(float, uint32_t, OP) \
MSHADOW_BF16_OPERATOR_TYPE(float, int64_t, OP) \
MSHADOW_BF16_OPERATOR_TYPE(float, uint64_t, OP)
Member

Returning float or double, while understandable, is different behavior from what is currently done for the half_t type. Could we discuss this and make them consistent?

Contributor

Sure. Any suggestions here?

Comment on lines +105 to +122
if model_name.find('imagenet1k-resnet-152') != -1:
    excluded_sym_names += ['conv0']
elif model_name.find('imagenet1k-inception-bn') != -1:
    excluded_sym_names += ['conv_1']
elif model_name.find('resnet') != -1 and model_name.find('v1') != -1:
    excluded_sym_names += ['resnetv10_conv0_fwd']
elif model_name.find('resnet') != -1 and model_name.find('v2') != -1:
    excluded_sym_names += ['resnetv20_conv0_fwd']
elif model_name.find('vgg') != -1:
    excluded_sym_names += ['vgg0_conv0_fwd']
elif model_name.find('squeezenet1') != -1:
    excluded_sym_names += ['squeezenet0_conv0_fwd']
elif model_name.find('mobilenet') != -1 and model_name.find('v2') == -1:
    excluded_sym_names += ['mobilenet0_conv0_fwd']
elif model_name.find('mobilenet') != -1 and model_name.find('v2') != -1:
    excluded_sym_names += ['mobilenetv20_conv0_fwd']
elif model_name.find('inceptionv3') != -1:
    excluded_sym_names += ['inception30_conv0_fwd']
Member

Why? Is there an accuracy issue without those exclusions?

Contributor

Not for accuracy, but for performance reasons. This could be removed once more bfloat16 hardware is available.

Contributor

Please add a comment noting that this is a temporary performance workaround; we will convert all conv layers later.
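For reference, a hedged sketch of how such an exclusion list might be passed to the converter; the convert_model signature below (target_dtype, excluded_sym_names, cast_optional_params) mirrors the existing fp16 AMP API and is an assumption, not the merged code.

# Hypothetical sketch: keep the first convolution in fp32 while converting
# the rest of a symbolic model to bfloat16 (signature assumed from fp16 AMP).
import mxnet as mx
from mxnet.contrib import amp

sym, arg_params, aux_params = mx.model.load_checkpoint('imagenet1k-resnet-152', 0)
excluded_sym_names = ['conv0']  # temporary performance workaround, see above

bf16_sym, bf16_args, bf16_aux = amp.convert_model(
    sym, arg_params, aux_params,
    target_dtype='bfloat16',
    excluded_sym_names=excluded_sym_names,
    cast_optional_params=True)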

@@ -43,14 +44,17 @@
from ... import optimizer as opt
from .loss_scaler import LossScaler

bfloat16 = np.dtype([('bfloat16', np.uint16)])
Member

Can we have this dtype accessible (as mx.bfloat16 or something similar)?

Contributor

This is a good topic, and I'd like to have a discussion about it.
Currently, MXNet doesn't have its own type system; it simply uses numpy.dtype. NumPy doesn't natively support bfloat16, so we define bfloat16 as a custom NumPy structured dtype.
Pros: it's compatible with the current design, and isinstance(bfloat16, np.dtype) returns True.
Cons: bfloat16.name doesn't work; we have to use bfloat16.names[0] instead.
Another solution is to create MXNet's own data type system, as PyTorch and TensorFlow do. That is a big API change, so we hope it can be done when upgrading to MXNet 2.0.

For now, we prefer this approach to enable bfloat16 in MXNet 1.x and refactor it in MXNet 2.0.
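A small, self-contained illustration of the trade-off described above, using plain NumPy (nothing MXNet-specific is assumed):

import numpy as np

# bfloat16 defined as a custom structured dtype backed by uint16.
bfloat16 = np.dtype([('bfloat16', np.uint16)])

# Pro: it is a genuine np.dtype, so existing dtype plumbing keeps working.
print(isinstance(bfloat16, np.dtype))                    # True
print(bfloat16 == np.dtype([('bfloat16', np.uint16)]))   # True

# Con: .name does not report 'bfloat16' (structured dtypes report a generic
# void name), so code has to fall back to the field name instead.
print(bfloat16.names[0])                                 # 'bfloat16'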

Comment on lines +292 to +302
if data.dtype == np.dtype([('bfloat16', np.uint16)]):
    assert np.dtype(self.dtype) == data.dtype, \
        "Failed loading Parameter '%s' from saved params: " \
        "dtype incompatible expected %s vs saved %s. " \
        "Set cast_dtype=True to cast the dtype of saved params."%(
            self.name, str(self.dtype), str(data.dtype))
else:
    assert np.dtype(self.dtype).type == data.dtype, \
        "Failed loading Parameter '%s' from saved params: " \
        "dtype incompatible expected %s vs saved %s. " \
        "Set cast_dtype=True to cast the dtype of saved params."%(
            self.name, str(self.dtype), str(data.dtype))
Member

Aren't those 2 codepaths the same?

Contributor Author

* This creates a new NDArray using f32 with the reordered data.
* It doesn't affect the data of the original NDArray.
*/
NDArray Reorder2DefaultFp32() const;
Member

Adding a dtype-specific interface looks very ad hoc.

Contributor Author

OK, will change it to Reorder2DefaultFloatFormat()

@@ -83,6 +84,7 @@
5: np.int8,
6: np.int64,
7: np.bool_,
12: np.dtype([('bfloat16', np.uint16)]),
Member

why 12?

Contributor

This is to align with the TypeFlag values defined in mshadow.
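For context, a minimal sketch of how such a flag-to-dtype table is consumed on the Python side; only the entries visible in the hunk above are reproduced, and the helper name below is hypothetical.

import numpy as np

# Flag values mirror mshadow's TypeFlag enum; 12 is the new bfloat16 slot.
_DTYPE_MX_TO_NP = {
    5: np.int8,
    6: np.int64,
    7: np.bool_,
    12: np.dtype([('bfloat16', np.uint16)]),
}

def flag_to_np_dtype(flag):
    # Hypothetical helper: translate a C-side type flag into a numpy dtype.
    return np.dtype(_DTYPE_MX_TO_NP[flag])

print(flag_to_np_dtype(12))   # the custom bfloat16 structured dtype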

src/engine/naive_engine.cc
Member

@eric-haibin-lin left a comment

For bfloat16 training, which loss scaler is recommended? Do we also need to perform NaN checks?

@ElaineBao
Contributor

For bfloat16 training, which loss scaler is recommended? Do we also need to perform NaN checks?

Bfloat16 has the same dynamic range as float32, since they have the same number of exponent bits, so it can represent gradients directly; it doesn't require loss scaling the way fp16 does.
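A quick numerical illustration of that point in plain NumPy; the bit-level truncation below is only a sketch of bfloat16 rounding (round-toward-zero), not the conversion code used in this PR:

import numpy as np

def f32_to_bf16_bits(x):
    # Keep the top 16 bits of the float32 representation (sign, 8-bit exponent,
    # 7 mantissa bits) - a simplified round-toward-zero bfloat16 conversion.
    f = np.asarray(x, dtype=np.float32)
    return (f.view(np.uint32) >> 16).astype(np.uint16)

def bf16_bits_to_f32(bits):
    # Zero-fill the low 16 bits to expand stored bfloat16 back to float32.
    u = np.asarray(bits, dtype=np.uint32) << 16
    return u.view(np.float32)

grad = 1e-30                                      # a tiny fp32-representable gradient
print(np.float16(grad))                           # 0.0 -> underflows in fp16, hence loss scaling
print(bf16_bits_to_f32(f32_to_bf16_bits(grad)))   # ~1e-30 -> bfloat16 keeps the magnitude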

@@ -988,6 +1034,7 @@ struct minimum {
};
} // namespace red

#ifndef __NVCC__
Member

I don't like it - can we do something similar to what was done for fp16 on CPUs that do not support F16C instructions (i.e. code that runs, but may be slower than code for hardware that natively supports bfloat16)?

Member

You can implement atomicAdd (which seems to be the problem you are facing) with atomicCAS like this: https://github.com/apache/incubator-mxnet/blob/master/src/common/cuda_utils.h#L702-L721

Contributor

@ZhennanQin Jan 16, 2020

We don't have enough background/knowledge to enable bfloat16 on the GPU side, so we probably can't make the change you proposed. Alternatively, any code refactoring on the GPU side is welcome - you may change this as you like in a follow-up PR.

@eric-haibin-lin
Member

@ElaineBao thanks for the explanation

.gitmodules Outdated
@@ -6,7 +6,7 @@
url = https://github.com/dmlc/ps-lite
[submodule "3rdparty/dlpack"]
path = 3rdparty/dlpack
-url = https://github.com/dmlc/dlpack
+url = https://github.com/dmlc/dlpack.git
Contributor

Keep it the same.

Contributor Author

OK

@@ -237,7 +238,7 @@ static bool BackwardFCStorageType(const nnvm::NodeAttrs& attrs,
bool dispatched = false;
if (!dispatched && common::ContainsOnlyStorage(*in_attrs, mxnet::kDefaultStorage)) {
dispatched = storage_type_assign(out_attrs, mxnet::kDefaultStorage,
-                                     dispatch_mode, DispatchMode::kFCompute);
+                                     dispatch_mode, DispatchMode::kFComputeEx);
Contributor

We may need to enable DNNL FC backward in another PR, since there is a known issue.

Contributor Author

Thanks for the reminder.

CPU Performance and Quantization automation moved this from In progress to Review in progress Jan 22, 2020
@pengzhao-intel
Contributor

@ptrendx thanks for your review. Feel free to let me know if you have other concerns; we are going to merge this PR soon.

@TaoLv
Member

TaoLv commented Feb 14, 2020

Hi @leezu, @larroy, this PR passes the ARM builds but always hits the test timeout.
http:https://jenkins.mxnet-ci.amazon-ml.com/blue/organizations/jenkins/mxnet-validation%2Fedge/detail/PR-17265/29/pipeline

We don't have an environment to reproduce this. Could you please take a look, or do you have any suggestions for further debugging?

@@ -0,0 +1,167 @@
/*!
Member

@szha Do we need an Apache license header for this new file?

Member

yes

Contributor Author

Done, thanks.

@zhreshold
Member

@larroy Do you have any idea how to display more logs for the edge tests? They consistently fail at this stage.

CPU Performance and Quantization automation moved this from Review in progress to Reviewer approved Feb 15, 2020
Contributor

@pengzhao-intel left a comment

@TaoLv @ZhennanQin @ciyongch @eric-haibin-lin @szha please take a final look at this PR.

If there are no further comments, I will merge the PR tomorrow :)

@pengzhao-intel
Contributor

I am merging now. If there are any other comments, we can resolve them in a new PR.

@pengzhao-intel merged commit 065d48e into apache:master Feb 16, 2020
CPU Performance and Quantization automation moved this from Reviewer approved to Done Feb 16, 2020
zheyuye pushed a commit to zheyuye/incubator-mxnet that referenced this pull request Feb 19, 2020
* Add Bfloat16

* mshadow support bf16

* rebase bf16 mkldnn1.0

* support bf16 gemm

* resolve fp32 ip bwd bug

* add other bf16 ops

* change func name from fp16 to lp16 (low precision 16), to include bf16

* add amp_cast bf16 support for ndarray

* fix executor copy_params

* add test case for bf16

* remove numpy dtype hook for bf16

* add bf16 type support

* rebase to mxnet master

* add single conv test

* fix symbolic inference

* add dtype check when copy

* add single conv and bn test

* skip fp16 amp_cast test in cpu

* Fix resnet50 first convolution

* Skip first convolution for bfloat16

* support bf16 fallback compute

* recover origin test

* add some bf16 unittests

* fix bf16 bn test, enhance assert_almost_equal_with_err

* using assert_almost_equal_with_err for fallback bn test

* add relu6 bf16 support

* fix lint

* fix subgraph conv with data=0

* mkldnn doesn't support 0 dim tensor

* rm dtype check when copy

* using bf16 tvm

* rm bf16 mnist demo

* use official tvm

* change function name; fix lint error

* fix clang check error:conditional expression is ambiguous; 'float' can be converted to 'mshadow::bfloat::bf16_t' and vice versa

* nvcc compiler build pass

* fix gpu amp cast symbol error

* fix mnist training error

* fix cpp test: Engine.VarVersion error

* workaround cpp failed test mkldnn fc bwd

* to fix mkldnn test_mkldnn_ndarray_slice error

* 1. move some code from np_broadcast_reduce_op_value.cc to np_broadcast_reduce_op_value_part2.cc to pass Win CPU/GPU build (fatal error C1002: compiler is out of heap space in pass 2)
2. rm debug code

* use official dlpack

* rename np_broadcast_reduce_op_value_part2.cc and add some description

* 1. update dlpack url in .gitmodule
2. disable mkldnn fc bwd

* fix remaining NodePtr due to tvm update

* mv some code from mxnet_op.h to mxnet_op_kernel_assign.h to avoid WIN compiler error 'fatal error C1002: compiler is out of heap space in pass 2'

* fix WIN CPU build fail:compiler is out of heap space in pass 2

* fix WIN build fail

* fix lint

* add print for test bf16_concat

* fix bf16 test fail

* disable bf16 concat test

* tmp skip to root cause edge test halt

* fix bf16_bn test error

* enable test_bulk

* tmp rm bf16 to locate edge error

* Revert "tmp rm bf16 to locate edge error"

This reverts commit 7360246.

* add Apache license header

* trigger CI

* add robust for test bf16 bn

Co-authored-by: Zhennan Qin <[email protected]>
Co-authored-by: YixinBao <[email protected]>
Co-authored-by: Xinyu Chen <[email protected]>
Co-authored-by: Wuxun Zhang <[email protected]>
anirudh2290 pushed a commit to anirudh2290/mxnet that referenced this pull request May 29, 2020