Create Float16 and Mixed Precision Models
Converting a model to use float16 instead of float32 can reduce the model size by up to half and improve performance on some GPUs. Some accuracy may be lost, but for many models the reduced accuracy is acceptable. Tuning data is not needed for float16 conversion, which can make it preferable to quantization.
Float16 Conversion
Convert a model to float16 by following these steps:
- Install onnx and onnxconverter-common:

  ```bash
  pip install onnx onnxconverter-common
  ```

- Use the `convert_float_to_float16` function in Python:

  ```python
  import onnx
  from onnxconverter_common import float16

  model = onnx.load("path/to/model.onnx")
  model_fp16 = float16.convert_float_to_float16(model)
  onnx.save(model_fp16, "path/to/model_fp16.onnx")
  ```
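To get a quick sense of the accuracy impact, you can run the original and converted models on the same input and compare their outputs. The snippet below is a minimal sketch, not part of the tool: it assumes a GPU-enabled ONNX Runtime build (e.g. the CUDAExecutionProvider), since the CPU build may not support all float16 ops, and it uses a placeholder input name and shape that you would replace with your model's.

```python
import numpy as np
import onnxruntime as ort

# Placeholder input; replace the shape with your model's actual input shape.
x = np.random.rand(1, 3, 224, 224).astype(np.float32)

sess_fp32 = ort.InferenceSession("path/to/model.onnx", providers=["CUDAExecutionProvider"])
sess_fp16 = ort.InferenceSession("path/to/model_fp16.onnx", providers=["CUDAExecutionProvider"])

input_name = sess_fp32.get_inputs()[0].name
out_fp32 = sess_fp32.run(None, {input_name: x})[0]
# With the default keep_io_types=False, the converted model expects float16 inputs.
out_fp16 = sess_fp16.run(None, {input_name: x.astype(np.float16)})[0]

print("max abs difference:", np.max(np.abs(out_fp32 - out_fp16.astype(np.float32))))
```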
Float16 Tool Arguments
If the converted model does not work or has poor accuracy, you may need to set additional arguments.
```python
convert_float_to_float16(model, min_positive_val=1e-7, max_finite_val=1e4, keep_io_types=False,
                         disable_shape_infer=False, op_block_list=None, node_block_list=None)
```
- `model`: The ONNX model to convert.
- `min_positive_val`, `max_finite_val`: Constant values will be clipped to these bounds. `0.0`, `nan`, `inf`, and `-inf` will be unchanged.
- `keep_io_types`: Whether model inputs/outputs should be left as float32.
- `disable_shape_infer`: Skips running onnx shape/type inference. Useful if shape inference is crashing, shapes/types are already present in the model, or types are not needed (types are used to determine where cast ops are needed for unsupported/blocked ops).
- `op_block_list`: List of op types to leave as float32. By default uses the list from `float16.DEFAULT_OP_BLOCK_LIST`, which contains ops that are not supported for float16 in ONNX Runtime.
- `node_block_list`: List of node names to leave as float32.
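For example, if a particular op type or node produces poor results in float16, it can be kept in float32 via the block lists. The snippet below is a sketch: the op type and node name are hypothetical placeholders, and the default block list is extended rather than replaced.

```python
import onnx
from onnxconverter_common import float16

model = onnx.load("path/to/model.onnx")
model_fp16 = float16.convert_float_to_float16(
    model,
    keep_io_types=True,  # keep float32 inputs/outputs
    # Extend the default block list with a hypothetical op type to keep in float32.
    op_block_list=float16.DEFAULT_OP_BLOCK_LIST + ["Resize"],
    # Hypothetical node name to keep in float32.
    node_block_list=["/head/final_matmul"],
)
onnx.save(model_fp16, "path/to/model_fp16.onnx")
```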
NOTE: Blocked ops will have casts inserted around them to convert to/from float16 and float32. Currently, if two blocked ops are next to each other, the casts are still inserted, creating a redundant pair. ORT will optimize this pair out at runtime, so the results will remain at full precision.
Mixed Precision
If float16 conversion is giving poor results, you can convert most of the ops to float16 but leave some in float32. The `auto_mixed_precision.auto_convert_mixed_precision` tool finds a minimal set of ops to skip while retaining a certain level of accuracy. You will need to provide a sample input for the model.
Since the CPU version of ONNX Runtime doesn’t support float16 ops and the tool needs to measure the accuracy loss, the mixed precision tool must be run on a device with a GPU.
```python
from onnxconverter_common import auto_mixed_precision
import onnx

model = onnx.load("path/to/model.onnx")

# Assuming x is the input to the model
feed_dict = {'input': x.numpy()}

model_fp16 = auto_mixed_precision.auto_convert_mixed_precision(model, feed_dict, rtol=0.01, atol=0.001, keep_io_types=True)
onnx.save(model_fp16, "path/to/model_fp16.onnx")
```
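As noted above, the tool needs a GPU-capable ONNX Runtime install to measure accuracy. A quick way to confirm that before running it (assuming the CUDA build, e.g. the onnxruntime-gpu package) is to check the available execution providers:

```python
import onnxruntime as ort

# The mixed precision tool needs a GPU execution provider (e.g. CUDA) to run float16 ops.
providers = ort.get_available_providers()
print(providers)
assert "CUDAExecutionProvider" in providers, "Install a GPU build of onnxruntime (e.g. onnxruntime-gpu)"
```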
Mixed Precision Tool Arguments
```python
auto_convert_mixed_precision(model, feed_dict, validate_fn=None, rtol=None, atol=None, keep_io_types=False)
```
- `model`: The ONNX model to convert.
- `feed_dict`: Test data used to measure the accuracy of the model during conversion. Format is similar to `InferenceSession.run` (map of input names to values).
- `validate_fn`: A function accepting two lists of numpy arrays (the outputs of the float32 model and the mixed-precision model, respectively) that returns `True` if the results are sufficiently close and `False` otherwise. Can be used instead of or in addition to `rtol` and `atol`.
- `rtol`, `atol`: Relative and absolute tolerances used for validation. See `numpy.allclose` for more information.
- `keep_io_types`: Whether model inputs/outputs should be left as float32.
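As an alternative to `rtol`/`atol`, a custom `validate_fn` can encode whatever accuracy criterion matters for your model. The sketch below continues the earlier example (reusing `model` and `feed_dict`); the top-1 agreement check is just an illustrative choice, not a recommended default.

```python
import numpy as np

def validate_fn(fp32_outputs, fp16_outputs):
    # Accept a conversion step only if the top-1 predictions agree for every output.
    for ref, test in zip(fp32_outputs, fp16_outputs):
        if not np.array_equal(np.argmax(ref, axis=-1), np.argmax(test, axis=-1)):
            return False
    return True

model_fp16 = auto_mixed_precision.auto_convert_mixed_precision(
    model, feed_dict, validate_fn=validate_fn, keep_io_types=True
)
```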
The mixed precision tool works by converting clusters of ops to float16. If a cluster fails validation, it is split in half and the two halves are tried independently. A visualization of the cluster sizes is printed as the tool works.