This repository has been archived by the owner on Nov 17, 2023. It is now read-only.

[MXNET-688] Fix quantization divide by zero errors #11833

Merged

Conversation

OneRaynyDay
Contributor

@OneRaynyDay OneRaynyDay commented Jul 20, 2018

Description

The current quantization strategy for calib_mode='entropy' is to calculate the KL divergence for different thresholds and choose the best threshold. This assumes the activation is a continuous random variable with nonzero density over all reals. Because we discretize the distribution, we smooth it over the range [-threshold, threshold]. What is not considered is that the entire sampled distribution may not lie in the range [-threshold, threshold], and thus we end up with all zeros in the sampled candidate p distribution inside _get_optimal_threshold.

I have added a check that the distribution (possibly unnormalized) is proper before attempting to smooth it; otherwise we run into a divide-by-zero error.
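For illustration, one way such a propriety check could look (a hypothetical sketch, not the PR's exact code):

import numpy as np

def is_proper(p):
    # A usable (possibly unnormalized) distribution needs nonnegative entries
    # and positive total mass; an all-zero p cannot be smoothed or normalized.
    p = np.asarray(p, dtype=np.float64)
    return p.size > 0 and bool(np.all(p >= 0)) and p.sum() > 0

is_proper([0, 0, 0, 0])   # False -> skip smoothing, treat the divergence as infinite
is_proper([0, 3, 1, 0])   # True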

In most cases, activation functions and layers for classification-type problems output numbers symmetric around 0. This is not the case for a regressor's last layer, and there are various other examples where the activation distribution is not centered around 0; this was a major blocker for Airbnb's adoption of MXNet's quantization capabilities.

Checklist

Essentials

Please feel free to remove inapplicable items for your PR.

  • The PR title starts with [MXNET-$JIRA_ID], where $JIRA_ID refers to the relevant JIRA issue created (except PRs with tiny changes)
  • Changes are complete (i.e. I finished coding on this PR)
  • All changes have test coverage:
  • Unit tests are added for small changes to verify correctness (e.g. adding a new operator)
  • Nightly tests are added for complicated/long-running ones (e.g. changing distributed kvstore)
  • Build tests will be added for build configuration changes (e.g. adding a new build option with NCCL)
  • Code is well-documented:
  • For user-facing API changes, API doc string has been updated.
  • For new C++ functions in header files, their functionalities and arguments are documented.
  • For new examples, README.md is added to explain what the example does, the source of the dataset, expected performance on the test set, and a reference to the original paper if applicable
  • Check the API doc at http://mxnet-ci-doc.s3-accelerate.dualstack.amazonaws.com/PR-$PR_ID/$BUILD_ID/index.html
  • To my best knowledge, examples are either not affected by this change, or have been fixed to be compatible with this change

Changes

  • Added tests and extra code inside quantization.py to set the KL divergence to infinity if the probability distribution formed is all zeros. Some typos and off-by-one errors in the original implementation are also fixed.

@OneRaynyDay OneRaynyDay requested a review from szha as a code owner July 20, 2018 00:40
for j in range(num_quantized_bins):
    start = j * num_merged_bins
    if j == num_quantized_bins - 1:
-       stop = -1
+       stop = len(is_nonzeros)
Contributor Author

This is an off-by-1 error that can be caught via the quantization tests that I added. Indexing a numpy array with x[a:-1] excludes the last element.
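For illustration (toy array, not from the PR), the difference the fix addresses:

import numpy as np

x = np.array([1, 2, 3, 4, 5])
x[2:-1]        # array([3, 4])     -- silently drops the last element
x[2:len(x)]    # array([3, 4, 5])  -- includes the last element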

    else:
        stop = start + num_merged_bins
    norm = is_nonzeros[start:stop].sum()
    if norm != 0:
        q[start:stop] = float(quantized_bins[j]) / float(norm)
-   q[sliced_nd_hist == 0] = 0
Contributor Author

This is not representative of the quantized distribution, as setting values to 0 artificially will not correctly represent the quantized activation output.

    else:
        stop = start + num_merged_bins
    norm = is_nonzeros[start:stop].sum()
    if norm != 0:
-       q[start:stop] = float(quantized_bins[j]) / float(norm)
-   q[sliced_nd_hist == 0] = 0
+       q[start:stop] = float(quantized_bins[j]) / float(num_quantized_bins)
Contributor Author

Originally this was float(norm), and that is not appropriate. Suppose you have the distribution:

[0, 0, ... , 0, 1]

If num_quantized_bins is 3, then you theoretically should get:

[0, 0, ... , 1/3, 1/3, 1/3]

instead of:

[0, 0, ..., 1, 1, 1]

To make this clearer, suppose your original distribution is:

[0, 0, 0, ... , 1/3, 1/3, 1/3]

After quantization this should be equivalent to the first distribution, but it isn't: under the original rule you get the same array back, [..., 1/3, 1/3, 1/3], while the first distribution gives you [..., 1, 1, 1], off by a multiplier.
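A toy sketch of this argument (variable names hypothetical, not the PR's code): both trailing groups carry the same total mass, so they should expand to the same quantized representation, which only happens when the mass is divided evenly across the merged bins.

import numpy as np

# The last quantized bin merges the three trailing source bins of each distribution.
group1 = np.array([0., 0., 1.])        # trailing bins of [0, 0, ..., 0, 1]
group2 = np.array([1/3, 1/3, 1/3])     # trailing bins of [0, ..., 1/3, 1/3, 1/3]
total = group1.sum()                   # == group2.sum() == 1

# Dividing by the count of nonzero source bins (float(norm)):
q1 = np.full_like(group1, total / (group1 != 0).sum())   # [1.0, 1.0, 1.0]
q2 = np.full_like(group2, total / (group2 != 0).sum())   # [0.33, 0.33, 0.33]

# Dividing evenly across the merged bins:
u1 = np.full_like(group1, total / group1.size)           # [0.33, 0.33, 0.33]
u2 = np.full_like(group2, total / group2.size)           # [0.33, 0.33, 0.33]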

Contributor

I'm not sure I understand your change here. The original implementation follows the explanation here (see page 38):
http://on-demand.gputechconf.com/gtc/2017/presentation/s7310-8-bit-inference-with-tensorrt.pdf

@@ -274,22 +279,21 @@ def _get_optimal_threshold(arr, num_bins=8001, num_quantized_bins=255):
    max_val = np.max(arr)
    th = max(abs(min_val), abs(max_val))

-   hist, hist_edeges = np.histogram(arr, bins=num_bins, range=(-th, th))
+   hist, hist_edges = np.histogram(arr, bins=num_bins, range=(-th, th))
Contributor Author

edges is misspelled as edeges throughout the code.

# at one edge: [0, 0, ..., 1000]. (histogram)
# We want to make sure that the optimal threshold in this case is the max.
arr = np.array([2]*1000)
res = mx.contrib.quant._get_optimal_threshold(arr, num_quantized_bins=5)
Contributor Author

Using the previous (incorrect) implementation of _get_optimal_threshold, this input would result in a divide-by-zero error here.
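A rough sketch of why this input hits the old failure mode (simplified, not the exact internals of _get_optimal_threshold):

import numpy as np

arr = np.array([2] * 1000)
th = max(abs(arr.min()), abs(arr.max()))                          # 2
hist, hist_edges = np.histogram(arr, bins=8001, range=(-th, th))
hist[-1]          # 1000 -- every sample lands in the very last bin
hist[:-1].sum()   # 0    -- every other bin is empty
# Any candidate slice that stops short of the last bin is all zeros, so smoothing
# or normalizing that candidate p divides by zero without the new guard.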

    try:
        q = _smooth_distribution(q)
    except ValueError:
        divergence[i - num_half_quantized_bins] = float("inf")
Contributor Author

If the distribution is improper, we set the KL divergence to infinity, as it could theoretically model a uniform distribution over [a, b] with either bound unbounded, for which the KL divergence is infinite.
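A minimal sketch of the kind of smoothing-and-guard the try/except relies on (hypothetical names, not the exact PR code):

import numpy as np

def smooth_distribution(p, eps=0.0001):
    # Move a small amount of mass onto the zero bins so the KL divergence stays
    # defined, and refuse to smooth a distribution that has no mass anywhere.
    is_zeros = (p == 0).astype(np.float32)
    is_nonzeros = (p != 0).astype(np.float32)
    n_zeros = is_zeros.sum()
    n_nonzeros = p.size - n_zeros
    if n_nonzeros == 0:
        # The caller catches this and records divergence = inf for the candidate.
        raise ValueError('all entries of the distribution are zero')
    eps1 = eps * float(n_zeros) / float(n_nonzeros)
    return p.astype(np.float32) + eps * is_zeros - eps1 * is_nonzeros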

@OneRaynyDay
Contributor Author

OneRaynyDay commented Jul 20, 2018

For reference:

With new PR, on imagenet, resnet 152, 5 batch, entropy method:

INFO:logger:Finished inference with 16000 images
INFO:logger:Finished with 178.458244 images per second
INFO:logger:('accuracy', 0.75525)
INFO:logger:('top_k_accuracy_5', 0.92275)

Previously:

INFO:logger:Finished inference with 16000 images
INFO:logger:Finished with 165.527678 images per second
INFO:logger:('accuracy', 0.7565)
INFO:logger:('top_k_accuracy_5', 0.9235)

This shows that there is no meaningful degradation in accuracy or throughput. @reminisce

    else:
        stop = start + num_merged_bins
    norm = is_nonzeros[start:stop].sum()
    if norm != 0:
        q[start:stop] = float(quantized_bins[j]) / float(norm)
-   q[sliced_nd_hist == 0] = 0
+   q[p == 0] = 0
Contributor Author

According to page 38 of the slides (http://on-demand.gputechconf.com/gtc/2017/presentation/s7310-8-bit-inference-with-tensorrt.pdf), the zeroed-out bins are meant to be taken w.r.t. the reference distribution p, rather than sliced_nd_hist.

@reminisce reminisce merged commit 55fef30 into apache:master Jul 24, 2018
XinYao1994 pushed a commit to XinYao1994/incubator-mxnet that referenced this pull request Aug 29, 2018
* Fix quantization bug

* Added tests and made sure the edge case is now considered correctly without 1 off errors

* Changed back to original truncated distribution but with different kl divergence calc

* Reorder back to original format

* Reorder back to original format (again)

* Change comments

* Clarified comments

* Changed norm division