LLM quantization is a technique for efficient computation that represents data in low-precision types such as 8-bit or 4-bit instead of float32, while minimizing the loss of accuracy. Typically, inputs are scaled by quantization constants so that they are normalized into the target range. After training, the minimum and maximum values of the float32 weights are mapped onto an integer range, dramatically reducing memory usage.
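As a concrete illustration, here is a minimal NumPy sketch of this min/max (affine) mapping to int8; the function names and toy weights are my own, not from any particular library:

```python
import numpy as np

def quantize_int8(w):
    """Affine quantization: map float32 values in [w.min(), w.max()]
    onto the int8 range [-128, 127]."""
    qmin, qmax = -128, 127
    w_min, w_max = float(w.min()), float(w.max())
    # Quantization constant (scale) and zero-point derived from the min/max range.
    scale = (w_max - w_min) / (qmax - qmin)
    zero_point = int(round(qmin - w_min / scale))
    q = np.clip(np.round(w / scale) + zero_point, qmin, qmax).astype(np.int8)
    return q, scale, zero_point

def dequantize_int8(q, scale, zero_point):
    # Recover an approximation of the original float32 values.
    return (q.astype(np.float32) - zero_point) * scale

w = np.random.randn(4, 4).astype(np.float32)
q, scale, zp = quantize_int8(w)
print(np.abs(w - dequantize_int8(q, scale, zp)).max())  # small rounding error
```

The round trip is lossy only up to the rounding error (roughly scale/2 per element), which is why the accuracy drop stays small when the value range is well behaved.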
-
Dynamic Quantization
Quantization happens on-the-fly during inference; the repeated conversion between floating-point and integer values can become a bottleneck.
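As a sketch, PyTorch's dynamic-quantization API shows the pattern: weights are converted to int8 up front, while activations are converted on the fly at each forward pass (the layer sizes below are arbitrary placeholders):

```python
import torch
import torch.nn as nn

model = nn.Sequential(nn.Linear(768, 3072), nn.ReLU(), nn.Linear(3072, 768)).eval()

# Weights become int8 ahead of time; activations are converted between
# float and int8 on-the-fly at inference, which is the bottleneck noted above.
qmodel = torch.quantization.quantize_dynamic(model, {nn.Linear}, dtype=torch.qint8)

with torch.no_grad():
    y = qmodel(torch.randn(1, 768))
```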
-
Static Quantization
Quantization is performed before inference by analyzing sample data to determine the quantization scheme (scales and zero-points). It has low potential for performance degradation, but the result depends on how representative the sample data is.
(1) Quantization-Aware Training (QAT): Quantization is simulated during training or additional fine-tuning, so the model learns to compensate for the quantization error.
(2) Post-Training Quantization (PTQ): Quantizes an already-trained model using a calibration dataset and a modest amount of compute.
- The focus here is static quantization, specifically PTQ; a sketch of the flow follows.
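A minimal sketch of the PTQ flow using PyTorch's eager-mode static quantization; the toy model and the random calibration batches stand in for a real model and calibration dataset:

```python
import torch
import torch.nn as nn

class SmallNet(nn.Module):
    def __init__(self):
        super().__init__()
        self.quant = torch.quantization.QuantStub()      # float -> int8 at the input
        self.fc1 = nn.Linear(128, 256)
        self.relu = nn.ReLU()
        self.fc2 = nn.Linear(256, 128)
        self.dequant = torch.quantization.DeQuantStub()  # int8 -> float at the output

    def forward(self, x):
        x = self.quant(x)
        x = self.fc2(self.relu(self.fc1(x)))
        return self.dequant(x)

model = SmallNet().eval()
model.qconfig = torch.quantization.get_default_qconfig("fbgemm")  # x86 backend
prepared = torch.quantization.prepare(model)  # insert observers

# Calibration: run representative samples so the observers record activation ranges.
with torch.no_grad():
    for _ in range(100):
        prepared(torch.randn(8, 128))

quantized = torch.quantization.convert(prepared)  # bake in scales/zero-points
```

Because the scales and zero-points are fixed at convert time, inference avoids the per-call conversion overhead of dynamic quantization, at the cost of depending on the calibration data.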