LLM Quantization Using MLX

What is LLM Quantization?

LLM quantization is a technique for performing efficient computation by representing data in low-precision types such as 4-bit or 8-bit integers instead of float32, with little to no loss of accuracy. Typically, inputs are scaled by a quantization constant so that they fall into a normalized range. After training, the minimum and maximum values of the float32 weights are used to map the weights onto an integer range, dramatically reducing memory usage.
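
As a concrete illustration, the sketch below maps a float32 weight tensor onto 8-bit integers using its minimum and maximum values (an asymmetric affine scheme). The function names and the 8-bit choice are illustrative and are not taken from this repository.

```python
import numpy as np

def quantize_minmax_uint8(w: np.ndarray):
    """Map float32 weights onto [0, 255] using their min/max range."""
    w_min, w_max = float(w.min()), float(w.max())
    scale = (w_max - w_min) / 255.0              # quantization constant
    w_q = np.round((w - w_min) / scale).astype(np.uint8)
    return w_q, scale, w_min

def dequantize_uint8(w_q: np.ndarray, scale: float, w_min: float) -> np.ndarray:
    """Approximately recover the original float32 weights."""
    return w_q.astype(np.float32) * scale + w_min

w = np.random.randn(4, 8).astype(np.float32)
w_q, scale, w_min = quantize_minmax_uint8(w)
w_hat = dequantize_uint8(w_q, scale, w_min)
print(np.abs(w - w_hat).max())                   # small reconstruction error, 4x less memory
```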

Quantization Methods

  • Dynamic Quantization

    Quantization happens on the fly during inference, which creates a bottleneck in the repeated conversion between floating point and integers.

  • Static Quantization

    Quantization is performed before inference by analyzing sample data and fixing a quantization scheme. It has a low risk of performance degradation but is sensitive to the sample data. (A minimal sketch contrasting the two approaches follows this list.)

    (1) Quantization Aware Training (QAT): quantization is applied during training or additional fine-tuning, so the model learns to compensate for the reduced precision.

    (2) Post-training Quantization (PTQ): quantizes an already trained model using a calibration dataset and a modest amount of computational resources.
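
To make the dynamic/static distinction concrete, here is a minimal NumPy sketch of an int8 linear layer: the dynamic version recomputes the activation scale at every call, while the static version fixes it beforehand from calibration samples. The symmetric int8 scheme and all names are illustrative assumptions, not code from this repository.

```python
import numpy as np

def quantize_int8(x, scale):
    return np.clip(np.round(x / scale), -128, 127).astype(np.int8)

# Dynamic: the activation scale is computed on the fly for every call, which
# adds float<->integer conversion overhead during inference (the bottleneck above).
def dynamic_linear(x, w_q, w_scale):
    x_scale = np.abs(x).max() / 127.0                  # computed at inference time
    x_q = quantize_int8(x, x_scale)
    return (x_q.astype(np.int32) @ w_q.astype(np.int32).T) * (x_scale * w_scale)

# Static: the activation scale is fixed ahead of time from calibration samples,
# so inference skips the range analysis but the result depends on those samples.
def calibrate_scale(samples):
    return max(np.abs(s).max() for s in samples) / 127.0

def static_linear(x, w_q, w_scale, x_scale):
    x_q = quantize_int8(x, x_scale)
    return (x_q.astype(np.int32) @ w_q.astype(np.int32).T) * (x_scale * w_scale)
```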

About my code (covert.py)

  • Static quantization via PTQ; a hedged example of the conversion call is sketched below.
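
Assuming covert.py performs the conversion through the mlx-lm package (an assumption; the actual script may call MLX directly), a PTQ conversion to 4-bit weights typically looks like the following sketch. The model id and keyword values are placeholders.

```python
# Sketch only: assumes the mlx-lm package; the model id and settings are placeholders
# and may not match what covert.py actually does.
from mlx_lm import convert

convert(
    hf_path="meta-llama/Llama-2-7b-hf",  # assumed Hugging Face model id
    mlx_path="mlx_model_q4",             # output directory for the quantized weights
    quantize=True,                       # enable post-training quantization
    q_bits=4,                            # store weights in 4 bits
    q_group_size=64,                     # one scale/bias per group of 64 weights
)
```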

Principles of PTQ, 4-bit Quantization
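
The core idea of 4-bit PTQ is group-wise affine quantization: each group of consecutive weights gets its own scale and bias derived from that group's value range, which keeps the error small even at 4 bits. A minimal sketch using MLX's quantization primitives (assuming the mx.quantize / mx.dequantize functions in mlx.core) is:

```python
import mlx.core as mx

w = mx.random.normal((512, 512))        # a float32 weight matrix

# Group-wise 4-bit quantization: every group of 64 consecutive weights shares
# one scale and one bias computed from that group's range.
w_q, scales, biases = mx.quantize(w, group_size=64, bits=4)

# Dequantize to inspect the reconstruction error introduced by 4-bit storage.
w_hat = mx.dequantize(w_q, scales, biases, group_size=64, bits=4)
print(mx.abs(w - w_hat).max())
```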

References:
[1] https://deeplearning-lab.com/ai/llm-quantization/
[2] https://computing-jhson.tistory.com/65
[3] https://arxiv.org/abs/2310.16836
