SOTA low-bit LLM quantization (INT8/FP8/INT4/FP4/NF4) & sparsity; leading model compression techniques on TensorFlow, PyTorch, and ONNX Runtime
sparsity
pruning
quantization
knowledge-distillation
auto-tuning
int8
low-precision
quantization-aware-training
post-training-quantization
awq
int4
large-language-models
gptq
smoothquant
sparsegpt
fp4
mxformat
Updated Nov 7, 2024 - Python