Stars
[CVPR 2023] DepGraph: Towards Any Structural Pruning; LLMs / SAM / Diffusion / Transformers / YOLOv8 / CNNs
TinyChatEngine: On-Device LLM Inference Library
[MLSys 2024 Best Paper Award] AWQ: Activation-aware Weight Quantization for LLM Compression and Acceleration
[ICLR 2022] "As-ViT: Auto-scaling Vision Transformers without Training" by Wuyang Chen, Wei Huang, Xianzhi Du, Xiaodan Song, Zhangyang Wang, Denny Zhou
[ICLR 2022] The Unreasonable Effectiveness of Random Pruning: Return of the Most Naive Baseline for Sparse Training by Shiwei Liu, Tianlong Chen, Xiaohan Chen, Li Shen, Decebal Constantin Mocanu, Z…
[ICML 2023] UPop: Unified and Progressive Pruning for Compressing Vision-Language Transformers.
Flash-LLM: Enabling Cost-Effective and Highly-Efficient Large Generative Model Inference with Unstructured Sparsity
Implementation of the ICML 2023 paper "Specializing Smaller Language Models towards Multi-Step Reasoning".
OTOv1-v3 (NeurIPS, ICLR, TMLR): DNN training and compression via structured pruning and operator erasing, for CNNs, diffusion models, and LLMs.
Medusa: Simple Framework for Accelerating LLM Generation with Multiple Decoding Heads
[ICLR 2024] Efficient Streaming Language Models with Attention Sinks
PyTorch implementation of the paper "Response Length Perception and Sequence Scheduling: An LLM-Empowered LLM Inference Pipeline".
[ICML 2024] CrossGET: Cross-Guided Ensemble of Tokens for Accelerating Vision-Language Transformers.
[EMNLP 2023 Industry Track] A simple prompting approach that enables LLMs to run inference in batches.
Code for "Lion: Adversarial Distillation of Proprietary Large Language Models (EMNLP 2023)"
Official PyTorch implementation of "Outlier Weighed Layerwise Sparsity (OWL): A Missing Secret Sauce for Pruning LLMs to High Sparsity"
⚡ Build a chatbot within minutes on your favorite device; offers SOTA compression techniques for LLMs; runs LLMs efficiently on Intel platforms ⚡
[ICML 2024] SqueezeLLM: Dense-and-Sparse Quantization
Code for the ICLR 2023 paper "GPTQ: Accurate Post-training Quantization of Generative Pretrained Transformers".
Accessible large language models via k-bit quantization for PyTorch.
[TMLR 2024] Efficient Large Language Models: A Survey
Awesome LLM compression research papers and tools.