
Todo List

  1. Simulated compression code for GSM8K (5-shot), BBH (3-shot), and AQuA (8-shot) with CoT prompts on LLaMA models ✔️
  2. Fused quantization kernel support for GEAR ✔️
  3. More CUDA kernel optimizations
  4. lm-harness support for GEAR
  5. Combining GEAR with other inference algorithms/systems
  6. Wrapping everything up as a Python package

GEAR: An Efficient KV Cache Compression Recipe for Near-Lossless Generative Inference of LLM

Official repository for GEAR: An Efficient Error Reduction Framework for KV Cache Compression in LLM Inference. GEAR is a plug-and-play, inference-only KV cache compression method. It augments any quantization scheme (e.g., KIVI, KCVT, and FlexGen) with an error-recovery solution to boost model accuracy while saving memory.

Here, GEAR is an acronym for Generative Inference with LLM via Approximation and Error Recovery.

Overview

GEAR is an efficient KV cache compression framework that achieves near-lossless, high-ratio compression. GEAR first quantizes the majority of entries, which share similar magnitudes, to ultra-low precision. It then employs a low-rank matrix to approximate the quantization error and a sparse matrix to remedy the individual errors introduced by outlier entries.

Unlike other low-bit compression algorithms, GEAR does not need to keep any of the first or last tokens uncompressed to achieve near-lossless KV cache compression for LLMs.
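To make the recipe concrete, below is a minimal, self-contained sketch of the three-part decomposition in PyTorch. The function names, hyperparameters (bit width, rank, outlier ratio), and the exact extraction order are illustrative assumptions for exposition, not the repo's actual API:

```python
import torch

def gear_compress(kv, bits=4, rank=4, outlier_ratio=0.01):
    """Sketch: quantize most entries to ultra-low precision, patch outliers
    with a sparse matrix, and approximate the residual quantization error
    with a low-rank matrix. `kv` is a 2D slice of the cache (tokens x dim)."""
    # 1. Pull the largest-magnitude entries into a sparse outlier matrix,
    #    so the remaining entries share similar magnitudes.
    k = max(1, int(outlier_ratio * kv.numel()))
    flat = kv.flatten()
    idx = flat.abs().topk(k).indices
    sparse = torch.zeros_like(flat)
    sparse[idx] = flat[idx]
    sparse = sparse.view_as(kv)
    body = kv - sparse

    # 2. Uniformly quantize the remaining entries to `bits` precision
    #    (kept as floats here for simplicity; a real kernel would pack
    #    them into `bits`-wide integers).
    zero = body.min()
    scale = (body.max() - zero) / (2**bits - 1)
    q = torch.round((body - zero) / scale)

    # 3. Approximate the quantization error with a rank-`rank` matrix
    #    obtained from a truncated SVD of the residual.
    err = body - (q * scale + zero)
    U, S, Vh = torch.linalg.svd(err, full_matrices=False)
    low_rank = U[:, :rank] @ torch.diag(S[:rank]) @ Vh[:rank, :]

    return q, scale, zero, low_rank, sparse

def gear_decompress(q, scale, zero, low_rank, sparse):
    # Recovered cache = dequantized body + low-rank error + sparse outliers.
    return q * scale + zero + low_rank + sparse

# Usage: compress one head's key cache and inspect the reconstruction error.
kv = torch.randn(128, 64)
parts = gear_compress(kv)
print("max abs error:", (gear_decompress(*parts) - kv).abs().max().item())
```

The low-rank and sparse terms are kept alongside the quantized tensor, so the memory overhead stays small while most of the quantization error is recovered.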


How to use GEAR

conda create -n GEAR python==3.10
conda activate GEAR
pip install -r requirements.txt
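Before using the fused-kernel path (cuda_supported_gear), it helps to confirm that the environment has a CUDA-enabled PyTorch build; a quick check (assuming requirements.txt installs torch):

```python
import torch

# The fused GEAR-KIVI kernels run on GPU only, so a CUDA build of
# PyTorch and a visible CUDA device are prerequisites.
print("torch version:", torch.__version__)
print("CUDA available:", torch.cuda.is_available())
if torch.cuda.is_available():
    print("device:", torch.cuda.get_device_name(0))
```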

Repository architecture

.
├── GenerationBench
└── cuda_supported_gear

cuda_supported_gear: GEAR-KIVI implementation with fused-kernel support.

GenerationBench: simulated compression, tested on fine-tuned and non-fine-tuned models with the BBH, GSM8K, and AQuA datasets.

Developers

Citation

Version 2 of the paper will be posted soon; the current version is v1. Link to paper: https://arxiv.org/abs/2403.05527

@misc{kang2024gear,
      title={GEAR: An Efficient KV Cache Compression Recipe for Near-Lossless Generative Inference of LLM}, 
      author={Hao Kang and Qingru Zhang and Souvik Kundu and Geonhwa Jeong and Zaoxing Liu and Tushar Krishna and Tuo Zhao},
      year={2024},
      eprint={2403.05527},
      archivePrefix={arXiv},
      primaryClass={cs.LG}
}

Contributing

We welcome everyone to contribute to this repository by raising PRs. If you encounter any problem, you can also email [email protected].

Disclaimer

This “research quality code” is for non-commercial purposes and provided by the contributors “as is” without any express or implied warranty of any kind. The organizations involved (Intel or Georgia Tech) do not own the rights to this data set and do not confer any rights to it. The organizations (Intel or Georgia Tech) do not warrant or assume responsibility for the accuracy or completeness of any information, text, graphics, links, or other items within the code. A thorough security review has not been performed on this code. Additionally, this repository may contain components that are out of date or contain known security vulnerabilities.
