Stars
Code for QuaRot, enabling end-to-end 4-bit inference of large language models.
Fast Hadamard transform in CUDA, with a PyTorch interface
This is the official PyTorch implementation of "LLMC: Benchmarking Large Language Model Quantization with a Versatile Compression Toolkit".
📖 A curated list of awesome LLM inference papers with code: TensorRT-LLM, vLLM, streaming-llm, AWQ, SmoothQuant, WINT8/4, continuous batching, FlashAttention, PagedAttention, etc.
Building a quick conversation-based search demo with Lepton AI.
FP16xINT4 LLM inference kernel that achieves near-ideal ~4x speedups at batch sizes up to 16-32 tokens.
Official implementation of Half-Quadratic Quantization (HQQ)
Built upon Megatron-DeepSpeed and the HuggingFace Trainer, EasyLLM reorganizes the code logic with a focus on usability while preserving training efficiency.
Code for paper: "QuIP: 2-Bit Quantization of Large Language Models With Guarantees"
Code and documents of LongLoRA and LongAlpaca (ICLR 2024 Oral)
The Triton TensorRT-LLM Backend
Awesome LLM compression research papers and tools.
TensorRT-LLM provides users with an easy-to-use Python API to define Large Language Models (LLMs) and build TensorRT engines that contain state-of-the-art optimizations to perform inference efficiently.
A more memory-efficient rewrite of the HF transformers implementation of Llama for use with quantized weights.
A parser, editor and profiler tool for ONNX models.
[MLSys 2024 Best Paper Award] AWQ: Activation-aware Weight Quantization for LLM Compression and Acceleration
A fully compliant RISC-V computer made inside the game Terraria
Code for the NeurIPS 2022 paper "Optimal Brain Compression: A Framework for Accurate Post-Training Quantization and Pruning".
A high-throughput and memory-efficient inference and serving engine for LLMs
An attempt to answer the age-old interview question "What happens when you type google.com into your browser and press enter?"
Universal LLM Deployment Engine with ML Compilation
An easy-to-use LLM quantization package with user-friendly APIs, based on the GPTQ algorithm.
A framework for few-shot evaluation of language models.
Development repository for the Triton language and compiler
Accessible large language models via k-bit quantization for PyTorch.
QLoRA: Efficient Finetuning of Quantized LLMs
An open platform for training, serving, and evaluating large language models. Release repo for Vicuna and Chatbot Arena.
DeepSpeed is a deep learning optimization library that makes distributed training and inference easy, efficient, and effective.