Stars
To speed up long-context LLM inference, attention is computed approximately with dynamic sparsity, which reduces pre-filling latency by up to 10x on an A100 while maintaining accuracy.
A high-performance inference system for large language models, designed for production environments.
Material for cuda-mode lectures
Development repository for the Triton-Linalg conversion
This is the official PyTorch implementation of "LLMC: Benchmarking Large Language Model Quantization with a Versatile Compression Toolkit".
Sequence Parallel Attention for Long Context LLM Model Training and Inference
QServe: W4A8KV4 Quantization and System Co-design for Efficient LLM Serving
KVQuant: Towards 10 Million Context Length LLM Inference with KV Cache Quantization
AutoAWQ implements the AWQ algorithm for 4-bit quantization with a 2x speedup during inference (a generic group-wise INT4 sketch follows this list).
The code of our paper "InfLLM: Unveiling the Intrinsic Capacity of LLMs for Understanding Extremely Long Sequences with Training-Free Memory"
FlashInfer: Kernel Library for LLM Serving
Simple and efficient pytorch-native transformer text generation in <1000 LOC of python.
SGLang is yet another fast serving framework for large language models and vision language models.
📖 A curated list of awesome LLM inference papers with code: TensorRT-LLM, vLLM, streaming-llm, AWQ, SmoothQuant, WINT8/4, Continuous Batching, FlashAttention, PagedAttention, etc.
Experiments on speculative sampling with Llama models (see the speculative-sampling sketch after this list).
High-speed Large Language Model Serving on PCs with Consumer-grade GPUs
A high-performance inference optimization framework for HuggingFace Diffusers on NVIDIA GPUs.
Optimized BERT transformer inference on NVIDIA GPUs. https://arxiv.org/abs/2210.03052
An open platform for training, serving, and evaluating large language models. Release repo for Vicuna and Chatbot Arena.
Retrieval and Retrieval-augmented LLMs
⚡ Build your chatbot within minutes on your favorite device; offers SOTA compression techniques for LLMs; runs LLMs efficiently on Intel platforms ⚡
Making large AI models cheaper, faster and more accessible
DeepSpeed is a deep learning optimization library that makes distributed training and inference easy, efficient, and effective.
MII makes low-latency and high-throughput inference possible, powered by DeepSpeed.
FlexFlow Serve: Low-Latency, High-Performance LLM Serving
LightLLM is a Python-based LLM (Large Language Model) inference and serving framework, notable for its lightweight design, easy scalability, and high-speed performance.
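Several of the starred projects above (AutoAWQ, QServe, KVQuant) revolve around low-bit weight or KV-cache quantization. The snippet below is a minimal NumPy sketch of generic group-wise asymmetric INT4 weight quantization, the basic weight-only scheme such methods build on; it deliberately omits AWQ's activation-aware scaling, and the function names, shapes, and default group size are illustrative assumptions, not any repository's actual API.

```python
import numpy as np

def quantize_4bit_groupwise(w, group_size=128):
    """Asymmetric 4-bit group-wise quantization of a weight matrix (rows = output channels).

    Each contiguous group of `group_size` weights along the input dimension shares one
    scale and one zero-point, as in typical weight-only INT4 schemes (sketch only).
    """
    out_dim, in_dim = w.shape
    assert in_dim % group_size == 0
    g = w.reshape(out_dim, in_dim // group_size, group_size)
    w_min = g.min(axis=-1, keepdims=True)
    w_max = g.max(axis=-1, keepdims=True)
    scale = (w_max - w_min) / 15.0                       # 4 bits -> 16 levels (0..15)
    scale = np.where(scale == 0, 1e-8, scale)            # avoid division by zero for flat groups
    zero = np.round(-w_min / scale)                      # per-group zero-point
    q = np.clip(np.round(g / scale) + zero, 0, 15).astype(np.uint8)
    return q, scale, zero

def dequantize_4bit_groupwise(q, scale, zero):
    """Reconstruct an approximate float weight matrix from the 4-bit codes."""
    g = (q.astype(np.float32) - zero) * scale
    return g.reshape(g.shape[0], -1)
```

Real systems additionally pack two 4-bit codes per byte and fuse dequantization into the matmul kernel; the sketch keeps the codes unpacked for clarity.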
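For the speculative-sampling entry above, the following is a minimal NumPy sketch of the standard accept/resample rule (draft tokens are accepted with probability min(1, p_target/p_draft); on rejection, a token is drawn from the renormalized residual max(0, p_target - p_draft)). It is not code from that repository; the function name, array shapes, and toy interface are assumptions for illustration.

```python
import numpy as np

def speculative_sample(target_probs, draft_probs, draft_tokens, rng):
    """One round of speculative sampling over K draft tokens (sketch).

    target_probs: (K+1, V) target-model distributions, one per draft position
                  plus one extra position for the bonus token.
    draft_probs:  (K, V) draft-model distributions used to propose draft_tokens.
    draft_tokens: (K,) tokens sampled from the draft model.
    Returns the accepted tokens; at least one token is always emitted.
    """
    accepted = []
    K = len(draft_tokens)
    for i in range(K):
        tok = int(draft_tokens[i])
        p, q = target_probs[i, tok], draft_probs[i, tok]
        if rng.random() < min(1.0, p / max(q, 1e-12)):
            accepted.append(tok)                          # accept the draft token
            continue
        # Reject: resample from the residual distribution max(0, p_target - p_draft).
        residual = np.clip(target_probs[i] - draft_probs[i], 0.0, None)
        total = residual.sum()
        probs = residual / total if total > 0 else target_probs[i]
        accepted.append(int(rng.choice(len(probs), p=probs)))
        return accepted
    # All K draft tokens accepted: sample one bonus token from the target model.
    accepted.append(int(rng.choice(target_probs.shape[1], p=target_probs[K])))
    return accepted

# Usage sketch: rng = np.random.default_rng(0); distributions would come from real models.
```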