
To speed up long-context LLM inference, attention is computed approximately with dynamic sparsity, reducing pre-filling latency by up to 10x on an A100 while maintaining accuracy.

Python 636 22 Updated Aug 13, 2024
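
A minimal sketch of the dynamic block-sparse attention idea this repo describes: estimate which key blocks matter for each query block from cheap pooled representations, then attend only over those blocks. The mean-pooled block scoring, block size, top-k, and the non-causal setup are all illustrative assumptions, not the repo's actual sparsity patterns.

```python
import numpy as np

def block_sparse_attention(q, k, v, block=64, topk=2):
    """q, k, v: (seq_len, d), seq_len assumed divisible by block (a toy simplification)."""
    n, d = q.shape
    nb = n // block
    # Pool each block to one vector and score query-block/key-block pairs cheaply.
    qb = q.reshape(nb, block, d).mean(axis=1)              # (nb, d)
    kb = k.reshape(nb, block, d).mean(axis=1)              # (nb, d)
    keep = np.argsort(-(qb @ kb.T), axis=1)[:, :topk]      # top-k key blocks per query block

    out = np.zeros_like(q)
    for i in range(nb):
        qi = q[i * block:(i + 1) * block]
        # Gather only the selected key/value blocks and run exact softmax over them.
        idx = np.concatenate([np.arange(j * block, (j + 1) * block) for j in keep[i]])
        s = (qi @ k[idx].T) / np.sqrt(d)
        p = np.exp(s - s.max(axis=1, keepdims=True))
        out[i * block:(i + 1) * block] = (p / p.sum(axis=1, keepdims=True)) @ v[idx]
    return out

q = np.random.randn(512, 32); k = np.random.randn(512, 32); v = np.random.randn(512, 32)
print(block_sparse_attention(q, k, v).shape)  # (512, 32), touching only topk/nb of the keys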

A high-performance inference system for large language models, designed for production environments.

C++ 352 24 Updated Aug 14, 2024

Material for cuda-mode lectures

Jupyter Notebook 2,128 218 Updated Aug 11, 2024

Development repository for the Triton-Linalg conversion

C++ 125 10 Updated Aug 1, 2024

This is the official PyTorch implementation of "LLMC: Benchmarking Large Language Model Quantization with a Versatile Compression Toolkit".

Python 194 21 Updated Aug 14, 2024

Sequence Parallel Attention for Long Context LLM Model Training and Inference

Python 260 10 Updated Jun 27, 2024
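
A toy, single-process sketch of the sequence-parallel (ring-style) attention pattern named above: K/V are sharded into chunks, as if across devices, and each query tile accumulates exact softmax attention chunk-by-chunk with the online log-sum-exp trick, so no rank ever needs the full sequence. Names and chunking are illustrative assumptions, not the repo's actual API.

```python
import numpy as np

def ring_attention(q, k_chunks, v_chunks):
    d = q.shape[-1]
    m = np.full(q.shape[0], -np.inf)          # running max per query
    l = np.zeros(q.shape[0])                  # running softmax normalizer
    acc = np.zeros_like(q)                    # running weighted sum of values
    for kc, vc in zip(k_chunks, v_chunks):    # one "ring step" per K/V shard
        s = q @ kc.T / np.sqrt(d)
        m_new = np.maximum(m, s.max(axis=1))
        scale = np.exp(m - m_new)             # rescale previous partial results
        p = np.exp(s - m_new[:, None])
        l = l * scale + p.sum(axis=1)
        acc = acc * scale[:, None] + p @ vc
        m = m_new
    return acc / l[:, None]

q = np.random.randn(8, 16)
k = np.random.randn(32, 16); v = np.random.randn(32, 16)
out = ring_attention(q, np.split(k, 4), np.split(v, 4))
# Matches dense softmax attention up to float error:
s = q @ k.T / 4.0
p = np.exp(s - s.max(axis=1, keepdims=True)); p /= p.sum(axis=1, keepdims=True)
assert np.allclose(out, p @ v, atol=1e-6)
```
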
Python 91 4 Updated Jun 12, 2024

QServe: W4A8KV4 Quantization and System Co-design for Efficient LLM Serving

Python 371 15 Updated Aug 13, 2024
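
A rough NumPy illustration of the W4A8 numerics named in QServe's title: weights quantized to 4-bit integers per output channel, activations to 8-bit per tensor, with the matmul done in integers and rescaled once at the end. Group-wise scales and the KV4 part are omitted; this is a sketch of the idea, not QServe's kernels.

```python
import numpy as np

def quant_w4(w):
    # Per-output-channel symmetric int4: 15 levels in [-8, 7] plus a float scale.
    s = np.abs(w).max(axis=1, keepdims=True) / 7.0
    return np.clip(np.round(w / s), -8, 7).astype(np.int8), s

def quant_a8(x):
    # Per-tensor symmetric int8.
    s = np.abs(x).max() / 127.0
    return np.clip(np.round(x / s), -128, 127).astype(np.int8), s

w = np.random.randn(64, 128); x = np.random.randn(4, 128)
wq, ws = quant_w4(w)
xq, xs = quant_a8(x)
y_int = xq.astype(np.int32) @ wq.T.astype(np.int32)   # integer GEMM
y = y_int * xs * ws.T                                  # dequantize once at the end
print("mean abs error:", np.abs(y - x @ w.T).mean())   # small relative to |y|
```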

KVQuant: Towards 10 Million Context Length LLM Inference with KV Cache Quantization

Python 254 21 Updated Aug 13, 2024
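
A small sketch of the KV-cache quantization idea in KVQuant's title: cached keys are quantized per-channel and values per-token to low-bit integers, then dequantized on the fly at attention time. The 4-bit levels and axis choices follow the paper's high-level recipe; everything else (shapes, no outlier handling) is a simplifying assumption.

```python
import numpy as np

def quantize(x, axis, bits=4):
    """Symmetric low-bit quantization along the given axis; returns ints and scales."""
    qmax = 2 ** (bits - 1) - 1
    s = np.abs(x).max(axis=axis, keepdims=True) / qmax
    s[s == 0] = 1.0
    return np.clip(np.round(x / s), -qmax - 1, qmax).astype(np.int8), s

k = np.random.randn(1024, 128)                 # (cached tokens, head_dim)
v = np.random.randn(1024, 128)
kq, ks = quantize(k, axis=0)                   # per-channel: key outliers cluster by channel
vq, vs = quantize(v, axis=1)                   # per-token for values
k_hat, v_hat = kq * ks, vq * vs                # dequantize when attending
print(np.abs(k - k_hat).mean(), np.abs(v - v_hat).mean())
```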

AutoAWQ implements the AWQ algorithm for 4-bit quantization, with a 2x speedup during inference.

Python 1,551 182 Updated Aug 14, 2024
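
A hedged sketch of the core AWQ intuition that AutoAWQ implements: input channels that see large activations are protected by scaling their weights up (and implicitly scaling activations down) before 4-bit quantization, shrinking their relative rounding error. The s = mean(|x|)^0.5 rule is one simple heuristic from the paper family, used here purely for illustration; real AWQ searches the scale and quantizes group-wise.

```python
import numpy as np

rng = np.random.default_rng(0)

def quant4(w):
    # Per-row symmetric 4-bit fake quantization (quantize, then dequantize).
    s = np.abs(w).max(axis=1, keepdims=True) / 7.0
    return np.clip(np.round(w / s), -8, 7) * s

# Activations with a few "hot" input channels, as AWQ assumes.
x = rng.normal(size=(256, 128)) * (1 + 9 * (rng.random(128) > 0.95))
w = rng.normal(size=(64, 128))

scale = np.abs(x).mean(axis=0) ** 0.5          # per-input-channel protection scale
w_awq = quant4(w * scale) / scale              # quantize scaled weights, fold scale back
w_rtn = quant4(w)                              # plain round-to-nearest baseline

# The scaled variant typically shows lower activation-weighted error:
print("RTN error:", np.abs(x @ w.T - x @ w_rtn.T).mean())
print("AWQ-style error:", np.abs(x @ w.T - x @ w_awq.T).mean())
```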

Grok open release

Python 49,275 8,311 Updated Aug 7, 2024

The code of our paper "InfLLM: Unveiling the Intrinsic Capacity of LLMs for Understanding Extremely Long Sequences with Training-Free Memory"

Python 255 21 Updated Apr 20, 2024
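
A toy sketch of the training-free long-context recipe InfLLM describes: distant context is stored as blocks with one representative vector each, and at each step the query attends to a small local window plus the few memory blocks whose representatives score highest. Block size, window size, and the mean-pooled representatives are illustrative assumptions.

```python
import numpy as np

def select_context(q, k, block=64, window=256, topk=2):
    """Return indices of the keys the current query will actually attend to."""
    n = k.shape[0]
    local = np.arange(max(0, n - window), n)               # recent tokens: always kept
    far = k[: max(0, n - window)]                          # evicted "memory" region
    if len(far) == 0:
        return local
    nb = len(far) // block
    reps = far[: nb * block].reshape(nb, block, -1).mean(axis=1)  # block representatives
    best = np.argsort(-(reps @ q))[:topk]                  # most relevant memory blocks
    mem = np.concatenate([np.arange(b * block, (b + 1) * block) for b in best])
    return np.concatenate([mem, local])

k = np.random.randn(4096, 64)
q = np.random.randn(64)
idx = select_context(q, k)
print(len(idx), "of", len(k), "keys attended")             # e.g. 384 of 4096
```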

FlashInfer: Kernel Library for LLM Serving

Cuda 971 88 Updated Aug 14, 2024

Simple and efficient pytorch-native transformer text generation in <1000 LOC of python.

Python 5,449 501 Updated Aug 1, 2024
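
That repo is a full model in under 1000 lines; the toy loop below only shows the pattern such code is built around: feed the prompt once, then decode one token at a time while attention layers reuse a growing KV cache instead of reprocessing the prefix. The stub "model" here is random and exists only to make the loop runnable; it is not the repo's code.

```python
import numpy as np

rng = np.random.default_rng(0)
VOCAB, D = 100, 16
emb = rng.normal(size=(VOCAB, D))          # toy embedding table
w_out = rng.normal(size=(D, VOCAB))        # toy output head

def step(token, kv_cache):
    """One decode step: attend over all cached states, return next-token logits."""
    h = emb[token]
    kv_cache.append(h)                      # grow the cache; this toy merges K and V
    ctx = np.stack(kv_cache)                # (t, D)
    attn = np.exp(ctx @ h); attn /= attn.sum()
    return (attn @ ctx) @ w_out

prompt, cache = [1, 5, 7], []
for t in prompt:                            # "prefill" the cache, token by token here
    logits = step(t, cache)
out = prompt + [int(np.argmax(logits))]     # greedy decoding
for _ in range(5):
    logits = step(out[-1], cache)           # each step feeds only the newest token
    out.append(int(np.argmax(logits)))
print(out)
```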

SGLang is yet another fast serving framework for large language models and vision language models.

Python 4,114 261 Updated Aug 14, 2024

📖 A curated list of awesome LLM inference papers with code: TensorRT-LLM, vLLM, streaming-llm, AWQ, SmoothQuant, WINT8/4, continuous batching, FlashAttention, PagedAttention, etc.

2,230 146 Updated Aug 12, 2024
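
Several techniques in that list (PagedAttention in particular) revolve around one data structure: a block table mapping each sequence's logical token positions to fixed-size physical KV-cache blocks, so memory is allocated on demand and freed without fragmentation. This is a hedged toy allocator, not vLLM's implementation.

```python
class PagedKVCache:
    def __init__(self, num_blocks, block_size=16):
        self.block_size = block_size
        self.free = list(range(num_blocks))   # pool of physical block ids
        self.tables = {}                      # seq_id -> list of physical block ids
        self.lengths = {}                     # seq_id -> tokens written so far

    def append_token(self, seq_id):
        """Reserve a cache slot for one new token; returns (block id, offset)."""
        n = self.lengths.get(seq_id, 0)
        if n % self.block_size == 0:          # current block full: grab a fresh one
            self.tables.setdefault(seq_id, []).append(self.free.pop())
        self.lengths[seq_id] = n + 1
        return self.tables[seq_id][n // self.block_size], n % self.block_size

    def release(self, seq_id):
        """Sequence finished: recycle its blocks for other requests."""
        self.free.extend(self.tables.pop(seq_id, []))
        self.lengths.pop(seq_id, None)

cache = PagedKVCache(num_blocks=8)
for _ in range(20):
    cache.append_token("req-1")
print(cache.tables["req-1"])                  # two physical blocks cover 20 tokens
cache.release("req-1")
```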

Efficient AI Inference & Serving

Python 450 25 Updated Jan 8, 2024

Experiments on speculative sampling with Llama models

Python 112 6 Updated Jun 8, 2023
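
A minimal sketch of the speculative sampling accept/reject rule such experiments implement: a cheap draft model proposes a token x, the target model scores it, and the proposal is kept with probability min(1, p_target(x) / p_draft(x)); on rejection, a token is resampled from the normalized residual max(p_target - p_draft, 0). This rule leaves the target distribution exactly intact. The toy distributions below stand in for real model outputs.

```python
import numpy as np

rng = np.random.default_rng(0)

def spec_step(p_target, p_draft):
    x = rng.choice(len(p_draft), p=p_draft)               # draft model proposes
    if rng.random() < min(1.0, p_target[x] / p_draft[x]):
        return x, True                                     # accepted: a "free" token
    resid = np.maximum(p_target - p_draft, 0)              # rejected: resample residual
    resid /= resid.sum()
    return rng.choice(len(resid), p=resid), False

p_t = np.array([0.6, 0.3, 0.1])                            # target model probabilities
p_d = np.array([0.3, 0.5, 0.2])                            # draft model probabilities
samples = [spec_step(p_t, p_d)[0] for _ in range(100_000)]
print(np.bincount(samples) / len(samples))                 # ≈ [0.6, 0.3, 0.1]
```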

High-speed Large Language Model Serving on PCs with Consumer-grade GPUs

C++ 7,771 399 Updated Jul 15, 2024

Best inference performance optimization framework for HuggingFace Diffusers on NVIDIA GPUs.

Python 1,107 66 Updated Jul 16, 2024

Optimized BERT transformer inference on NVIDIA GPUs. https://arxiv.org/abs/2210.03052

C++ 447 34 Updated Mar 15, 2024

An open platform for training, serving, and evaluating large language models. Release repo for Vicuna and Chatbot Arena.

Python 36,130 4,439 Updated Aug 14, 2024

Retrieval and Retrieval-augmented LLMs

Python 6,450 459 Updated Aug 10, 2024

⚡ Build your chatbot within minutes on your favorite device; offers SOTA compression techniques for LLMs; runs LLMs efficiently on Intel platforms ⚡

Python 2,089 205 Updated Aug 14, 2024

Making large AI models cheaper, faster and more accessible

Python 38,479 4,320 Updated Aug 14, 2024

DeepSpeed is a deep learning optimization library that makes distributed training and inference easy, efficient, and effective.

Python 34,312 4,009 Updated Aug 14, 2024

MII makes low-latency and high-throughput inference possible, powered by DeepSpeed.

Python 1,816 172 Updated Aug 14, 2024

FlexFlow Serve: Low-Latency, High-Performance LLM Serving

C++ 1,618 221 Updated Aug 14, 2024

LightLLM is a Python-based LLM (Large Language Model) inference and serving framework, notable for its lightweight design, easy scalability, and high-speed performance.

Python 2,197 186 Updated Aug 12, 2024