Stars
To speed up long-context LLM inference, attention is computed approximately with dynamic sparsity, which reduces pre-filling latency by up to 10x on an A100 while maintaining accuracy.
A high-performance inference system for large language models, designed for production environments.
Material for cuda-mode lectures
Development repository for the Triton-Linalg conversion
This is the official PyTorch implementation of "LLMC: Benchmarking Large Language Model Quantization with a Versatile Compression Toolkit".
Sequence Parallel Attention for Long Context LLM Model Training and Inference
QServe: W4A8KV4 Quantization and System Co-design for Efficient LLM Serving
KVQuant: Towards 10 Million Context Length LLM Inference with KV Cache Quantization
AutoAWQ implements the AWQ algorithm for 4-bit quantization with a 2x speedup during inference (a generic group-wise INT4 sketch follows this list).
The code of our paper "InfLLM: Unveiling the Intrinsic Capacity of LLMs for Understanding Extremely Long Sequences with Training-Free Memory"
FlashInfer: Kernel Library for LLM Serving
Simple and efficient pytorch-native transformer text generation in <1000 LOC of python.
SGLang is yet another fast serving framework for large language models and vision language models.
📖 A curated list of awesome LLM inference papers with code: TensorRT-LLM, vLLM, streaming-llm, AWQ, SmoothQuant, WINT8/4, Continuous Batching, FlashAttention, PagedAttention, etc.
Experiments on speculative sampling with Llama models (see the speculative-sampling sketch after this list).
High-speed Large Language Model Serving on PCs with Consumer-grade GPUs
A high-performance inference optimization framework for HuggingFace Diffusers on NVIDIA GPUs.
Optimized BERT transformer inference on NVIDIA GPUs. https://arxiv.org/abs/2210.03052
An open platform for training, serving, and evaluating large language models. Release repo for Vicuna and Chatbot Arena.
Retrieval and Retrieval-augmented LLMs
⚡ Build your chatbot within minutes on your favorite device; offers SOTA compression techniques for LLMs; runs LLMs efficiently on Intel platforms ⚡
Making large AI models cheaper, faster and more accessible
DeepSpeed is a deep learning optimization library that makes distributed training and inference easy, efficient, and effective.
MII makes low-latency and high-throughput inference possible, powered by DeepSpeed.
FlexFlow Serve: Low-Latency, High-Performance LLM Serving
LightLLM is a Python-based LLM (Large Language Model) inference and serving framework, notable for its lightweight design, easy scalability, and high-speed performance.
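Several of the starred projects above (AutoAWQ, QServe, KVQuant) revolve around low-bit weight or KV-cache quantization. The snippet below is a minimal NumPy sketch of generic group-wise asymmetric INT4 weight quantization, the basic weight-only scheme such methods build on; it deliberately omits AWQ's activation-aware scaling, and the function names, shapes, and default group size are illustrative assumptions, not any repository's actual API.

```python
import numpy as np

def quantize_4bit_groupwise(w, group_size=128):
    """Asymmetric 4-bit group-wise quantization of a weight matrix (rows = output channels).

    Each contiguous group of `group_size` weights along the input dimension shares one
    scale and one zero-point, as in typical weight-only INT4 schemes (sketch only).
    """
    out_dim, in_dim = w.shape
    assert in_dim % group_size == 0
    g = w.reshape(out_dim, in_dim // group_size, group_size)
    w_min = g.min(axis=-1, keepdims=True)
    w_max = g.max(axis=-1, keepdims=True)
    scale = (w_max - w_min) / 15.0                       # 4 bits -> 16 levels (0..15)
    scale = np.where(scale == 0, 1e-8, scale)            # avoid division by zero for flat groups
    zero = np.round(-w_min / scale)                      # per-group zero-point
    q = np.clip(np.round(g / scale) + zero, 0, 15).astype(np.uint8)
    return q, scale, zero

def dequantize_4bit_groupwise(q, scale, zero):
    """Reconstruct an approximate float weight matrix from the 4-bit codes."""
    g = (q.astype(np.float32) - zero) * scale
    return g.reshape(g.shape[0], -1)
```

Real systems additionally pack two 4-bit codes per byte and fuse dequantization into the matmul kernel; the sketch keeps the codes unpacked for clarity.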
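For the speculative-sampling entry above, the following is a minimal NumPy sketch of the standard accept/resample rule (draft tokens are accepted with probability min(1, p_target/p_draft); on rejection, a token is drawn from the renormalized residual max(0, p_target - p_draft)). It is not code from that repository; the function name, array shapes, and toy interface are assumptions for illustration.

```python
import numpy as np

def speculative_sample(target_probs, draft_probs, draft_tokens, rng):
    """One round of speculative sampling over K draft tokens (sketch).

    target_probs: (K+1, V) target-model distributions, one per draft position
                  plus one extra position for the bonus token.
    draft_probs:  (K, V) draft-model distributions used to propose draft_tokens.
    draft_tokens: (K,) tokens sampled from the draft model.
    Returns the accepted tokens; at least one token is always emitted.
    """
    accepted = []
    K = len(draft_tokens)
    for i in range(K):
        tok = int(draft_tokens[i])
        p, q = target_probs[i, tok], draft_probs[i, tok]
        if rng.random() < min(1.0, p / max(q, 1e-12)):
            accepted.append(tok)                          # accept the draft token
            continue
        # Reject: resample from the residual distribution max(0, p_target - p_draft).
        residual = np.clip(target_probs[i] - draft_probs[i], 0.0, None)
        total = residual.sum()
        probs = residual / total if total > 0 else target_probs[i]
        accepted.append(int(rng.choice(len(probs), p=probs)))
        return accepted
    # All K draft tokens accepted: sample one bonus token from the target model.
    accepted.append(int(rng.choice(target_probs.shape[1], p=target_probs[K])))
    return accepted

# Usage sketch: rng = np.random.default_rng(0); distributions would come from real models.
```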