BUAA
Stars
AniZpZ / smoothquant
Forked from mit-han-lab/smoothquant
[ICML 2023] SmoothQuant: Accurate and Efficient Post-Training Quantization for Large Language Models
RTP-LLM: Alibaba's high-performance LLM inference engine for diverse applications.
Development repository for the Triton language and compiler
High-speed Large Language Model Serving on PCs with Consumer-grade GPUs
TensorRT-LLM provides users with an easy-to-use Python API to define Large Language Models (LLMs) and build TensorRT engines that contain state-of-the-art optimizations to perform inference efficiently.
This is a series of GPU optimization topics. Here we will introduce how to optimize the CUDA kernel in detail. I will introduce several basic kernel optimizations, including: elementwise, reduce, s…
Fast inference from large language models via speculative decoding
📖A curated list of Awesome LLM Inference Paper with codes, TensorRT-LLM, vLLM, streaming-llm, AWQ, SmoothQuant, WINT8/4, Continuous Batching, FlashAttention, PagedAttention etc.
A model compression and acceleration toolbox based on pytorch.
how to optimize some algorithm in cuda.
LightLLM is a Python-based LLM (Large Language Model) inference and serving framework, notable for its lightweight design, easy scalability, and high-speed performance.
Fast and memory-efficient exact attention
Running large language models on a single GPU for throughput-oriented scenarios.
A high-throughput and memory-efficient inference and serving engine for LLMs
A large-scale 7B pretraining language model developed by BaiChuan-Inc.
NART ("NART is not A RunTime"), a deep learning inference framework.
Official repo for consistency models.
ChatGLM-6B: An Open Bilingual Dialogue Language Model | 开源双语对话语言模型
GLM-130B: An Open Bilingual Pre-Trained Model (ICLR 2023)
Making large AI models cheaper, faster and more accessible
optimized BERT transformer inference on NVIDIA GPU. https://arxiv.org/abs/2210.03052
FlagPerf is an open-source software platform for benchmarking AI chips.