
[ICML 2023] SmoothQuant: Accurate and Efficient Post-Training Quantization for Large Language Models

Python 1,126 130 Updated Jul 12, 2024
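SmoothQuant's core idea is to migrate quantization difficulty from activation outliers into the weights via a per-channel scale, leaving the layer's output mathematically unchanged. A minimal NumPy sketch of that smoothing step (function name and the toy outlier channel are illustrative, not from the repo):

```python
import numpy as np

def smooth_scales(X, W, alpha=0.5):
    # Per-input-channel smoothing factor: s_j = max|X_j|^alpha / max|W_j|^(1-alpha)
    act_max = np.abs(X).max(axis=0)   # per-channel activation range
    w_max = np.abs(W).max(axis=1)     # per-input-channel weight range
    return (act_max ** alpha) / (w_max ** (1 - alpha))

rng = np.random.default_rng(0)
# Channel 2 carries synthetic outliers, mimicking the activation spikes
# that make naive INT8 activation quantization lossy.
X = rng.normal(size=(4, 8)) * np.array([1, 1, 50, 1, 1, 1, 1, 1])
W = rng.normal(size=(8, 16))

s = smooth_scales(X, W)
X_s, W_s = X / s, W * s[:, None]      # migrate difficulty into the weights

# The linear layer's output is unchanged: (X / s) @ (diag(s) W) == X @ W ...
assert np.allclose(X @ W, X_s @ W_s)
# ... but the activation outlier channel is now much tamer to quantize.
assert np.abs(X_s).max() < np.abs(X).max()
```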

[ICML 2023] SmoothQuant: Accurate and Efficient Post-Training Quantization for Large Language Models

Python 8 1 Updated Dec 13, 2023

RTP-LLM: Alibaba's high-performance LLM inference engine for diverse applications.

C++ 468 41 Updated Jul 27, 2024

Development repository for the Triton language and compiler

C++ 12,127 1,450 Updated Jul 31, 2024

Tensor library for machine learning

C++ 10,470 973 Updated Jul 31, 2024

High-speed Large Language Model Serving on PCs with Consumer-grade GPUs

C++ 7,711 395 Updated Jul 15, 2024

TensorRT-LLM provides users with an easy-to-use Python API to define Large Language Models (LLMs) and build TensorRT engines that contain state-of-the-art optimizations to perform inference efficie…

C++ 7,736 842 Updated Jul 30, 2024

This is a series of GPU optimization topics. Here we will introduce how to optimize the CUDA kernel in detail. I will introduce several basic kernel optimizations, including: elementwise, reduce, s…

Cuda 773 121 Updated Jul 29, 2023
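The reduce topic in that series centers on replacing a serial sum with a tree-shaped reduction whose stride doubles each step, which is what a CUDA block implements in shared memory. A plain-Python sketch of just the access pattern (not CUDA code, and not taken from the repo):

```python
def tree_reduce(vals):
    # Pairwise (tree) reduction: each pass combines elements `stride`
    # apart, halving the number of active positions -- the same schedule
    # a block-level CUDA reduce kernel runs over shared memory.
    vals = list(vals)
    n = len(vals)
    stride = 1
    while stride < n:
        for i in range(0, n, 2 * stride):
            if i + stride < n:
                vals[i] += vals[i + stride]
        stride *= 2
    return vals[0]

assert tree_reduce(range(10)) == sum(range(10))
```

On a GPU the inner loop is what the threads do in parallel, so the whole reduction takes O(log n) passes instead of n serial additions.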

Fast inference from large language models via speculative decoding

Python 441 47 Updated Jul 25, 2024
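Speculative decoding lets a small draft model propose several tokens that the large target model then verifies, keeping the longest correct prefix. A toy greedy sketch of the control flow, with deterministic stand-in "models" (all names here are illustrative; real implementations verify probabilistically):

```python
def speculative_decode(target, draft, prompt, k=4, max_new=8):
    # Greedy speculative decoding sketch: the cheap draft model proposes
    # k tokens; the target model checks them position by position (serially
    # here -- a real engine scores all k positions in one forward pass) and
    # keeps matches, substituting its own token at the first mismatch.
    out = list(prompt)
    while len(out) < len(prompt) + max_new:
        ctx, proposal = list(out), []
        for _ in range(k):
            t = draft(ctx)
            proposal.append(t)
            ctx.append(t)
        for t in proposal:
            correct = target(out)
            out.append(correct)   # accept the match or the target's fix
            if correct != t or len(out) >= len(prompt) + max_new:
                break
    return out

# Toy deterministic "models": the next token is the position index;
# the draft is only right at even positions.
target = lambda ctx: len(ctx)
draft = lambda ctx: len(ctx) if len(ctx) % 2 == 0 else -1

assert speculative_decode(target, draft, [0, 1], max_new=6) == [0, 1, 2, 3, 4, 5, 6, 7]
```

The speedup comes from the verification step: when the draft is often right, each target pass commits several tokens instead of one.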

📖A curated list of Awesome LLM Inference Paper with codes, TensorRT-LLM, vLLM, streaming-llm, AWQ, SmoothQuant, WINT8/4, Continuous Batching, FlashAttention, PagedAttention etc.

2,102 143 Updated Jul 31, 2024

A model compression and acceleration toolbox based on pytorch.

Python 324 40 Updated Jan 12, 2024

How to optimize some algorithms in CUDA.

Cuda 1,284 106 Updated Jul 29, 2024

211 3 Updated Aug 19, 2023

LightLLM is a Python-based LLM (Large Language Model) inference and serving framework, notable for its lightweight design, easy scalability, and high-speed performance.

Python 2,137 183 Updated Jul 31, 2024

Fast and memory-efficient exact attention

Python 12,674 1,132 Updated Jul 30, 2024
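FlashAttention's memory efficiency comes from processing K/V in tiles with an online softmax, so the full N x N score matrix is never materialized yet the result is exact. A NumPy sketch of that running-max/running-normalizer recurrence (a simplified illustration of the idea, not the repo's CUDA kernels):

```python
import numpy as np

def naive_attention(Q, K, V):
    # Reference: materializes the full score matrix.
    S = Q @ K.T
    P = np.exp(S - S.max(axis=1, keepdims=True))
    return (P / P.sum(axis=1, keepdims=True)) @ V

def tiled_attention(Q, K, V, block=4):
    # Online-softmax tiling: visit K/V one block at a time, carrying a
    # running row max `m` and normalizer `norm`, and rescaling the partial
    # output whenever the max improves.
    n = Q.shape[0]
    out = np.zeros((n, V.shape[1]))
    m = np.full(n, -np.inf)
    norm = np.zeros(n)
    for j in range(0, K.shape[0], block):
        S = Q @ K[j:j + block].T                  # scores for this tile only
        m_new = np.maximum(m, S.max(axis=1))
        scale = np.exp(m - m_new)                 # rescale old partial sums
        P = np.exp(S - m_new[:, None])
        norm = norm * scale + P.sum(axis=1)
        out = out * scale[:, None] + P @ V[j:j + block]
        m = m_new
    return out / norm[:, None]

rng = np.random.default_rng(1)
Q = rng.normal(size=(6, 3))
K = rng.normal(size=(10, 3))
V = rng.normal(size=(10, 3))

assert np.allclose(naive_attention(Q, K, V), tiled_attention(Q, K, V))
```

The real kernels fuse this loop into one pass over GPU SRAM tiles; the recurrence above is why they can stay exact while never storing S.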

Running large language models on a single GPU for throughput-oriented scenarios.

Python 9,096 531 Updated Jul 24, 2024

LLM inference in C/C++

C++ 62,791 9,010 Updated Jul 31, 2024

A high-throughput and memory-efficient inference and serving engine for LLMs

Python 23,938 3,442 Updated Jul 31, 2024

A large-scale 7B pretraining language model developed by BaiChuan-Inc.

Python 5,664 504 Updated Jul 18, 2024

NART ("NART is not A RunTime"), a deep learning inference framework.

Python 37 14 Updated Mar 2, 2023

Official repo for consistency models.

Python 6,041 410 Updated Mar 22, 2024

ChatGLM-6B: An Open Bilingual Dialogue Language Model | Open-Source Bilingual Dialogue Language Model

Python 40,151 5,164 Updated Jun 27, 2024

GLM (General Language Model)

Python 3,136 319 Updated Nov 3, 2023

GLM-130B: An Open Bilingual Pre-Trained Model (ICLR 2023)

Python 7,651 608 Updated Jul 25, 2023

Making large AI models cheaper, faster and more accessible

Python 38,421 4,314 Updated Jul 31, 2024

Simplify your ONNX model

C++ 3,708 378 Updated Jul 8, 2024

Optimized BERT transformer inference on NVIDIA GPUs. https://arxiv.org/abs/2210.03052

C++ 443 33 Updated Mar 15, 2024

FlagPerf is an open-source software platform for benchmarking AI chips.

Python 282 95 Updated Jul 31, 2024

TensorRT Plugin Autogen Tool

Python 365 42 Updated Apr 7, 2023