
Starred repositories


AutoAWQ implements the AWQ algorithm for 4-bit quantization with a 2x speedup during inference. Documentation:

Python 1,510 171 Updated Jul 31, 2024
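The core idea behind 4-bit weight quantizers like AutoAWQ can be sketched in a few lines. This is an illustrative toy (function names are invented, not AutoAWQ's API): weights are split into small groups, and each group is symmetrically rounded to signed 4-bit integers with one floating-point scale per group.

```python
# Toy sketch of group-wise symmetric 4-bit quantization (illustrative only,
# not AutoAWQ's actual implementation).

def quantize_group(weights, n_bits=4):
    """Quantize one group of weights to signed n-bit integers plus a scale."""
    qmax = 2 ** (n_bits - 1) - 1  # 7 for 4-bit
    scale = max(abs(w) for w in weights) / qmax or 1.0
    q = [max(-qmax - 1, min(qmax, round(w / scale))) for w in weights]
    return q, scale

def dequantize_group(q, scale):
    """Recover approximate float weights from the quantized group."""
    return [v * scale for v in q]

w = [0.12, -0.7, 0.33, 0.05]
q, s = quantize_group(w)
w_hat = dequantize_group(q, s)  # each error is at most about scale / 2
```

AWQ's contribution on top of this baseline is choosing per-channel scales from activation statistics so that the weights that matter most for accuracy lose the least precision.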

An easy-to-use LLMs quantization package with user-friendly apis, based on GPTQ algorithm.

Python 4,177 435 Updated Jul 26, 2024

[ICML 2023] SmoothQuant: Accurate and Efficient Post-Training Quantization for Large Language Models

Python 1,126 130 Updated Jul 12, 2024
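The scale-migration trick at the heart of SmoothQuant can be sketched as follows. This is a toy version (names and the alpha=0.5 choice are illustrative, not the paper's code): per-channel scales divide the activations and multiply the corresponding weight rows, leaving the matmul result mathematically unchanged while shrinking activation outliers.

```python
# Toy sketch of SmoothQuant-style scale migration: X @ W == (X / s) @ (s * W),
# with s chosen per input channel to balance activation and weight ranges.

def smooth(X, W, alpha=0.5):
    """X: activations [n][c]; W: weights [c][m]. Returns (X / s, s * W)."""
    c = len(W)
    s = []
    for j in range(c):
        a_max = max(abs(X[i][j]) for i in range(len(X))) or 1.0
        w_max = max(abs(v) for v in W[j]) or 1.0
        s.append(a_max ** alpha / w_max ** (1 - alpha))
    Xs = [[X[i][j] / s[j] for j in range(c)] for i in range(len(X))]
    Ws = [[v * s[j] for v in W[j]] for j in range(c)]
    return Xs, Ws

def matmul(A, B):
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)]
            for row in A]
```

After smoothing, both the rescaled activations and weights have tamer per-channel ranges, which is what makes plain 8-bit quantization of both sides viable.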

SGLang is yet another fast serving framework for large language models and vision language models.

Python 3,578 220 Updated Jul 31, 2024

A library for accelerating Transformer models on NVIDIA GPUs, including using 8-bit floating point (FP8) precision on Hopper and Ada GPUs, to provide better performance with lower memory utilizatio…

Python 1,690 273 Updated Jul 31, 2024

Python bindings for llama.cpp

Python 7,332 879 Updated Jul 31, 2024

Llama 2 inference

C 32 5 Updated Nov 4, 2023

Get up and running with Llama 3.1, Mistral, Gemma 2, and other large language models.

Go 82,405 6,294 Updated Jul 31, 2024

MindSpore online courses: Step into LLM

Jupyter Notebook 388 82 Updated Jun 14, 2024

Code for the book 《CUDA编程基础与实践》 (CUDA Programming: Basics and Practice)

Cuda 72 18 Updated Apr 28, 2022

Sample codes for my CUDA programming book

Cuda 1,476 311 Updated Jul 27, 2023

A hands-on implementation of an LLM inference framework

C++ 96 16 Updated Jul 30, 2024

Samples for CUDA Developers which demonstrates features in CUDA Toolkit

C 5,863 1,717 Updated Jul 26, 2024

LightSeq: A High Performance Library for Sequence Processing and Generation

C++ 3,142 324 Updated May 16, 2023

Learn CUDA Programming, published by Packt

Cuda 959 224 Updated Dec 30, 2023

Material for cuda-mode lectures

Jupyter Notebook 2,001 196 Updated Jun 13, 2024

Train an LLM from scratch on a single 24 GB GPU

Python 43 6 Updated Jun 27, 2024

LLM inference in C/C++

C++ 62,778 9,008 Updated Jul 31, 2024

MII makes low-latency and high-throughput inference possible, powered by DeepSpeed.

Python 1,792 168 Updated Jul 30, 2024

[EMNLP 2023 Industry Track] A simple prompting approach that enables the LLMs to run inference in batches.

Python 63 5 Updated Mar 8, 2024
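The batch-prompting idea named above can be sketched without any model at all: pack several questions into one prompt under a numbered convention, then split the single completion back into per-question answers. This is a rough illustrative sketch (function names and the Q[i]/A[i] convention are invented here, not the paper's exact format).

```python
# Toy sketch of batch prompting: one prompt carries many questions, and the
# model is asked to answer them under a parseable A[i]: convention.

def build_batch_prompt(questions):
    """Pack questions into a single numbered prompt."""
    lines = [f"Q[{i + 1}]: {q}" for i, q in enumerate(questions)]
    lines.append("Answer each question as A[i]: on its own line.")
    return "\n".join(lines)

def split_batch_answer(completion, n):
    """Split one completion back into n per-question answers."""
    answers = [""] * n
    for line in completion.splitlines():
        if line.startswith("A[") and "]:" in line:
            idx = int(line[2:line.index("]")]) - 1
            if 0 <= idx < n:
                answers[idx] = line.split("]:", 1)[1].strip()
    return answers
```

The payoff is amortization: the shared instructions and few-shot examples are encoded once per batch instead of once per question.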

📰 Must-read papers and blogs on Speculative Decoding ⚡️

285 12 Updated Jul 30, 2024
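The accept/reject rule at the core of speculative decoding fits in a few lines. The sketch below is illustrative (not from any listed repo): a cheap draft model proposes a token from its distribution q, and the target model's distribution p accepts it with probability min(1, p/q); on rejection, a replacement is drawn from the residual max(0, p − q), which keeps the overall sampling distribution exactly p.

```python
# Toy sketch of the speculative-decoding acceptance rule over dict-based
# token distributions (token -> probability).

import random

def accept_or_resample(draft_token, p, q, rng=random.random):
    """Accept the draft token with prob min(1, p/q); else resample residual."""
    if rng() < min(1.0, p.get(draft_token, 0.0) / q[draft_token]):
        return draft_token
    # Rejected: sample from the normalized residual distribution max(0, p - q).
    residual = {t: max(0.0, p.get(t, 0.0) - q.get(t, 0.0)) for t in p}
    total = sum(residual.values())
    r = rng() * total
    for t, w in residual.items():
        r -= w
        if r <= 0:
            return t
    return draft_token
```

In a real system the draft model proposes several tokens at once and the target model verifies them in a single forward pass, which is where the speedup comes from.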

Sequence Parallel Attention for Long Context LLM Model Training and Inference

Python 241 9 Updated Jun 27, 2024

SwiftSage: A Generative Agent with Fast and Slow Thinking for Complex Interactive Tasks

Python 231 21 Updated Aug 7, 2023

A fast and user-friendly runtime for transformer inference (BERT, ALBERT, GPT-2, decoders, etc.) on CPU and GPU.

C++ 1,458 194 Updated Jun 12, 2023

xDiT: A Scalable Inference Engine for Diffusion Transformers (DiTs) on multi-GPU Clusters

Python 135 14 Updated Jul 30, 2024

A one-click assistant for the daily tasks of Arknights (《明日方舟》), supporting all clients.

C++ 12,887 1,711 Updated Jul 31, 2024

Disaggregated serving system for Large Language Models (LLMs).

Jupyter Notebook 211 17 Updated Jun 14, 2024

Run any open-source LLMs, such as Llama 3.1, Gemma, as OpenAI compatible API endpoint in the cloud.

Python 9,466 605 Updated Jul 30, 2024

LLM training in simple, raw C/CUDA

Cuda 22,392 2,484 Updated Jul 30, 2024