Starred repositories
AutoAWQ implements the AWQ algorithm for 4-bit quantization with a 2x speedup during inference.
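
A minimal quantization sketch using AutoAWQ's documented Python API; the model name, output path, and quant settings below are illustrative, so check the project docs for the current interface.

```python
from awq import AutoAWQForCausalLM
from transformers import AutoTokenizer

model_path = "mistralai/Mistral-7B-Instruct-v0.2"  # illustrative model
quant_path = "mistral-7b-awq"                      # illustrative output dir

# Typical 4-bit AWQ settings: zero-point, group size 128, GEMM kernels.
quant_config = {"zero_point": True, "q_group_size": 128, "w_bit": 4, "version": "GEMM"}

model = AutoAWQForCausalLM.from_pretrained(model_path)
tokenizer = AutoTokenizer.from_pretrained(model_path, trust_remote_code=True)

model.quantize(tokenizer, quant_config=quant_config)  # runs AWQ calibration
model.save_quantized(quant_path)
tokenizer.save_pretrained(quant_path)
```
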
An easy-to-use LLM quantization package with user-friendly APIs, based on the GPTQ algorithm.
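
A short AutoGPTQ sketch in the style of the project's examples; the model and the single calibration sample are illustrative (real runs use many representative texts).

```python
from transformers import AutoTokenizer
from auto_gptq import AutoGPTQForCausalLM, BaseQuantizeConfig

model_name = "facebook/opt-125m"   # small illustrative model
tokenizer = AutoTokenizer.from_pretrained(model_name, use_fast=True)

# One toy calibration sample; real runs use many representative texts.
examples = [tokenizer("GPTQ quantizes weights one layer at a time.")]

quantize_config = BaseQuantizeConfig(bits=4, group_size=128, desc_act=False)
model = AutoGPTQForCausalLM.from_pretrained(model_name, quantize_config)
model.quantize(examples)                 # layer-by-layer GPTQ calibration
model.save_quantized("opt-125m-4bit")
```
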
[ICML 2023] SmoothQuant: Accurate and Efficient Post-Training Quantization for Large Language Models
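
A toy illustration of the SmoothQuant idea rather than the repo's API: a per-channel scale migrates activation outliers into the weights, making both easier to quantize while leaving the layer's output mathematically unchanged.

```python
import torch

def smooth(X, W, alpha=0.5):
    # X: (tokens, in_features) activations; W: (out_features, in_features).
    act_max = X.abs().amax(dim=0)              # per-channel activation range
    w_max = W.abs().amax(dim=0)                # per-channel weight range
    s = act_max.pow(alpha) / w_max.pow(1 - alpha)
    return X / s, W * s                        # output X @ W.T is preserved

X = torch.randn(16, 8) * torch.tensor([10.0] + [1.0] * 7)  # one outlier channel
W = torch.randn(4, 8)
X_s, W_s = smooth(X, W)
assert torch.allclose(X @ W.T, X_s @ W_s.T, atol=1e-3)     # same layer output
print(X.abs().amax(dim=0), X_s.abs().amax(dim=0))          # outlier flattened
```
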
SGLang is yet another fast serving framework for large language models and vision language models.
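
A sketch of querying SGLang through its OpenAI-compatible endpoint, assuming a server was launched separately; the model path and port are illustrative.

```python
# Server started separately, e.g.:
#   python -m sglang.launch_server --model-path meta-llama/Llama-3.1-8B-Instruct --port 30000
from openai import OpenAI

client = OpenAI(base_url="http://localhost:30000/v1", api_key="EMPTY")
resp = client.chat.completions.create(
    model="default",  # SGLang serves the launched model under this name
    messages=[{"role": "user", "content": "Explain KV caching in one sentence."}],
    max_tokens=64,
)
print(resp.choices[0].message.content)
```
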
A library for accelerating Transformer models on NVIDIA GPUs, including using 8-bit floating point (FP8) precision on Hopper and Ada GPUs, to provide better performance with lower memory utilization.
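
A minimal FP8 sketch using Transformer Engine's PyTorch API, along the lines of its quickstart; the layer sizes are illustrative and an FP8-capable (Hopper/Ada) GPU is assumed.

```python
import torch
import transformer_engine.pytorch as te
from transformer_engine.common import recipe

# FP8 with delayed scaling, as in the project's quickstart.
fp8_recipe = recipe.DelayedScaling(margin=0, fp8_format=recipe.Format.E4M3)

layer = te.Linear(768, 3072, bias=True)            # illustrative sizes
x = torch.randn(2048, 768, device="cuda")

with te.fp8_autocast(enabled=True, fp8_recipe=fp8_recipe):
    y = layer(x)                                   # GEMM runs in FP8
y.sum().backward()
```
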
Get up and running with Llama 3.1, Mistral, Gemma 2, and other large language models.
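
A sketch of calling a local Ollama server over its documented REST API, assuming the server and the llama3.1 model are already available locally.

```python
import requests

# /api/generate is Ollama's documented REST endpoint (default port 11434).
resp = requests.post(
    "http://localhost:11434/api/generate",
    json={"model": "llama3.1", "prompt": "Why is the sky blue?", "stream": False},
)
print(resp.json()["response"])
```
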
MindSpore online courses: Step into LLM
Sample code for my CUDA programming book
Samples for CUDA developers that demonstrate features in the CUDA Toolkit
LightSeq: A High Performance Library for Sequence Processing and Generation
Learn CUDA Programming, published by Packt
Material for cuda-mode lectures
Train an LLM from scratch on a single 24 GB GPU
MII makes low-latency and high-throughput inference possible, powered by DeepSpeed.
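
A minimal DeepSpeed-MII pipeline sketch in the style of the project's README; the model name and prompts are illustrative.

```python
import mii

pipe = mii.pipeline("mistralai/Mistral-7B-v0.1")    # illustrative model
response = pipe(["DeepSpeed is", "Seattle is"], max_new_tokens=64)
print(response)
```
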
[EMNLP 2023 Industry Track] A simple prompting approach that enables LLMs to run inference in batches.
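
A toy sketch of the batch-prompting idea: pack several questions into one indexed prompt, then split the model's answers back out. The template here is illustrative, not the paper's exact format.

```python
def build_batch_prompt(questions):
    # Index each question so the answers can be matched back up.
    lines = [f"Q[{i + 1}]: {q}" for i, q in enumerate(questions)]
    lines.append("Answer every question on its own line as A[i]: <answer>.")
    return "\n".join(lines)

def parse_batch_answers(text, n):
    answers = {}
    for line in text.splitlines():
        if line.startswith("A[") and "]:" in line:
            idx, ans = line[2:].split("]:", 1)
            answers[int(idx)] = ans.strip()
    return [answers.get(i + 1, "") for i in range(n)]

prompt = build_batch_prompt(["What is 2+2?", "Capital of France?"])
# `prompt` goes to the LLM in a single call; the reply is then split with
# parse_batch_answers(reply, 2).
print(prompt)
```
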
📰 Must-read papers and blogs on Speculative Decoding ⚡️
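
A toy greedy variant of the draft-then-verify loop these papers study; real implementations verify all draft tokens in a single batched target pass and accept via rejection sampling over probabilities, both simplified away here. `target_next` and `draft_next` are hypothetical stand-ins for model calls.

```python
def speculative_step(target_next, draft_next, prefix, k=4):
    """One draft-then-verify step; returns prefix plus accepted tokens."""
    # The cheap draft model proposes k tokens autoregressively.
    seq = list(prefix)
    draft = []
    for _ in range(k):
        t = draft_next(seq)
        draft.append(t)
        seq.append(t)
    # The target model verifies; keep the longest agreeing prefix. A real
    # system checks all k positions in one batched forward pass.
    seq = list(prefix)
    for t in draft:
        if target_next(seq) != t:
            break
        seq.append(t)
    seq.append(target_next(seq))  # the target always contributes one token
    return seq

# Demo: characters of a fixed sentence stand in for tokens; the draft
# "model" is wrong at every 5th position.
text = "the quick brown fox jumps over the lazy dog"
def char_at(seq):
    return text[len(seq)] if len(seq) < len(text) else "<eos>"
target_next = char_at
draft_next = lambda seq: "?" if len(seq) % 5 == 4 else char_at(seq)

out = []
while len(out) < len(text):
    out = speculative_step(target_next, draft_next, out)
print("".join(out[:len(text)]))   # reproduces the sentence
```
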
Sequence Parallel Attention for Long-Context LLM Training and Inference
SwiftSage: A Generative Agent with Fast and Slow Thinking for Complex Interactive Tasks
A fast and user-friendly runtime for Transformer inference (BERT, ALBERT, GPT-2, decoders, etc.) on CPU and GPU.
xDiT: A Scalable Inference Engine for Diffusion Transformers (DiTs) on multi-GPU Clusters
A one-click tool for the daily tasks of Arknights, supporting all clients.
Disaggregated serving system for Large Language Models (LLMs).
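
A toy sketch of the prefill/decode disaggregation idea: the compute-bound prefill phase and the memory-bandwidth-bound decode phase run in separate workers with an explicit KV-cache handoff. `StubModel` and its methods are hypothetical stand-ins; real systems move the cache between GPUs or nodes.

```python
class StubModel:
    """Hypothetical stand-in for an LLM with explicit prefill/decode phases."""
    def prefill(self, prompt_tokens):
        return list(prompt_tokens)          # "KV cache" = tokens seen so far
    def decode_step(self, kv_cache):
        token = len(kv_cache)               # dummy next-token rule
        return token, kv_cache + [token]

def prefill_worker(model, prompt_tokens):
    # Compute-bound: one big pass over the whole prompt.
    return model.prefill(prompt_tokens)

def decode_worker(model, kv_cache, max_new_tokens):
    # Memory-bandwidth-bound: one token per step, reusing the cache.
    out = []
    for _ in range(max_new_tokens):
        token, kv_cache = model.decode_step(kv_cache)
        out.append(token)
    return out

model = StubModel()
cache = prefill_worker(model, [101, 102, 103])   # would run on a prefill GPU
print(decode_worker(model, cache, 4))            # would run on a decode GPU
```
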
Run any open-source LLM, such as Llama 3.1 or Gemma, as an OpenAI-compatible API endpoint in the cloud.
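
Since the deployment exposes an OpenAI-compatible endpoint, the standard OpenAI client can talk to it; the base URL and model name below are illustrative.

```python
from openai import OpenAI

# OpenLLM's default local port is 3000; base URL and model are illustrative.
client = OpenAI(base_url="http://localhost:3000/v1", api_key="na")
resp = client.chat.completions.create(
    model="meta-llama/Llama-3.1-8B-Instruct",
    messages=[{"role": "user", "content": "Summarize AWQ in one sentence."}],
)
print(resp.choices[0].message.content)
```
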