Stars
A set of hands-on tutorials for CUDA programming
A simple, high-performance CUDA GEMM implementation.
🤗 Transformers: State-of-the-art Machine Learning for PyTorch, TensorFlow, and JAX.
LightRAG: Simple and Fast Retrieval-Augmented Generation
BitBLAS is a library to support mixed-precision matrix multiplications, especially for quantized LLM deployment.
DuoAttention: Efficient Long-Context LLM Inference with Retrieval and Streaming Heads
Translate PDFs, EPUBs, webpages, metadata, annotations, and notes into the target language. Supports 20+ translation services.
VPTQ: a flexible, extreme low-bit quantization algorithm
The official Python client for the Hugging Face Hub.
Quantized attention that achieves speedups of 2.1x and 2.7x over FlashAttention2 and xformers, respectively, without degrading end-to-end metrics across various models.
🙃 A delightful community-driven (with 2,400+ contributors) framework for managing your zsh configuration. Includes 300+ optional plugins (rails, git, macOS, hub, docker, homebrew, node, php, python…
DeepSpeed is a deep learning optimization library that makes distributed training and inference easy, efficient, and effective.
🚀 Collection of components for development, training, tuning, and inference of foundation models leveraging PyTorch native components.
Several optimization methods for half-precision general matrix multiplication (HGEMM) using Tensor Cores via the WMMA API and MMA PTX instructions.
A fast inference library for running LLMs locally on modern consumer-class GPUs
📖 A curated list of awesome LLM inference papers with code: TensorRT-LLM, vLLM, streaming-llm, AWQ, SmoothQuant, WINT8/4, continuous batching, FlashAttention, PagedAttention, etc.
Sample code for my CUDA programming book
QUICK: Quantization-aware Interleaving and Conflict-free Kernel for efficient LLM inference
Development repository for the Triton language and compiler
Several optimization methods for half-precision general matrix-vector multiplication (HGEMV) using CUDA cores.
Empowering RAG with a memory-based data interface for all-purpose applications!
A modular graph-based Retrieval-Augmented Generation (RAG) system
Official release of InternLM2.5 base and chat models, with 1M context support.
Arena-Hard-Auto: An automatic LLM benchmark.
Code accompanying our publications on compression methods for transformers.
A collection of memory efficient attention operators implemented in the Triton language.