Stars
A set of hands-on tutorials for CUDA programming
A simple, high-performance CUDA GEMM implementation.
🤗 Transformers: State-of-the-art Machine Learning for PyTorch, TensorFlow, and JAX.
LightRAG: Simple and Fast Retrieval-Augmented Generation
BitBLAS is a library to support mixed-precision matrix multiplications, especially for quantized LLM deployment.
DuoAttention: Efficient Long-Context LLM Inference with Retrieval and Streaming Heads
Translate PDFs, EPUBs, webpages, metadata, annotations, and notes into the target language. Supports 20+ translation services.
VPTQ: a flexible, extreme low-bit quantization algorithm
The official Python client for the Hugging Face Hub.
Quantized attention that achieves speedups of 2.1x and 2.7x over FlashAttention2 and xformers, respectively, without degrading end-to-end metrics across various models.
🙃 A delightful community-driven (with 2,400+ contributors) framework for managing your zsh configuration. Includes 300+ optional plugins (rails, git, macOS, hub, docker, homebrew, node, php, python…
DeepSpeed is a deep learning optimization library that makes distributed training and inference easy, efficient, and effective.
🚀 Collection of components for development, training, tuning, and inference of foundation models leveraging PyTorch native components.
Several optimization methods for half-precision general matrix multiplication (HGEMM) using Tensor Cores via the WMMA API and MMA PTX instructions.
A fast inference library for running LLMs locally on modern consumer-class GPUs
📖 A curated list of awesome LLM inference papers with code: TensorRT-LLM, vLLM, streaming-llm, AWQ, SmoothQuant, WINT8/4, continuous batching, FlashAttention, PagedAttention, etc.
Sample code for my CUDA programming book
QUICK: Quantization-aware Interleaving and Conflict-free Kernel for efficient LLM inference
Development repository for the Triton language and compiler
Several optimization methods for half-precision general matrix-vector multiplication (HGEMV) using CUDA cores.
Empowering RAG with a memory-based data interface for all-purpose applications!
A modular graph-based Retrieval-Augmented Generation (RAG) system
Official release of InternLM2.5 base and chat models, with 1M context support.
Arena-Hard-Auto: An automatic LLM benchmark.
Code accompanying our publications on compression methods for transformers.
A collection of memory efficient attention operators implemented in the Triton language.