
A set of hands-on tutorials for CUDA programming

Cuda 188 32 Updated Apr 8, 2024

A simple high performance CUDA GEMM implementation.

Cuda 331 36 Updated Jan 4, 2024

🤗 Transformers: State-of-the-art Machine Learning for PyTorch, TensorFlow, and JAX.

Python 133,640 26,697 Updated Oct 18, 2024

"LightRAG: Simple and Fast Retrieval-Augmented Generation"

Python 4,715 467 Updated Oct 19, 2024

BitBLAS is a library to support mixed-precision matrix multiplications, especially for quantized LLM deployment.

Python 376 30 Updated Oct 17, 2024

DuoAttention: Efficient Long-Context LLM Inference with Retrieval and Streaming Heads

Python 203 6 Updated Oct 17, 2024

Shell 22 2 Updated Mar 28, 2024

Translate PDF, EPub, webpage, metadata, annotations, notes to the target language. Support 20+ translate services.

TypeScript 7,257 344 Updated Oct 16, 2024

VPTQ, A Flexible and Extreme low-bit quantization algorithm

Python 406 24 Updated Oct 19, 2024

The official Python client for the Huggingface Hub.

Python 2,052 539 Updated Oct 18, 2024

Quantized Attention that achieves speedups of 2.1x and 2.7x compared to FlashAttention2 and xformers, respectively, without losing end-to-end metrics across various models.

Python 199 6 Updated Oct 11, 2024

A multi-level tensor algebra superoptimizer

C++ 538 28 Updated Oct 19, 2024

🙃 A delightful community-driven (with 2,400+ contributors) framework for managing your zsh configuration. Includes 300+ optional plugins (rails, git, macOS, hub, docker, homebrew, node, php, python…

Shell 173,318 25,880 Updated Oct 18, 2024

DeepSpeed is a deep learning optimization library that makes distributed training and inference easy, efficient, and effective.

Python 35,143 4,068 Updated Oct 19, 2024

🚀 Collection of components for development, training, tuning, and inference of foundation models leveraging PyTorch native components.

Python 156 48 Updated Oct 16, 2024

Several optimization methods of half-precision general matrix multiplication (HGEMM) using tensor core with WMMA API and MMA PTX instruction.

Cuda 279 65 Updated Sep 8, 2024

CUDA Templates for Linear Algebra Subroutines

C++ 5,521 941 Updated Oct 18, 2024

A fast inference library for running LLMs locally on modern consumer-class GPUs

Python 3,586 278 Updated Oct 15, 2024

📖 A curated list of Awesome LLM Inference papers with code: TensorRT-LLM, vLLM, streaming-llm, AWQ, SmoothQuant, WINT8/4, Continuous Batching, FlashAttention, PagedAttention, etc.

2,675 183 Updated Oct 15, 2024

Sample codes for my CUDA programming book

Cuda 1,549 320 Updated Jul 27, 2023

QUICK: Quantization-aware Interleaving and Conflict-free Kernel for efficient LLM inference

Python 109 6 Updated Mar 6, 2024

Python 525 42 Updated Jan 16, 2024

Development repository for the Triton language and compiler

C++ 13,098 1,603 Updated Oct 20, 2024

Several optimization methods of half-precision general matrix vector multiplication (HGEMV) using CUDA core.

Cuda 47 4 Updated Sep 8, 2024

Empowering RAG with a memory-based data interface for all-purpose applications!

Python 1,092 66 Updated Sep 29, 2024

A modular graph-based Retrieval-Augmented Generation (RAG) system

Python 18,209 1,765 Updated Oct 20, 2024

Official release of InternLM2.5 base and chat models. 1M context support

Python 6,322 444 Updated Oct 10, 2024

Arena-Hard-Auto: An automatic LLM benchmark.

Jupyter Notebook 582 71 Updated Oct 15, 2024

For releasing code related to compression methods for transformers, accompanying our publications

Python 360 35 Updated Oct 11, 2024

A collection of memory efficient attention operators implemented in the Triton language.

Python 211 13 Updated Jun 5, 2024