Stars
Mooncake is the serving platform for Kimi, a leading LLM service provided by Moonshot AI.
FlagGems is an operator library for large language models implemented in Triton Language.
Several optimization methods of half-precision general matrix multiplication (HGEMM) using tensor core with WMMA API and MMA PTX instruction.
BitBLAS is a library to support mixed-precision matrix multiplications, especially for quantized LLM deployment.
4 bits quantization of LLaMA using GPTQ
SGLang is yet another fast serving framework for large language models and vision language models.
This project is the official implementation of "Basic Binary Convolution Unit for Binarized Image Restoration Network" (ICLR 2023)
🎉 CUDA/C++ notes / hand-written CUDA kernels for large models / tech blog, updated occasionally: flash_attn, sgemm, sgemv, warp reduce, block reduce, dot product, elementwise, softmax, layernorm, rmsnorm, hist, etc.
Next-generation TTS model using flow-matching and DiT, inspired by Stable Diffusion 3
A curated list for Efficient Large Language Models
Chinese LLM capability leaderboard: currently covers 106 large models, including commercial models such as ChatGPT, GPT-4o, Baidu ERNIE Bot (Wenxin Yiyan), Alibaba Tongyi Qianwen, iFLYTEK Spark, SenseTime SenseChat, and MiniMax, as well as open-source models such as Baichuan, Qwen2, GLM-4, Yi, InternLM2, and Llama 3, evaluated across multiple capability dimensions. Provides not only capability-score rankings but also the raw outputs of every model!
Large Language Model (LLM) Systems Paper List
Transparent Image Layer Diffusion using Latent Transparency
Awesome LLM compression research papers and tools.
📖A curated list of Awesome LLM Inference Paper with codes, TensorRT-LLM, vLLM, streaming-llm, AWQ, SmoothQuant, WINT8/4, Continuous Batching, FlashAttention, PagedAttention etc.
[MLSys'24] Atom: Low-bit Quantization for Efficient and Accurate LLM Serving
EmotiVoice 😊: a Multi-Voice and Prompt-Controlled TTS Engine
FlashInfer: Kernel Library for LLM Serving
InstantID: Zero-shot Identity-Preserving Generation in Seconds 🔥
GLake: optimizing GPU memory management and IO transmission.
Official codes of DiffBIR: Towards Blind Image Restoration with Generative Diffusion Prior
Collection of benchmarks to measure basic GPU capabilities.
Best inference performance optimization framework for HuggingFace Diffusers on NVIDIA GPUs.
[ECCV 2024] Official implementation of the paper "Grounding DINO: Marrying DINO with Grounded Pre-Training for Open-Set Object Detection"
TensorRT-LLM provides users with an easy-to-use Python API to define Large Language Models (LLMs) and build TensorRT engines that contain state-of-the-art optimizations to perform inference efficiently.
[NeurIPS'23] H2O: Heavy-Hitter Oracle for Efficient Generative Inference of Large Language Models.