HPC💻
Unified Communication X (mailing list - https://elist.ornl.gov/mailman/listinfo/ucx-group)
FP16xINT4 LLM inference kernel that can achieve near-ideal ~4x speedups up to medium batch sizes of 16-32 tokens.
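A minimal NumPy sketch of the FP16xINT4 idea (an illustration, not Marlin's fused CUDA kernel; the packing scheme, the shared zero point, and the function names are assumptions). Weights sit in memory as packed 4-bit integers plus per-column FP16 scales and are dequantized on the fly; at small batch sizes the GEMM is memory-bound, so cutting weight traffic 4x versus FP16 is where the ~4x speedup comes from.

```python
import numpy as np

def pack_int4(w_q: np.ndarray) -> np.ndarray:
    """Pack quantized weights (uint8 values in [0, 15]) two per byte."""
    flat = w_q.reshape(-1)
    assert flat.size % 2 == 0
    return (flat[0::2] | (flat[1::2] << 4)).astype(np.uint8)

def unpack_int4(packed: np.ndarray, n: int) -> np.ndarray:
    """Inverse of pack_int4: recover n 4-bit values as uint8."""
    out = np.empty(n, dtype=np.uint8)
    out[0::2] = packed & 0x0F
    out[1::2] = packed >> 4
    return out

def int4_matmul(x, packed_w, scales, zero, k, n):
    """y = x @ dequant(W) for W of shape (k, n), dequantized on the fly."""
    w = unpack_int4(packed_w, k * n).reshape(k, n).astype(np.float16)
    w = (w - zero) * scales          # per-output-column scale, shared zero point
    return (x @ w).astype(np.float16)

# Toy usage: an 8x4 INT4 weight applied to a batch of two FP16 activations.
rng = np.random.default_rng(0)
k, n = 8, 4
w_q = rng.integers(0, 16, size=(k, n), dtype=np.uint8)
scales = (rng.random(n) * 0.1).astype(np.float16)
x = rng.random((2, k)).astype(np.float16)
y = int4_matmul(x, pack_int4(w_q), scales, np.float16(8), k, n)
```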
FlashInfer: Kernel Library for LLM Serving
High-speed Large Language Model Serving on PCs with Consumer-grade GPUs
Ongoing research on training transformer models at scale
TensorRT-LLM provides users with an easy-to-use Python API to define Large Language Models (LLMs) and build TensorRT engines that contain state-of-the-art optimizations to perform inference efficiently on NVIDIA GPUs.
Accelerate local LLM inference and finetuning (LLaMA, Mistral, ChatGLM, Qwen, Baichuan, Mixtral, Gemma, Phi, etc.) on Intel CPU and GPU (e.g., local PC with iGPU, discrete GPU such as Arc, Flex and Max).
Ring attention implementation with flash attention
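A toy, single-process sketch of the ring-attention pattern (assumptions: no causal mask, equally sized shards, and direct list indexing standing in for the send/recv between ring neighbors). Each worker keeps its query shard while K/V shards rotate around the ring, and per-block partials are merged with the same numerically stable online-softmax accumulator that flash attention uses, so no worker ever materializes the full attention matrix.

```python
import numpy as np

def attention_partial(q, k, v):
    """One K/V block: return (exp-weighted values, row max, row sum)."""
    s = q @ k.T / np.sqrt(q.shape[-1])
    m = s.max(axis=-1, keepdims=True)
    p = np.exp(s - m)
    return p @ v, m, p.sum(axis=-1, keepdims=True)

def ring_attention(q_shards, k_shards, v_shards):
    P = len(q_shards)
    outs = []
    for r in range(P):                       # simulate worker r
        acc = m = l = None
        for step in range(P):                # P ring steps see every K/V shard
            idx = (r + step) % P             # stand-in for recv from ring neighbor
            o_i, m_i, l_i = attention_partial(q_shards[r], k_shards[idx], v_shards[idx])
            if acc is None:
                acc, m, l = o_i, m_i, l_i
            else:                            # online-softmax merge of partials
                m_new = np.maximum(m, m_i)
                acc = acc * np.exp(m - m_new) + o_i * np.exp(m_i - m_new)
                l = l * np.exp(m - m_new) + l_i * np.exp(m_i - m_new)
                m = m_new
        outs.append(acc / l)                 # normalize once at the end
    return np.concatenate(outs, axis=0)

# Toy usage: 4 workers, 16 tokens each, head dim 32.
rng = np.random.default_rng(0)
shards = lambda: [rng.standard_normal((16, 32)) for _ in range(4)]
out = ring_attention(shards(), shards(), shards())
```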
An Easy-to-understand TensorOp Matmul Tutorial
A fast communication-overlapping library for tensor parallelism on GPUs.
Fast and memory-efficient exact attention
Standalone Flash Attention v2 kernel without libtorch dependency
MSCCL++: A GPU-driven communication stack for scalable AI applications
Mooncake is the serving platform for Kimi, a leading LLM service provided by Moonshot AI.
High performance Transformer implementation in C++.
A low-latency & high-throughput serving engine for LLMs
Transformers with Arbitrarily Large Context
To speed up long-context LLM inference, attention is computed approximately with dynamic sparsity, which reduces pre-filling latency by up to 10x on an A100 while maintaining accuracy.
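A toy sketch of the general dynamic block-sparse idea (an illustration under simplifying assumptions; the repo's actual per-head sparse patterns and estimation step differ): cheaply score key blocks against a pooled query block, then run exact attention over only the top-k key blocks, so pre-fill cost scales with the blocks kept rather than the full sequence length.

```python
import numpy as np

def block_sparse_attention(q, k, v, block=64, keep=4):
    """Dynamic block-sparse attention; assumes len(q) is a multiple of block."""
    n, d = q.shape
    nb = n // block
    k_pooled = k.reshape(nb, block, d).mean(axis=1)      # one summary key per block
    out = np.zeros_like(q)
    for i in range(nb):
        qi = q[i * block:(i + 1) * block]
        score = qi.mean(axis=0) @ k_pooled.T             # cheap block-importance estimate
        top = np.sort(np.argsort(score)[-keep:])         # keep only the top-k key blocks
        ks = np.concatenate([k[j * block:(j + 1) * block] for j in top])
        vs = np.concatenate([v[j * block:(j + 1) * block] for j in top])
        s = qi @ ks.T / np.sqrt(d)                       # exact attention on kept blocks
        p = np.exp(s - s.max(axis=-1, keepdims=True))
        out[i * block:(i + 1) * block] = (p / p.sum(axis=-1, keepdims=True)) @ vs
    return out
```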
Material for cuda-mode lectures
Analyze the inference of Large Language Models (LLMs): computation, storage, transmission, and the hardware roofline model, all in a user-friendly interface.
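A worked example of the roofline arithmetic behind such an analysis (the peak numbers are illustrative assumptions for an A100-class GPU, not tool output): an operation is memory-bound when its arithmetic intensity (FLOPs per byte moved) falls below peak_flops / peak_bandwidth, and its time is then bounded by bytes moved over bandwidth rather than FLOPs over peak compute.

```python
PEAK_FLOPS = 312e12           # assumed FP16 tensor-core peak, FLOP/s
PEAK_BW = 2.0e12              # assumed HBM bandwidth, byte/s
RIDGE = PEAK_FLOPS / PEAK_BW  # ~156 FLOP/byte

def roofline_time(flops: float, bytes_moved: float) -> float:
    """Lower-bound execution time from the roofline model."""
    if flops / bytes_moved < RIDGE:
        return bytes_moved / PEAK_BW   # memory-bound
    return flops / PEAK_FLOPS          # compute-bound

# Batch-1 decode GEMV against one 4096x4096 FP16 weight:
# 2*K*N FLOPs, and weight bytes (2 per element) dominate the traffic.
flops = 2 * 4096 * 4096
bytes_moved = 2 * 4096 * 4096
print(f"intensity = {flops / bytes_moved:.1f} FLOP/B, "
      f"time >= {roofline_time(flops, bytes_moved) * 1e6:.1f} us")  # memory-bound
```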