Starred repositories
Official JAX implementation of Learning to (Learn at Test Time): RNNs with Expressive Hidden States
Analyze the inference of Large Language Models (LLMs): computation, storage, transmission, and the hardware roofline model, in a user-friendly interface.
Material for cuda-mode lectures
To speed up long-context LLM inference, compute attention with approximate and dynamic sparsity, reducing pre-filling latency by up to 10x on an A100 while maintaining accuracy.
Transformers with Arbitrarily Large Context
A low-latency & high-throughput serving engine for LLMs
High-performance Transformer implementation in C++.
OneDiff: An out-of-the-box acceleration library for diffusion models.
Mooncake is the serving platform for Kimi, a leading LLM service provided by Moonshot AI.
MSCCL++: A GPU-driven communication stack for scalable AI applications
Odysseus: Playground of LLM Sequence Parallelism
Standalone Flash Attention v2 kernel without libtorch dependency
A fast communication-overlapping library for tensor parallelism on GPUs.
A curated list of awesome LLM inference papers with code: TensorRT-LLM, vLLM, streaming-llm, AWQ, SmoothQuant, WINT8/4, Continuous Batching, FlashAttention, PagedAttention, etc.
An easy-to-understand TensorOp Matmul tutorial
KIVI: A Tuning-Free Asymmetric 2bit Quantization for KV Cache
Implementation of Ring Attention, from Liu et al. at Berkeley AI, in PyTorch
Official repository of Trino, the distributed SQL query engine for big data, formerly known as PrestoSQL (https://trino.io)
The official home of the Presto distributed SQL query engine for big data
Sequence Parallel Attention for Long Context LLM Model Training and Inference