Starred repositories

Official JAX implementation of Learning to (Learn at Test Time): RNNs with Expressive Hidden States

Python 171 7 Updated Jul 8, 2024

Analyze the inference of Large Language Models (LLMs). Analyze aspects like computation, storage, transmission, and hardware roofline model in a user-friendly interface.

Python 224 25 Updated Jun 23, 2024

Material for cuda-mode lectures

Jupyter Notebook 1,722 165 Updated Jun 13, 2024

Speeds up long-context LLM inference by computing attention with approximate, dynamic sparsity, which reduces pre-filling latency by up to 10x on an A100 while maintaining accuracy.

Python 347 11 Updated Jul 7, 2024

Transformers with Arbitrarily Large Context

Python 571 43 Updated Jul 8, 2024

A low-latency & high-throughput serving engine for LLMs

Python 74 9 Updated Jun 30, 2024

High performance Transformer implementation in C++.

C++ 43 2 Updated Apr 22, 2024

A Survey of AI startups

390 31 Updated Aug 27, 2023

OneDiff: An out-of-the-box acceleration library for diffusion models.

Python 1,440 85 Updated Jul 10, 2024

🐚 OpenDevin: Code Less, Make More

Python 28,565 3,276 Updated Jul 10, 2024

Mooncake is the serving platform for Kimi, a leading LLM service provided by Moonshot AI.

833 15 Updated Jul 10, 2024

MSCCL++: A GPU-driven communication stack for scalable AI applications

C++ 183 27 Updated Jul 9, 2024

Odysseus: Playground of LLM Sequence Parallelism

Python 39 Updated Jun 17, 2024

Standalone Flash Attention v2 kernel without libtorch dependency

C++ 79 12 Updated May 21, 2024

A fast communication-overlapping library for tensor parallelism on GPUs.

C++ 79 7 Updated Jul 9, 2024

📖 A curated list of Awesome LLM Inference papers with code: TensorRT-LLM, vLLM, streaming-llm, AWQ, SmoothQuant, WINT8/4, Continuous Batching, FlashAttention, PagedAttention, etc.

1,926 134 Updated Jul 8, 2024

An Easy-to-understand TensorOp Matmul Tutorial

C++ 221 22 Updated Jun 15, 2024

KIVI: A Tuning-Free Asymmetric 2bit Quantization for KV Cache

Python 167 15 Updated Jun 16, 2024

The official Meta Llama 3 GitHub site

Python 23,043 2,440 Updated Jul 3, 2024

Implementation of 💍 Ring Attention, from Liu et al. at Berkeley AI, in Pytorch

Python 404 24 Updated Jul 8, 2024

Microsoft Collective Communication Library

C++ 271 26 Updated Sep 20, 2023
Python 4,509 763 Updated Jul 9, 2024
Python 8,814 1,148 Updated Jul 9, 2024

Structured Text Generation

Python 7,083 365 Updated Jul 9, 2024

Official repository of Trino, the distributed SQL query engine for big data, formerly known as PrestoSQL (https://trino.io)

Java 9,888 2,860 Updated Jul 10, 2024

The official home of the Presto distributed SQL query engine for big data

Java 15,760 5,287 Updated Jul 10, 2024

Sequence Parallel Attention for Long Context LLM Model Training and Inference

Python 212 7 Updated Jun 27, 2024