UranusSeven's starred repositories

The memory layer for Personalized AI

Python 18,965 1,785 Updated Aug 4, 2024

Dynamic Memory Management for Serving LLMs without PagedAttention

C 151 10 Updated Aug 3, 2024

Tile primitives for speedy kernels

Cuda 1,427 53 Updated Aug 4, 2024

A safetensors extension to efficiently store sparse quantized tensors on disk

Python 19 Updated Aug 2, 2024

Official JAX implementation of Learning to (Learn at Test Time): RNNs with Expressive Hidden States

Python 315 22 Updated Jul 25, 2024

Analyze the inference of Large Language Models (LLMs), covering computation, storage, transmission, and the hardware roofline model in a user-friendly interface.

Python 242 27 Updated Jul 30, 2024

Material for cuda-mode lectures

Jupyter Notebook 2,036 203 Updated Jun 13, 2024

To speed up long-context LLM inference, approximate and dynamic sparse attention computation reduces pre-filling latency by up to 10x on an A100 while maintaining accuracy.

Python 609 21 Updated Aug 1, 2024

Transformers with Arbitrarily Large Context

Python 588 43 Updated Jul 13, 2024

A low-latency & high-throughput serving engine for LLMs

Python 127 17 Updated Jul 31, 2024

High performance Transformer implementation in C++.

C++ 53 4 Updated Apr 22, 2024

A Survey of AI startups

391 31 Updated Aug 27, 2023

OneDiff: An out-of-the-box acceleration library for diffusion models.

Python 1,501 91 Updated Aug 4, 2024

🐚 OpenDevin: Code Less, Make More

Python 29,424 3,400 Updated Aug 4, 2024

Mooncake is the serving platform for Kimi, a leading LLM service provided by Moonshot AI.

948 20 Updated Jul 31, 2024

MSCCL++: A GPU-driven communication stack for scalable AI applications

C++ 196 30 Updated Jul 26, 2024

Odysseus: Playground of LLM Sequence Parallelism

Python 47 1 Updated Jun 17, 2024

Standalone Flash Attention v2 kernel without libtorch dependency

C++ 79 12 Updated May 21, 2024

A fast communication-overlapping library for tensor parallelism on GPUs.

C++ 121 9 Updated Jul 25, 2024

📖 A curated list of Awesome LLM Inference Papers with code: TensorRT-LLM, vLLM, streaming-llm, AWQ, SmoothQuant, WINT8/4, Continuous Batching, FlashAttention, PagedAttention, etc.

2,128 144 Updated Aug 4, 2024

An easy-to-understand TensorOp Matmul Tutorial

C++ 234 26 Updated Jun 15, 2024

KIVI: A Tuning-Free Asymmetric 2bit Quantization for KV Cache

Python 182 15 Updated Jul 27, 2024

The official Meta Llama 3 GitHub site

Python 25,188 2,779 Updated Jul 31, 2024

Implementation of 💍 Ring Attention, from Liu et al. at Berkeley AI, in Pytorch

Python 423 24 Updated Jul 12, 2024

Microsoft Collective Communication Library

C++ 286 29 Updated Sep 20, 2023

Python 4,611 784 Updated Aug 4, 2024

Python 8,987 1,169 Updated Aug 2, 2024

Structured Text Generation

Python 7,471 382 Updated Aug 1, 2024