UranusSeven
Stars

HPC💻

A list of high-performance computing libraries.
23 repositories

Unified Communication X (mailing list - https://elist.ornl.gov/mailman/listinfo/ucx-group)

C · 1,073 stars · 409 forks · Updated Jul 10, 2024

FP16xINT4 LLM inference kernel that achieves near-ideal ~4x speedups at medium batch sizes of up to 16-32 tokens.

Python · 451 stars · 34 forks · Updated Jul 10, 2024
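The FP16xINT4 scheme above packs two 4-bit quantized weights per byte and dequantizes them to FP16 on the fly. A minimal NumPy sketch of that pack/unpack-and-scale step (the function names, layout, and symmetric zero-point of 8 are illustrative assumptions, not the library's actual kernel):

```python
import numpy as np

def pack_int4(w: np.ndarray) -> np.ndarray:
    """Pack pairs of 4-bit unsigned values (0..15) into single bytes."""
    assert w.ndim == 1 and w.size % 2 == 0
    return (w[0::2] | (w[1::2] << 4)).astype(np.uint8)

def dequantize_int4(packed: np.ndarray, scale: float, zero: int = 8) -> np.ndarray:
    """Unpack int4 pairs and map each value to fp16 as (q - zero) * scale."""
    lo = packed & 0x0F           # low nibble holds even-indexed weights
    hi = packed >> 4             # high nibble holds odd-indexed weights
    q = np.empty(packed.size * 2, dtype=np.int32)
    q[0::2], q[1::2] = lo, hi
    return ((q - zero) * scale).astype(np.float16)

q = np.array([0, 15, 8, 4], dtype=np.uint8)  # quantized weights in [0, 15]
packed = pack_int4(q)                        # 2 bytes instead of 4
w = dequantize_int4(packed, scale=0.5)       # fp16 values: -4.0, 3.5, 0.0, -2.0
```

The 2x storage saving (and ~4x versus fp16 weights) is where the memory-bandwidth speedup comes from: at small batch sizes the GEMM is bound by weight traffic, not FLOPs.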

FlashInfer: Kernel Library for LLM Serving

Cuda · 777 stars · 68 forks · Updated Jul 10, 2024
Jupyter Notebook · 417 stars · 22 forks · Updated Jun 25, 2024
Python · 1,120 stars · 154 forks · Updated May 28, 2024

High-speed Large Language Model Serving on PCs with Consumer-grade GPUs

C++ · 7,647 stars · 407 forks · Updated Jul 1, 2024

Ongoing research training transformer models at scale

Python · 9,351 stars · 2,108 forks · Updated Jul 10, 2024

TensorRT-LLM provides users with an easy-to-use Python API to define Large Language Models (LLMs) and build TensorRT engines that contain state-of-the-art optimizations to perform inference efficie…

C++ · 7,432 stars · 802 forks · Updated Jul 10, 2024

Accelerate local LLM inference and finetuning (LLaMA, Mistral, ChatGLM, Qwen, Baichuan, Mixtral, Gemma, Phi, etc.) on Intel CPU and GPU (e.g., local PC with iGPU, discrete GPU such as Arc, Flex and…

Python · 6,279 stars · 1,226 forks · Updated Jul 10, 2024

Ring attention implementation with flash attention

Python · 433 stars · 30 forks · Updated May 20, 2024

Microsoft Collective Communication Library

C++ · 271 stars · 26 forks · Updated Sep 20, 2023

An easy-to-understand TensorOp Matmul tutorial.

C++ · 221 stars · 22 forks · Updated Jun 15, 2024

A fast communication-overlapping library for tensor parallelism on GPUs.

C++ · 79 stars · 7 forks · Updated Jul 9, 2024

Fast and memory-efficient exact attention

Python · 11,956 stars · 1,062 forks · Updated Jul 10, 2024
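The memory efficiency of the exact-attention kernels above comes from processing keys and values in tiles with an online softmax, so the full attention-score matrix is never materialized. A single-query NumPy sketch of that rescale-and-accumulate recurrence (a teaching sketch of the math, not the library's fused CUDA kernel):

```python
import numpy as np

def attention_online(q, K, V, block=2):
    """Single-query attention over K/V processed in tiles: the running
    max m and denominator l let earlier partial sums be rescaled when a
    larger score appears, so only O(block) scores exist at a time."""
    m = -np.inf                                # running max of scores
    l = 0.0                                    # running softmax denominator
    acc = np.zeros_like(V[0], dtype=np.float64)
    for i in range(0, len(K), block):
        s = K[i:i + block] @ q                 # one tile of scores
        m_new = max(m, s.max())
        p = np.exp(s - m_new)
        corr = np.exp(m - m_new)               # rescale previous state
        l = l * corr + p.sum()
        acc = acc * corr + p @ V[i:i + block]
        m = m_new
    return acc / l

rng = np.random.default_rng(0)
q = rng.normal(size=4)
K = rng.normal(size=(6, 4))
V = rng.normal(size=(6, 4))
out = attention_online(q, K, V)

# Reference: standard softmax attention over all keys at once.
s = K @ q
p = np.exp(s - s.max()); p /= p.sum()
ref = p @ V
```

The same recurrence is what lets the ring-attention implementation listed above merge per-device partial results as K/V chunks rotate around the ring.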

Standalone Flash Attention v2 kernel without libtorch dependency

C++ · 79 stars · 12 forks · Updated May 21, 2024

MSCCL++: A GPU-driven communication stack for scalable AI applications

C++ · 183 stars · 27 forks · Updated Jul 9, 2024

Mooncake is the serving platform for Kimi, a leading LLM service provided by Moonshot AI.

833 stars · 15 forks · Updated Jul 10, 2024

High-performance Transformer implementation in C++.

C++ · 43 stars · 2 forks · Updated Apr 22, 2024

A low-latency & high-throughput serving engine for LLMs

Python · 74 stars · 9 forks · Updated Jun 30, 2024

Transformers with Arbitrarily Large Context

Python · 571 stars · 43 forks · Updated Jul 8, 2024

Speeds up long-context LLM inference by computing attention with approximate and dynamic sparsity, reducing pre-filling latency by up to 10x on an A100 while maintaining accuracy.

Python · 347 stars · 11 forks · Updated Jul 7, 2024
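The sparse-attention idea above is to score keys cheaply and then attend only to the most relevant ones. A generic top-k sketch of that pattern (the repo's approximate/dynamic patterns are more elaborate; this function and its name are illustrative only):

```python
import numpy as np

def topk_sparse_attention(q, K, V, k):
    """Attend only to the k highest-scoring keys; with k == len(K) this
    reduces to ordinary dense softmax attention."""
    s = K @ q
    idx = np.argpartition(s, -k)[-k:]        # indices of the top-k scores
    p = np.exp(s[idx] - s[idx].max())        # softmax over selected keys only
    p /= p.sum()
    return p @ V[idx]

rng = np.random.default_rng(1)
q = rng.normal(size=8)
K = rng.normal(size=(32, 8))
V = rng.normal(size=(32, 8))
sparse_out = topk_sparse_attention(q, K, V, k=8)   # touches 25% of the keys
dense_out = topk_sparse_attention(q, K, V, k=32)   # equals dense attention
```

During pre-filling the score matrix is quadratic in context length, so skipping most key/value reads per query is where the latency reduction comes from.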

Material for cuda-mode lectures

Jupyter Notebook · 1,722 stars · 165 forks · Updated Jun 13, 2024

Analyze the inference of Large Language Models (LLMs): computation, storage, transmission, and the hardware roofline model, in a user-friendly interface.

Python · 224 stars · 25 forks · Updated Jun 23, 2024
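The roofline analysis mentioned above boils down to comparing a kernel's arithmetic intensity (FLOPs per byte moved) against the hardware's compute-to-bandwidth ratio. A toy sketch (peak numbers are illustrative assumptions, roughly A100-class fp16 tensor-core throughput and HBM bandwidth):

```python
def roofline_time(flops, bytes_moved, peak_flops=312e12, peak_bw=2.0e12):
    """Estimate kernel time as the max of compute-bound and memory-bound time."""
    return max(flops / peak_flops, bytes_moved / peak_bw)

def is_memory_bound(flops, bytes_moved, peak_flops=312e12, peak_bw=2.0e12):
    intensity = flops / bytes_moved          # FLOPs per byte
    ridge = peak_flops / peak_bw             # ridge point of the roofline
    return intensity < ridge

# Example: one decode-step GEMV against a 4096x4096 fp16 weight matrix:
# 2*N*N FLOPs, and roughly 2*N*N bytes of weight traffic.
n = 4096
flops = 2 * n * n
bytes_moved = 2 * n * n   # fp16 weights dominate the traffic
```

At 1 FLOP/byte this lands far below the ridge point (~156 FLOPs/byte here), which is the standard explanation for why single-request LLM decoding is memory-bandwidth bound.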