Stars
An experiment in using Tangent to autodiff Triton
A subset of PyTorch's neural network modules, written in Python using OpenAI's Triton.
uwsampl / gpt-fast
Forked from pytorch-labs/gpt-fast
Simple and efficient PyTorch-native transformer text generation in <1000 LOC of Python.
Python package for rematerialization-aware gradient checkpointing
A max pool filter implemented in CUDA using shared memory
A demo of Rust and the axum web framework with Tokio, Tower, Hyper, and Serde
Solutions to introductory distributed computing exercises
A low-latency & high-throughput serving engine for LLMs
FlexFlow Serve: Low-Latency, High-Performance LLM Serving
Flash Attention in ~100 lines of CUDA (forward pass only)
Modeling, training, eval, and inference code for OLMo
Optimizing SGEMM kernels on NVIDIA GPUs to close-to-cuBLAS performance.
Triton-based implementation of Sparse Mixture of Experts.
An attempt at achieving the theoretical best memory bandwidth of my machine.
High performance Transformer implementation in C++.
A Flexible Framework for Experiencing Cutting-edge LLM Inference Optimizations
Reference implementation of Megalodon 7B model
A throughput-oriented high-performance serving framework for LLMs
A tiny yet powerful LLM inference system tailored for research purposes. vLLM-equivalent performance with only 2k lines of code (2% of vLLM).
Bringing stable diffusion models to web browsers. Everything runs inside the browser with no server support.
Helps you write algorithms in PyTorch that adapt to the available (CUDA) memory
SGLang is a fast serving framework for large language models and vision language models.
Transformers with Arbitrarily Large Context
Finetune Llama 3.2, Mistral, Phi, Qwen & Gemma LLMs 2-5x faster with 80% less memory
Edit anything in images, powered by segment-anything, ControlNet, StableDiffusion, etc. (ACM MM)
TJ-Solergibert / Megatron-LM
Forked from NVIDIA/Megatron-LM
Debugging Megatron: 3D parallelism, models, training, and more!