zhenxl · chitu.ai · Beijing


Experiment in using Tangent to autodiff Triton

Python · 70 stars · 1 fork · Updated Jan 22, 2024

A subset of PyTorch's neural network modules, written in Python using OpenAI's Triton.

Python · 473 stars · 21 forks · Updated Oct 25, 2024

Simple and efficient PyTorch-native transformer text generation in <1000 lines of Python.

Python · 1 star · Updated Oct 4, 2024

Python package for rematerialization-aware gradient checkpointing

Python · 23 stars · 3 forks · Updated Oct 31, 2023

Implemented the max pool filter in CUDA using shared memory

Cuda · 5 stars · 1 fork · Updated Sep 10, 2019

Jupyter Notebook · 1 star · Updated Apr 25, 2024

Demo of Rust and axum web framework with Tokio, Tower, Hyper, Serde

Rust · 363 stars · 30 forks · Updated Oct 17, 2024

Solutions to introductory distributed computing exercises

Rust · 9 stars · 1 fork · Updated Apr 9, 2023

A low-latency & high-throughput serving engine for LLMs

Python · 230 stars · 31 forks · Updated Sep 12, 2024

FlexFlow Serve: Low-Latency, High-Performance LLM Serving

C++ · 1,696 stars · 226 forks · Updated Nov 7, 2024

Flash Attention in ~100 lines of CUDA (forward pass only)

Cuda · 603 stars · 54 forks · Updated Apr 7, 2024

Modeling, training, eval, and inference code for OLMo

Python · 4,598 stars · 468 forks · Updated Nov 7, 2024

Optimizing SGEMM kernels on NVIDIA GPUs to close-to-cuBLAS performance.

Cuda · 274 stars · 43 forks · Updated Nov 28, 2021

Triton-based implementation of Sparse Mixture of Experts.

Python · 185 stars · 14 forks · Updated Oct 10, 2024

An attempt at achieving the theoretical best memory bandwidth of my machine.

C · 52 stars · 19 forks · Updated May 19, 2013

High performance Transformer implementation in C++.

C++ · 77 stars · 10 forks · Updated Sep 14, 2024

A Flexible Framework for Experiencing Cutting-edge LLM Inference Optimizations

Python · 729 stars · 37 forks · Updated Nov 6, 2024

Reference implementation of Megalodon 7B model

Cuda · 504 stars · 52 forks · Updated Apr 18, 2024

Jupyter Notebook · 764 stars · 372 forks · Updated Mar 12, 2024

A throughput-oriented high-performance serving framework for LLMs

Cuda · 627 stars · 24 forks · Updated Sep 21, 2024

A tiny yet powerful LLM inference system tailored for research purposes. vLLM-equivalent performance in only 2k lines of code (2% of vLLM).

Python · 95 stars · 7 forks · Updated Jul 5, 2024

Bringing stable diffusion models to web browsers. Everything runs inside the browser with no server support.

Jupyter Notebook · 3,589 stars · 227 forks · Updated Mar 12, 2024

Helps you write algorithms in PyTorch that adapt to the available (CUDA) memory

Python · 424 stars · 10 forks · Updated Aug 29, 2024

SGLang is a fast serving framework for large language models and vision language models.

Python · 5,911 stars · 482 forks · Updated Nov 7, 2024

Transformers with Arbitrarily Large Context

Python · 635 stars · 52 forks · Updated Aug 12, 2024

A cross-platform browser ML framework.

Rust · 616 stars · 33 forks · Updated Nov 4, 2024

Finetune Llama 3.2, Mistral, Phi, Qwen & Gemma LLMs 2-5x faster with 80% less memory

Python · 17,845 stars · 1,237 forks · Updated Nov 7, 2024

Edit anything in images powered by segment-anything, ControlNet, StableDiffusion, etc. (ACM MM)

Python · 3,318 stars · 189 forks · Updated Feb 29, 2024

Debugging Megatron: 3D parallelism, models, training, and more!

Python · 2 stars · Updated Oct 9, 2024