Frank Wei weizhenhuan

🎯

Focusing

Building a high-performance deep learning inference framework (especially for LLMs).

9 followers · 73 following

Xi'an Jiaotong University
Xi'an
19:48 (UTC +08:00)
https://t.me/frankwei0109

Stars

DefTruth / CUDA-Learn-Notes

📚Modern CUDA Learn Notes with PyTorch: Tensor/CUDA Cores, 📖150+ CUDA Kernels with PyTorch bindings, 📖HGEMM/SGEMM (95%~99% cuBLAS performance), 📖100+ LLM/CUDA Blogs.

Cuda · 1,462 stars · 160 forks · Updated Nov 18, 2024

TiledTensor / TiledCUDA

TiledCUDA is a highly efficient kernel template library designed to elevate CUDA C’s level of abstraction for processing tiles.

C++ · 153 stars · 10 forks · Updated Nov 18, 2024

linkedin / Liger-Kernel

Efficient Triton Kernels for LLM Training

Python · 3,434 stars · 202 forks · Updated Nov 18, 2024

gpu-mode / triton-index

Cataloging released Triton kernels.

134 stars · 7 forks · Updated Aug 26, 2024

karpathy / LLM101n

LLM101n: Let's build a Storyteller

30,177 stars · 1,647 forks · Updated Aug 1, 2024

matrix97317 / OneNeuralNetwork

This is a cross-chip platform collection of operators and a unified neural network library.

Python · 13 stars · 1 fork · Updated Nov 3, 2023

IST-DASLab / gptq

Code for the ICLR 2023 paper "GPTQ: Accurate Post-training Quantization of Generative Pretrained Transformers".

Python · 1,942 stars · 154 forks · Updated Mar 27, 2024

isocpp / CppCoreGuidelines

The C++ Core Guidelines are a set of tried-and-true guidelines, rules, and best practices about coding in C++.

CSS · 42,849 stars · 5,439 forks · Updated Oct 24, 2024

Jermmy / pytorch-quantization-demo

A simple network quantization demo using PyTorch from scratch.

Python · 510 stars · 97 forks · Updated Jun 18, 2023

AI-Hypercomputer / maxtext

A simple, performant and scalable Jax LLM!

Python · 1,529 stars · 294 forks · Updated Nov 18, 2024

NVIDIA / cccl

CUDA Core Compute Libraries

C++ · 1,274 stars · 163 forks · Updated Nov 18, 2024

xai-org / grok-1

Grok open release

Python · 49,575 stars · 8,320 forks · Updated Aug 30, 2024

openai / transformer-debugger

Python · 4,035 stars · 236 forks · Updated Jun 4, 2024

horovod / horovod

Distributed training framework for TensorFlow, Keras, PyTorch, and Apache MXNet.

Python · 14,266 stars · 2,240 forks · Updated Aug 31, 2024

ollama / ollama

Get up and running with Llama 3.2, Mistral, Gemma 2, and other large language models.

Go · 98,278 stars · 7,823 forks · Updated Nov 18, 2024

KnowingNothing / MatmulTutorial

An Easy-to-understand TensorOp Matmul Tutorial

C++ · 290 stars · 31 forks · Updated Sep 21, 2024

ELS-RD / kernl

Kernl lets you run PyTorch transformer models several times faster on GPU with a single line of code, and is designed to be easily hackable.

Jupyter Notebook · 1,535 stars · 94 forks · Updated Feb 16, 2024

SJTU-IPADS / PowerInfer

High-speed Large Language Model Serving on PCs with Consumer-grade GPUs

C++ · 7,967 stars · 413 forks · Updated Sep 6, 2024

haotian-liu / LLaVA

[NeurIPS'23 Oral] Visual Instruction Tuning (LLaVA) built towards GPT-4V level capabilities and beyond.

Python · 20,270 stars · 2,238 forks · Updated Aug 12, 2024

siliconflow / onediff

OneDiff: An out-of-the-box acceleration library for diffusion models.

Jupyter Notebook · 1,699 stars · 103 forks · Updated Nov 14, 2024

chadbaldwin / chadbaldwin.github.io

The repo that drives my blog, chadbaldwin.net

HTML · 33 stars · 36 forks · Updated Aug 6, 2024

NUS-HPC-AI-Lab / InfoBatch

Lossless Training Speed Up by Unbiased Dynamic Data Pruning

Python · 318 stars · 18 forks · Updated Sep 24, 2024

microsoft / nni

An open-source AutoML toolkit for automating the machine learning lifecycle, including feature engineering, neural architecture search, model compression, and hyper-parameter tuning.

Python · 14,056 stars · 1,818 forks · Updated Jul 3, 2024