Starred repositories
A high-performance, large-capacity, multi-tenant, persistent, Redis-compatible elastic KV storage system built on RocksDB, with strong data consistency based on Raft.
Flash Attention in ~100 lines of CUDA (forward pass only)
A minimal flash-attention implementation built with CUTLASS, intended for teaching.
FP8 flash attention on the Ada architecture, implemented with the CUTLASS repository.
Mooncake is the serving platform for Kimi, a leading LLM service provided by Moonshot AI.
How to learn PyTorch and OneFlow.
A programmer's guide to cooking at home (Simplified Chinese only).
Source code for all entries from the 2023 ZPrize competition
Fast and memory-efficient exact attention
A high-throughput and memory-efficient inference and serving engine for LLMs
DeepSeek-V2: A Strong, Economical, and Efficient Mixture-of-Experts Language Model
A minimal GPU design in Verilog to learn how GPUs work from the ground up
Pika is a Redis-Compatible database developed by Qihoo's infrastructure team.
A fast, light-weight proxy for memcached and redis
Public repository for the DAC 2021 paper "Scaling up HBM Efficiency of Top-K SpMV for Approximate Embedding Similarity on FPGAs".
A fast inference library for running LLMs locally on modern consumer-class GPUs
🎉 Modern CUDA Learn Notes with PyTorch: fp32, fp16, bf16, fp8/int8, flash_attn, sgemm, sgemv, warp/block reduce, dot, elementwise, softmax, layernorm, rmsnorm.
GNU Libc - Extremely old repo used for research purposes years ago. Please do not rely on this repo.
man-pages-zh/manpages-zh: Chinese Manual Pages (forked from lidaobing/manpages-zh).
An unofficial cuda assembler, for all generations of SASS, hopefully :)