Stars
HolyChen / cuda-tutorial
Forked from YunYang1994/face_recognition. Study notes on the CUDA Programming Guide.
FLOPs counter for convolutional networks in the PyTorch framework
The latest Typora activation crack: activate in three steps. 😊 Updated in real time / 👩🎓 a must-have for students; if you can afford to support the official version, please don't open this 🔞🈲️. Activate Typora
Standalone Flash Attention v2 kernel without libtorch dependency
Efficient Triton Kernels for LLM Training
Graphic notes on Gilbert Strang's "Linear Algebra for Everyone"
Official repository for LightSeq: Sequence Level Parallelism for Distributed Training of Long Context Transformers
Ring attention implementation with flash attention
USP: Unified (a.k.a. Hybrid, 2D) Sequence Parallel Attention for Long Context Transformer Model Training and Inference
Implementation of 💍 Ring Attention, from Liu et al. at Berkeley AI, in Pytorch
TiledCUDA is a highly efficient kernel template library designed to elevate CUDA C’s level of abstraction for processing tiles.
Facebook AI Research Sequence-to-Sequence Toolkit written in Python.
Automatically split your PyTorch models on multiple GPUs for training & inference
Large-scale Self-supervised Pre-training Across Tasks, Languages, and Modalities
InternEvo is an open-sourced lightweight training framework that aims to support model pre-training without the need for extensive dependencies.
Zero Bubble Pipeline Parallelism
🎉 Modern CUDA Learn Notes with PyTorch: fp32, fp16, bf16, fp8/int8, flash_attn, sgemm, sgemv, warp/block reduce, dot, elementwise, softmax, layernorm, rmsnorm.
kwai / Megatron-Kwai
Forked from NVIDIA/Megatron-LM. [USENIX ATC '24] Accelerating the Training of Large Language Models using Efficient Activation Rematerialization and Optimal Hybrid Parallelism
triton-lang / triton-cpu
Forked from triton-lang/triton. An experimental CPU backend for Triton
Efficient GPU support for LLM inference with x-bit quantization (e.g., FP6, FP5).