Stars
HolyChen / cuda-tutorial
Forked from YunYang1994/face_recognition. Study notes on the CUDA Programming Guide.
FLOPs counter for convolutional networks in the PyTorch framework
The latest Typora activation crack: activate in three steps. 😊 Updated in real time / 👩🎓 a must-have for students; if you can afford to support the official version, please don't open this 🔞🈲️. Activate Typora
Standalone Flash Attention v2 kernel without libtorch dependency
Efficient Triton Kernels for LLM Training
Graphic notes on Gilbert Strang's "Linear Algebra for Everyone"
Official repository for LightSeq: Sequence Level Parallelism for Distributed Training of Long Context Transformers
Ring attention implementation with flash attention
USP: Unified (a.k.a. Hybrid, 2D) Sequence Parallel Attention for Long Context Transformer Model Training and Inference
Implementation of 💍 Ring Attention, from Liu et al. at Berkeley AI, in Pytorch
TiledCUDA is a highly efficient kernel template library designed to elevate CUDA C’s level of abstraction for processing tiles.
Facebook AI Research Sequence-to-Sequence Toolkit written in Python.
Automatically split your PyTorch models on multiple GPUs for training & inference
Large-scale Self-supervised Pre-training Across Tasks, Languages, and Modalities
InternEvo is an open-sourced lightweight training framework that aims to support model pre-training without the need for extensive dependencies.
Zero Bubble Pipeline Parallelism
🎉 Modern CUDA Learn Notes with PyTorch: fp32, fp16, bf16, fp8/int8, flash_attn, sgemm, sgemv, warp/block reduce, dot, elementwise, softmax, layernorm, rmsnorm.
kwai / Megatron-Kwai
Forked from NVIDIA/Megatron-LM. [USENIX ATC '24] Accelerating the Training of Large Language Models using Efficient Activation Rematerialization and Optimal Hybrid Parallelism
triton-lang / triton-cpu
Forked from triton-lang/triton. An experimental CPU backend for Triton
Efficient GPU support for LLM inference with x-bit quantization (e.g., FP6, FP5).