Extensible collectives library in Triton

Python 24 1 Updated Sep 23, 2024

Study notes on the CUDA Programming Guide

Cuda 27 8 Updated Oct 2, 2018

FLOPs counter for convolutional networks in the PyTorch framework

Python 2,787 308 Updated Sep 27, 2024
Python 12 3 Updated Jul 7, 2024

The latest Typora activation crack: activate in three steps. 😊 Continuously updated / 👩‍🎓 a must-have for students; if you can afford the genuine version, please don't use this 🔞🈲️. Activate Typora

928 120 Updated Jul 15, 2024

Standalone Flash Attention v2 kernel without libtorch dependency

C++ 96 13 Updated Sep 10, 2024

Efficient Triton Kernels for LLM Training

Python 3,114 159 Updated Oct 3, 2024

Graphic notes on Gilbert Strang's "Linear Algebra for Everyone"

PostScript 17,780 2,167 Updated Feb 4, 2024

Official repository for LightSeq: Sequence Level Parallelism for Distributed Training of Long Context Transformers

Python 188 9 Updated Aug 19, 2024

Ring attention implementation with flash attention

Python 544 42 Updated Sep 20, 2024

USP: Unified (a.k.a. Hybrid, 2D) Sequence Parallel Attention for Long Context Transformers Model Training and Inference

Python 322 20 Updated Sep 19, 2024

Tile primitives for speedy kernels

Cuda 1,514 58 Updated Oct 3, 2024

Implementation of 💍 Ring Attention, from Liu et al. at Berkeley AI, in Pytorch

Python 459 27 Updated Aug 15, 2024

TiledCUDA is a highly efficient kernel template library designed to elevate CUDA C’s level of abstraction for processing tiles.

C++ 134 9 Updated Oct 2, 2024

Facebook AI Research Sequence-to-Sequence Toolkit written in Python.

Python 30,238 6,384 Updated Sep 27, 2024

Automatically split your PyTorch models on multiple GPUs for training & inference

Python 619 38 Updated Jan 2, 2024

Large-scale Self-supervised Pre-training Across Tasks, Languages, and Modalities

Python 19,628 2,505 Updated Sep 30, 2024
Python 15 1 Updated Jun 5, 2024

InternEvo is an open-source, lightweight training framework that aims to support model pre-training without the need for extensive dependencies.

Python 286 47 Updated Sep 29, 2024

Zero Bubble Pipeline Parallelism

Python 263 13 Updated Sep 4, 2024

Yinghan's Code Sample

Cuda 279 53 Updated Jul 25, 2022

🎉 Modern CUDA Learn Notes with PyTorch: fp32, fp16, bf16, fp8/int8, flash_attn, sgemm, sgemv, warp/block reduce, dot, elementwise, softmax, layernorm, rmsnorm.

Cuda 1,223 132 Updated Oct 2, 2024

NLTK Source

Python 13,477 2,869 Updated Sep 25, 2024
C++ 471 85 Updated Oct 3, 2024

C++ extensions in PyTorch

Python 995 209 Updated Aug 7, 2024

[USENIX ATC '24] Accelerating the Training of Large Language Models using Efficient Activation Rematerialization and Optimal Hybrid Parallelism

Python 42 1 Updated Jul 31, 2024

An experimental CPU backend for Triton

C++ 41 12 Updated Sep 30, 2024

MOSS-RLHF

Python 1,274 98 Updated Mar 3, 2024

Efficient GPU support for LLM inference with x-bit quantization (e.g., FP6, FP5).

Cuda 181 15 Updated May 28, 2024

Row-major matmul optimization

C++ 586 78 Updated Sep 9, 2023