Stars
Puzzles for learning Triton, playable with minimal environment configuration!
My learning notes and code for ML systems.
A flexible package manager that supports multiple versions, configurations, platforms, and compilers.
An interactive NVIDIA-GPU process viewer and beyond, the one-stop solution for GPU process management.
Generic PyTorch implementation of einsum that supports different semirings
Spring4Shell - Spring Core RCE - CVE-2022-22965
TensorRT-LLM provides users with an easy-to-use Python API to define Large Language Models (LLMs) and build TensorRT engines that contain state-of-the-art optimizations to perform inference efficie…
NVIDIA Data Center GPU Manager (DCGM) is a project for gathering telemetry and measuring the health of NVIDIA GPUs
🚀 Awesome System for Machine Learning ⚡️ AI System Papers and Industry Practice. ⚡️ System for Machine Learning, LLM (Large Language Model), GenAI (Generative AI). 🍻 OSDI, NSDI, SIGCOMM, SoCC, MLSy…
Machine-Learning Accelerator System Exploration Tools
Examples of CUDA implementations using CUTLASS CuTe
Seamless operability between C++11 and Python
An Easy-to-use, Scalable and High-performance RLHF Framework (70B+ PPO Full Tuning & Iterative DPO & LoRA & RingAttention)
Dockerized Spring4Shell (CVE-2022-22965) PoC application and exploit
Remote Unauthenticated Code Execution Vulnerability in OpenSSH server (CVE-2024-6387)
Unified Communication X (mailing list - https://elist.ornl.gov/mailman/listinfo/ucx-group)
Easy Parallel Library (EPL) is a general and efficient deep learning framework for distributed model training.
Mirage: Automatically Generating Fast GPU Kernels without Programming in Triton/CUDA
Simple and fast low-bit matmul kernels in CUDA / Triton
How to optimize some algorithms in CUDA.
A collection of benchmarks to measure basic GPU capabilities
BS::thread_pool: a fast, lightweight, and easy-to-use C++17 thread pool library
A Python library that transfers PyTorch tensors between CPU and NVMe