Stars
Latency and Memory Analysis of Transformer Models for Training and Inference
RoLoRA: Fine-tuning Rotated Outlier-free LLMs for Effective Weight-Activation Quantization
EfficientQAT: Efficient Quantization-Aware Training for Large Language Models
A course to get into Large Language Models (LLMs), with roadmaps and Colab notebooks.
AIGC-interview/CV-interview/LLMs-interview: a collection of interview questions and answers, along with new ideas, problems, resources, and projects from work and research.
[ICML2024 (Oral)] Official PyTorch implementation of DoRA: Weight-Decomposed Low-Rank Adaptation
A comprehensive paper list of Vision Transformer/Attention work, including papers, code, and related websites
This is the official PyTorch implementation of "LLMC: Benchmarking Large Language Model Quantization with a Versatile Compression Toolkit"
Mooncake is the serving platform for Kimi, a leading LLM service provided by Moonshot AI.
🎉 CUDA & C++ notes / hand-written CUDA kernels for large models / tech blog, updated irregularly: flash_attn, sgemm, sgemv, warp reduce, block reduce, dot product, elementwise, softmax, layernorm, rmsnorm, hist, etc.
AISystem refers to AI systems, covering full-stack low-level AI technologies such as AI chips, AI compilers, and AI inference and training frameworks
An easy-to-understand TensorOp Matmul tutorial
Continual Learning of Large Language Models: A Comprehensive Survey
[NeurIPS'23] H2O: Heavy-Hitter Oracle for Efficient Generative Inference of Large Language Models.
Triton Implementation of HyperAttention Algorithm
Learning-to-Cache: Accelerating Diffusion Transformer via Layer Caching
[ICLR 2024 Spotlight] Code for the paper "Merge, Then Compress: Demystify Efficient SMoE with Hints from Its Routing Policy"
Official code for the paper "Examining Post-Training Quantization for Mixture-of-Experts: A Benchmark"
mi-optimize is a versatile tool designed for the quantization and evaluation of large language models (LLMs). The library's seamless integration of various quantization methods and evaluation techniques…
🚀 A collection of components for development, training, tuning, and inference of foundation models, leveraging PyTorch native components.
QServe: W4A8KV4 Quantization and System Co-design for Efficient LLM Serving
Ray is a unified framework for scaling AI and Python applications. Ray consists of a core distributed runtime and a set of AI Libraries for accelerating ML workloads.
DeepSeek-V2: A Strong, Economical, and Efficient Mixture-of-Experts Language Model
A list of papers on Vision Transformer quantization and hardware acceleration from recent AI conferences and journals.