Starred repositories
TensorRT Model Optimizer is a unified library of state-of-the-art model optimization techniques such as quantization and sparsity. It compresses deep learning models for downstream deployment frame…
Material for cuda-mode lectures
RTP-LLM: Alibaba's high-performance LLM inference engine for diverse applications.
A collection of benchmarks to measure basic GPU capabilities
A llama3 implementation, one matrix multiplication at a time
The code of our paper "InfLLM: Unveiling the Intrinsic Capacity of LLMs for Understanding Extremely Long Sequences with Training-Free Memory"
Standalone Flash Attention v2 kernel without libtorch dependency
[ICML'24] Data and code for our paper "Training-Free Long-Context Scaling of Large Language Models"
A Native-PyTorch Library for LLM Fine-tuning
Awesome LLM compression research papers and tools.
📖 A curated list of awesome LLM inference papers with code: TensorRT-LLM, vLLM, streaming-llm, AWQ, SmoothQuant, WINT8/4, continuous batching, FlashAttention, PagedAttention, etc.
A general 2-8 bit quantization toolbox with GPTQ/AWQ/HQQ, with easy export to ONNX/ONNX Runtime.
wyf9661 / typora-free
Forked from zogodo/typora-0.11.18
typora-0.11.18 (last free version)
👩🏿💻👨🏾💻👩🏼💻👨🏽💻👩🏻💻 A list of projects by independent developers in China -- sharing what everyone is working on
Accessible large language models via k-bit quantization for PyTorch.
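This is the bitsandbytes library; the common entry point today is through transformers' `BitsAndBytesConfig`. A minimal sketch of loading a model with 4-bit NF4 weights (the model id is a placeholder, not a recommendation):

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

# 4-bit NF4 weight quantization with fp16 compute, via bitsandbytes.
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.float16,
)

# Placeholder model id; any causal LM on the Hub works the same way.
model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-7b-hf",
    quantization_config=bnb_config,
    device_map="auto",
)
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-2-7b-hf")
```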
Code associated with the paper "Draft & Verify: Lossless Large Language Model Acceleration via Self-Speculative Decoding"
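The loop behind the paper's title is simple to state: draft a few tokens cheaply (in the paper, by skipping layers of the same model), then verify them with one full forward pass and keep the matching prefix. A minimal greedy sketch, with `draft_step` and `target_step` as hypothetical stand-ins for the cheap and full models:

```python
def speculative_generate(tokens, draft_step, target_step, k=4, max_new=64):
    """Greedy draft-&-verify loop. `tokens` is a list of token ids."""
    target_len = len(tokens) + max_new
    while len(tokens) < target_len:
        # Draft phase: propose k tokens autoregressively with the cheap model.
        ctx = list(tokens)
        draft = []
        for _ in range(k):
            t = draft_step(ctx)
            draft.append(t)
            ctx.append(t)
        # Verify phase: one full forward pass scores the drafted span.
        # verified[i] is the full model's greedy token after tokens + draft[:i],
        # so it has length k + 1.
        verified = target_step(tokens, draft)
        n = 0
        while n < k and draft[n] == verified[n]:
            n += 1
        tokens.extend(draft[:n])    # accepted prefix
        tokens.append(verified[n])  # bonus token from the full pass
    return tokens
```

Because verification is exact-match against the full model's greedy choice, the output is identical to plain decoding, which is what makes the acceleration lossless.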
[ICLR 2024] Efficient Streaming Language Models with Attention Sinks
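The core trick is small enough to sketch: keep the KV entries of the first few tokens (the "attention sinks") plus a window of the most recent tokens, and evict everything in between. A toy eviction policy over a list-like cache; the real repo operates on per-layer key/value tensors:

```python
def evict_kv_cache(cache, n_sink=4, window=1020):
    """Toy StreamingLLM-style eviction: keep the first n_sink entries
    (attention sinks) and the most recent `window` entries, drop the middle.
    `cache` is a list of per-token KV entries for illustration only."""
    if len(cache) <= n_sink + window:
        return cache
    return cache[:n_sink] + cache[-window:]
```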
A high-throughput and memory-efficient inference and serving engine for LLMs
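This is vLLM; its offline API is a few lines. A sketch with a placeholder model id and arbitrary sampling settings:

```python
from vllm import LLM, SamplingParams

# Placeholder model id; vLLM batches requests continuously and pages the
# KV cache (PagedAttention) under the hood.
llm = LLM(model="meta-llama/Llama-2-7b-hf")
params = SamplingParams(temperature=0.8, top_p=0.95, max_tokens=64)

outputs = llm.generate(["Explain PagedAttention in one sentence."], params)
print(outputs[0].outputs[0].text)
```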
A CMake toolchain file for iOS/iPadOS, visionOS, macOS, watchOS & tvOS C/C++/Obj-C++ development
The NVIDIA® Tools Extension SDK (NVTX) is a C-based Application Programming Interface (API) for annotating events, code ranges, and resources in your applications.
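NVTX ranges bracket regions of code so they show up named on Nsight Systems timelines. The SDK itself is C, but PyTorch ships thin bindings in `torch.cuda.nvtx`; a sketch (requires a CUDA device):

```python
import torch

x = torch.randn(1024, 1024, device="cuda")

# The pushed range appears as a named span around the matmul in the profiler.
torch.cuda.nvtx.range_push("matmul")
y = x @ x
torch.cuda.nvtx.range_pop()
```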
The official implementation of the EMNLP 2023 paper LLM-FP4
AutoAWQ implements the AWQ algorithm for 4-bit quantization with a 2x speedup during inference.
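A quantization sketch following AutoAWQ's documented flow; the model path, output path, and exact config keys are assumptions and may shift between releases:

```python
from awq import AutoAWQForCausalLM
from transformers import AutoTokenizer

model_path = "meta-llama/Llama-2-7b-hf"   # placeholder
quant_path = "llama-2-7b-awq"             # placeholder output dir
quant_config = {"zero_point": True, "q_group_size": 128,
                "w_bit": 4, "version": "GEMM"}

# Load fp16 weights, run AWQ calibration, and save 4-bit weights.
model = AutoAWQForCausalLM.from_pretrained(model_path)
tokenizer = AutoTokenizer.from_pretrained(model_path)
model.quantize(tokenizer, quant_config=quant_config)
model.save_quantized(quant_path)
tokenizer.save_pretrained(quant_path)
```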