nanmi
😉 Atypical AI practitioners
Starred repositories

TensorRT Model Optimizer is a unified library of state-of-the-art model optimization techniques such as quantization and sparsity. It compresses deep learning models for downstream deployment frameworks.

Python · 333 stars · 19 forks · Updated Jun 25, 2024

Material for cuda-mode lectures

Jupyter Notebook · 1,902 stars · 186 forks · Updated Jun 13, 2024

RTP-LLM: Alibaba's high-performance LLM inference engine for diverse applications.

C++ · 457 stars · 41 forks · Updated Jul 23, 2024

🏠 Connect the XiaoAi (Xiaomi Mi AI) smart speaker to ChatGPT and Doubao, turning it into your own personal voice assistant.

TypeScript · 6,364 stars · 580 forks · Updated Jul 24, 2024

A collection of benchmarks to measure basic GPU capabilities.

Jupyter Notebook · 180 stars · 29 forks · Updated Jun 21, 2024
C++ · 206 stars · 24 forks · Updated Jul 15, 2024

A llama3 implementation, one matrix multiplication at a time.

Jupyter Notebook · 11,374 stars · 863 forks · Updated May 23, 2024

The code of our paper "InfLLM: Unveiling the Intrinsic Capacity of LLMs for Understanding Extremely Long Sequences with Training-Free Memory"

Python · 241 stars · 21 forks · Updated Apr 20, 2024

Standalone Flash Attention v2 kernel without libtorch dependency

C++ · 79 stars · 12 forks · Updated May 21, 2024

[ICML'24] Data and code for our paper "Training-Free Long-Context Scaling of Large Language Models"

Python · 300 stars · 15 forks · Updated Jul 18, 2024

The official Meta Llama 3 GitHub site

Python · 23,585 stars · 2,552 forks · Updated Jul 23, 2024

A Native-PyTorch Library for LLM Fine-tuning

Python · 3,650 stars · 304 forks · Updated Jul 23, 2024

Awesome LLM compression research papers and tools.

900 stars · 54 forks · Updated Jul 24, 2024

📖A curated list of Awesome LLM Inference Paper with codes, TensorRT-LLM, vLLM, streaming-llm, AWQ, SmoothQuant, WINT8/4, Continuous Batching, FlashAttention, PagedAttention etc.

2,046 stars · 139 forks · Updated Jul 23, 2024

Grok open release

Python · 49,199 stars · 8,310 forks · Updated May 29, 2024

llm-export can export LLM models to ONNX.

Python · 170 stars · 16 forks · Updated Jul 16, 2024

A general 2-8 bit quantization toolbox with GPTQ/AWQ/HQQ, with easy export to ONNX/ONNX Runtime.

Python · 126 stars · 9 forks · Updated Jul 8, 2024

typora-0.11.18 (last free version)

20 stars · 6 forks · Updated Feb 18, 2024

👩🏿‍💻👨🏾‍💻👩🏼‍💻👨🏽‍💻👩🏻‍💻 A list of projects by independent developers in China -- sharing what everyone is building.

36,381 stars · 3,052 forks · Updated Jul 23, 2024

Accessible large language models via k-bit quantization for PyTorch.

Python · 5,796 stars · 589 forks · Updated Jul 23, 2024
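
For reference, a minimal sketch of how bitsandbytes is commonly used through the Hugging Face transformers integration to load a model with 4-bit weights; the model id below is just a small placeholder and the config fields assume a recent transformers release with bitsandbytes installed.

```python
# Hedged sketch: loading a causal LM in 4-bit via the transformers + bitsandbytes
# integration. "facebook/opt-125m" is only a small placeholder model id.
# Requires a CUDA GPU, since bitsandbytes 4-bit kernels run on CUDA.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,                     # store weights in 4-bit
    bnb_4bit_quant_type="nf4",             # NF4 quantization
    bnb_4bit_compute_dtype=torch.float16,  # dequantize to fp16 for matmuls
)

tokenizer = AutoTokenizer.from_pretrained("facebook/opt-125m")
model = AutoModelForCausalLM.from_pretrained(
    "facebook/opt-125m",
    quantization_config=bnb_config,
    device_map="auto",
)

inputs = tokenizer("Quantization keeps memory use low by", return_tensors="pt").to(model.device)
print(tokenizer.decode(model.generate(**inputs, max_new_tokens=20)[0]))
```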

Code associated with the paper "Draft & Verify: Lossless Large Language Model Acceleration via Self-Speculative Decoding"

Jupyter Notebook · 118 stars · 8 forks · Updated May 24, 2024

[ICLR 2024] Efficient Streaming Language Models with Attention Sinks

Python · 6,382 stars · 355 forks · Updated Jul 11, 2024

A high-throughput and memory-efficient inference and serving engine for LLMs

Python · 23,460 stars · 3,347 forks · Updated Jul 24, 2024
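
A minimal sketch of vLLM's offline batch-inference API; the model id is a small placeholder and the sampling values are arbitrary.

```python
# Hedged sketch of vLLM offline batch inference.
# "facebook/opt-125m" is only a small placeholder model id.
from vllm import LLM, SamplingParams

llm = LLM(model="facebook/opt-125m")                 # loads the model and allocates the KV cache
params = SamplingParams(temperature=0.8, max_tokens=64)

outputs = llm.generate(
    ["Paged attention helps because", "Continuous batching means"],
    params,
)
for out in outputs:
    print(out.prompt, "->", out.outputs[0].text)
```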

Efficient AI Inference & Serving

Python · 447 stars · 25 forks · Updated Jan 8, 2024

A CMake toolchain file for iOS/iPadOS, visionOS, macOS, watchOS & tvOS C/C++/Obj-C++ development

CMake · 1,831 stars · 439 forks · Updated Jul 19, 2024
Python · 844 stars · 83 forks · Updated Jul 23, 2024

Examples in the MLX framework

Python · 5,579 stars · 801 forks · Updated Jul 23, 2024
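
The examples in that repo build on MLX's NumPy-like core API; here is a tiny sketch of the basics (lazy arrays forced with mx.eval), independent of any particular example.

```python
# Hedged sketch: MLX arrays are evaluated lazily; mx.eval forces computation
# on the default device. MLX targets Apple silicon.
import mlx.core as mx

a = mx.random.normal((4, 4))
b = mx.random.normal((4, 4))
c = a @ b + 1.0          # builds a lazy computation graph
mx.eval(c)               # materializes the result
print(c.shape, c.dtype)
```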

The NVIDIA® Tools Extension SDK (NVTX) is a C-based Application Programming Interface (API) for annotating events, code ranges, and resources in your applications.

C · 262 stars · 45 forks · Updated Jul 23, 2024
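
NVTX itself is a C API, but the repository also ships Python bindings (the nvtx package); a rough sketch of marking ranges so they show up in an Nsight Systems timeline, assuming the Python package is installed.

```python
# Hedged sketch using the NVTX Python bindings (pip install nvtx).
# Ranges only become visible when the script runs under a profiler,
# e.g. `nsys profile python this_script.py`.
import nvtx

@nvtx.annotate("preprocess", color="blue")
def preprocess(n):
    # Dummy CPU work; the range around it is what the profiler records.
    return [i * i for i in range(n)]

with nvtx.annotate("main loop", color="green"):
    for _ in range(3):
        preprocess(10_000)
```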

The official implementation of the EMNLP 2023 paper LLM-FP4

Python · 145 stars · 7 forks · Updated Dec 15, 2023

AutoAWQ implements the AWQ algorithm for 4-bit quantization with a 2x speedup during inference.

Python · 1,471 stars · 168 forks · Updated Jul 23, 2024
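
A rough sketch of quantizing a model with AutoAWQ; the model id, output path, and quant_config values are illustrative placeholders based on the defaults the project commonly shows.

```python
# Hedged sketch of AWQ 4-bit quantization with AutoAWQ.
# "mistralai/Mistral-7B-v0.1" and "mistral-7b-awq" are placeholder paths.
from awq import AutoAWQForCausalLM
from transformers import AutoTokenizer

model_path = "mistralai/Mistral-7B-v0.1"
quant_path = "mistral-7b-awq"
quant_config = {"zero_point": True, "q_group_size": 128, "w_bit": 4, "version": "GEMM"}

model = AutoAWQForCausalLM.from_pretrained(model_path)
tokenizer = AutoTokenizer.from_pretrained(model_path, trust_remote_code=True)

model.quantize(tokenizer, quant_config=quant_config)  # calibrates and packs 4-bit weights
model.save_quantized(quant_path)
tokenizer.save_pretrained(quant_path)
```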