Code for QuaRot, an end-to-end 4-bit inference of large language models.

Python 257 20 Updated Jul 22, 2024

Fast Hadamard transform in CUDA, with a PyTorch interface

C 94 14 Updated May 24, 2024

This is the official PyTorch implementation of "LLMC: Benchmarking Large Language Model Quantization with a Versatile Compression Toolkit".

Python 234 27 Updated Sep 27, 2024

📖 A curated list of awesome LLM inference papers with code: TensorRT-LLM, vLLM, streaming-llm, AWQ, SmoothQuant, WINT8/4, Continuous Batching, FlashAttention, PagedAttention, etc.

2,563 174 Updated Sep 27, 2024

Building a quick conversation-based search demo with Lepton AI.

TypeScript 7,742 983 Updated Sep 18, 2024

FP16xINT4 LLM inference kernel that can achieve near-ideal ~4x speedups up to medium batch sizes of 16-32 tokens.

Python 573 45 Updated Sep 4, 2024

Official implementation of Half-Quadratic Quantization (HQQ)

Python 670 65 Updated Sep 25, 2024

Built upon Megatron-Deepspeed and HuggingFace Trainer, EasyLLM has reorganized the code logic with a focus on usability. While enhancing usability, it also ensures training efficiency.

Python 38 7 Updated Sep 18, 2024

Code for paper: "QuIP: 2-Bit Quantization of Large Language Models With Guarantees"

Python 341 31 Updated Feb 24, 2024

Code and documents of LongLoRA and LongAlpaca (ICLR 2024 Oral)

Python 2,605 268 Updated Aug 14, 2024

The Triton TensorRT-LLM Backend

Python 664 96 Updated Sep 24, 2024

Awesome LLM compression research papers and tools.

1,085 64 Updated Sep 28, 2024

TensorRT-LLM provides users with an easy-to-use Python API to define Large Language Models (LLMs) and build TensorRT engines that contain state-of-the-art optimizations to perform inference efficie…

C++ 8,299 927 Updated Sep 27, 2024

A more memory-efficient rewrite of the HF transformers implementation of Llama for use with quantized weights.

Python 2,742 215 Updated Sep 30, 2023

Offline quantization tools for deployment.

Python 110 16 Updated Dec 28, 2023

A parser, editor and profiler tool for ONNX models.

Python 383 51 Updated Sep 26, 2024

[MLSys 2024 Best Paper Award] AWQ: Activation-aware Weight Quantization for LLM Compression and Acceleration

Python 2,375 184 Updated Jul 16, 2024

Let us control diffusion models!

Python 29,909 2,700 Updated Feb 25, 2024

A fully compliant RISC-V computer made inside the game Terraria

Rust 3,379 45 Updated Jul 31, 2024

Code for the NeurIPS 2022 paper "Optimal Brain Compression: A Framework for Accurate Post-Training Quantization and Pruning".

Python 96 14 Updated Jul 11, 2023

A high-throughput and memory-efficient inference and serving engine for LLMs

Python 27,590 4,062 Updated Sep 29, 2024

An attempt to answer the age-old interview question "What happens when you type google.com into your browser and press enter?"

39,895 5,545 Updated Aug 19, 2024

Universal LLM Deployment Engine with ML Compilation

Python 18,738 1,528 Updated Sep 28, 2024

An easy-to-use LLM quantization package with user-friendly APIs, based on the GPTQ algorithm.

Python 4,364 466 Updated Sep 28, 2024

A framework for few-shot evaluation of language models.

Python 6,556 1,738 Updated Sep 28, 2024

Development repository for the Triton language and compiler

C++ 12,886 1,559 Updated Sep 29, 2024

Accessible large language models via k-bit quantization for PyTorch.

Python 6,108 610 Updated Sep 29, 2024

QLoRA: Efficient Finetuning of Quantized LLMs

Jupyter Notebook 9,949 817 Updated Jun 10, 2024

An open platform for training, serving, and evaluating large language models. Release repo for Vicuna and Chatbot Arena.

Python 36,553 4,509 Updated Sep 25, 2024

DeepSpeed is a deep learning optimization library that makes distributed training and inference easy, efficient, and effective.

Python 34,916 4,055 Updated Sep 29, 2024