Stars
Code for QuaRot, enabling end-to-end 4-bit inference of large language models.
Fast Hadamard transform in CUDA, with a PyTorch interface
This is the official PyTorch implementation of "LLMC: Benchmarking Large Language Model Quantization with a Versatile Compression Toolkit".
📖 A curated list of awesome LLM inference papers with code: TensorRT-LLM, vLLM, streaming-llm, AWQ, SmoothQuant, WINT8/4, continuous batching, FlashAttention, PagedAttention, etc.
Building a quick conversation-based search demo with Lepton AI.
FP16xINT4 LLM inference kernel that achieves near-ideal ~4x speedups at batch sizes up to 16-32 tokens.
Official implementation of Half-Quadratic Quantization (HQQ)
Built upon Megatron-DeepSpeed and the HuggingFace Trainer, EasyLLM reorganizes the code logic with a focus on usability while preserving training efficiency.
Code for paper: "QuIP: 2-Bit Quantization of Large Language Models With Guarantees"
Code and documents of LongLoRA and LongAlpaca (ICLR 2024 Oral)
The Triton TensorRT-LLM Backend
Awesome LLM compression research papers and tools.
TensorRT-LLM provides users with an easy-to-use Python API to define Large Language Models (LLMs) and build TensorRT engines that contain state-of-the-art optimizations to perform inference efficiently.
A more memory-efficient rewrite of the HF transformers implementation of Llama for use with quantized weights.
A parser, editor and profiler tool for ONNX models.
[MLSys 2024 Best Paper Award] AWQ: Activation-aware Weight Quantization for LLM Compression and Acceleration
A fully compliant RISC-V computer made inside the game Terraria
Code for the NeurIPS 2022 paper "Optimal Brain Compression: A Framework for Accurate Post-Training Quantization and Pruning".
A high-throughput and memory-efficient inference and serving engine for LLMs
An attempt to answer the age-old interview question "What happens when you type google.com into your browser and press enter?"
Universal LLM Deployment Engine with ML Compilation
An easy-to-use LLM quantization package with user-friendly APIs, based on the GPTQ algorithm.
A framework for few-shot evaluation of language models.
Development repository for the Triton language and compiler
Accessible large language models via k-bit quantization for PyTorch.
QLoRA: Efficient Finetuning of Quantized LLMs
An open platform for training, serving, and evaluating large language models. Release repo for Vicuna and Chatbot Arena.
DeepSpeed is a deep learning optimization library that makes distributed training and inference easy, efficient, and effective.