
[CVPR 2023] Towards Any Structural Pruning; LLMs / SAM / Diffusion / Transformers / YOLOv8 / CNNs

Python · 2,633 stars · 329 forks · Updated Sep 30, 2024
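
Torch-Pruning builds a dependency graph over the model so that removing a channel also removes everything structurally tied to it. A minimal sketch of how the library is typically driven, assuming its MetaPruner / MagnitudeImportance workflow (argument names vary across versions, so treat this as illustrative rather than exact):

    # Sketch of Torch-Pruning's high-level pruner; API names are an assumption
    # based on the project's documented workflow, check the repo's README.
    import torch
    import torchvision
    import torch_pruning as tp

    model = torchvision.models.resnet18(weights=None)
    example_inputs = torch.randn(1, 3, 224, 224)   # used to trace the dependency graph

    imp = tp.importance.MagnitudeImportance(p=2)   # rank channels by L2 weight magnitude
    pruner = tp.pruner.MetaPruner(
        model,
        example_inputs,
        importance=imp,
        pruning_ratio=0.5,                         # remove roughly half the channels per layer
        ignored_layers=[model.fc],                 # keep the classifier head intact
    )
    pruner.step()                                  # physically removes channels in place
    print(model(example_inputs).shape)             # the pruned model still runs end to end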

TinyChatEngine: On-Device LLM Inference Library

C++ · 713 stars · 68 forks · Updated Jul 4, 2024

[MLSys 2024 Best Paper Award] AWQ: Activation-aware Weight Quantization for LLM Compression and Acceleration

Python · 2,380 stars · 184 forks · Updated Jul 16, 2024
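
AWQ's core observation is that a small fraction of weight channels matters disproportionately because the activations feeding them are large; scaling those channels up before round-to-nearest quantization (and folding the inverse scale into the activations) protects them. A pure-PyTorch sketch of that scale-then-quantize idea; the actual method searches for the best per-layer scaling exponent, which is omitted here:

    import torch

    def awq_style_quantize(weight, act_mean_abs, alpha=0.5, n_bits=4):
        # weight: (out_features, in_features); act_mean_abs: (in_features,) mean |X|
        # per input channel from calibration data. alpha is a fixed stand-in for the
        # per-layer scale search done in the real AWQ method.
        s = act_mean_abs.clamp(min=1e-5) ** alpha        # boost channels with large activations
        w_scaled = weight * s.unsqueeze(0)
        qmax = 2 ** (n_bits - 1) - 1
        step = (w_scaled.abs().amax(dim=1, keepdim=True) / qmax).clamp(min=1e-8)
        w_q = torch.clamp(torch.round(w_scaled / step), -qmax - 1, qmax) * step
        # undo the scale on the weight side; at inference the 1/s factor is instead
        # folded into the preceding layer so the computation stays equivalent
        return w_q / s.unsqueeze(0)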

[ICLR 2022] "As-ViT: Auto-scaling Vision Transformers without Training" by Wuyang Chen, Wei Huang, Xianzhi Du, Xiaodan Song, Zhangyang Wang, Denny Zhou

Python · 76 stars · 4 forks · Updated Feb 21, 2022

[ICLR 2022] The Unreasonable Effectiveness of Random Pruning: Return of the Most Naive Baseline for Sparse Training by Shiwei Liu, Tianlong Chen, Xiaohan Chen, Li Shen, Decebal Constantin Mocanu, Z…

Python · 71 stars · 10 forks · Updated Jan 9, 2023

[ICML 2023] UPop: Unified and Progressive Pruning for Compressing Vision-Language Transformers.

Python · 95 stars · 5 forks · Updated Nov 4, 2023

Flash-LLM: Enabling Cost-Effective and Highly-Efficient Large Generative Model Inference with Unstructured Sparsity

Cuda · 171 stars · 14 forks · Updated Sep 24, 2023

Implementation of ICML 23 Paper: Specializing Smaller Language Models towards Multi-Step Reasoning.

Jupyter Notebook · 121 stars · 3 forks · Updated Jun 18, 2023

OTOv1-v3, NeurIPS, ICLR, TMLR, DNN Training, Compression, Structured Pruning, Erasing Operators, CNN, Diffusion, LLM

Python · 288 stars · 46 forks · Updated Sep 16, 2024

Medusa: Simple Framework for Accelerating LLM Generation with Multiple Decoding Heads

Jupyter Notebook · 2,227 stars · 153 forks · Updated Jun 25, 2024
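
Medusa attaches extra decoding heads to a base model so that, from a single hidden state, head k proposes the token k positions ahead; the base model then verifies the proposed continuations in one forward pass. A rough sketch of what such heads look like (the layer layout is a simplification, not the repo's exact architecture):

    import torch
    import torch.nn as nn

    class MedusaStyleHeads(nn.Module):
        def __init__(self, hidden_size, vocab_size, num_heads=4):
            super().__init__()
            # one small MLP head per lookahead position
            self.heads = nn.ModuleList(
                nn.Sequential(
                    nn.Linear(hidden_size, hidden_size),
                    nn.SiLU(),
                    nn.Linear(hidden_size, vocab_size),
                )
                for _ in range(num_heads)
            )

        def forward(self, last_hidden):
            # last_hidden: (batch, hidden_size) from the base model's final layer
            # returns (batch, num_heads, vocab_size): logits for positions t+1, t+2, ...
            return torch.stack([head(last_hidden) for head in self.heads], dim=1)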

Python · 68 stars · 9 forks · Updated Dec 1, 2023

[ICLR 2024] Efficient Streaming Language Models with Attention Sinks

Python · 6,583 stars · 363 forks · Updated Jul 11, 2024
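
The attention-sink finding behind this repo is that the first few tokens absorb a large share of attention mass, so a cache that keeps those initial "sink" tokens plus a recent sliding window lets the model stream over very long inputs with a bounded KV cache. A minimal illustration of that eviction rule (not the repo's actual cache implementation):

    import torch

    def evict_kv(keys, values, num_sink=4, window=1020):
        # keys, values: (batch, heads, seq_len, head_dim)
        # keep the first num_sink tokens (attention sinks) and the last `window` tokens
        seq_len = keys.shape[2]
        if seq_len <= num_sink + window:
            return keys, values
        keep = torch.cat([
            torch.arange(num_sink, device=keys.device),
            torch.arange(seq_len - window, seq_len, device=keys.device),
        ])
        return keys[:, :, keep, :], values[:, :, keep, :]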

PyTorch implementation of paper "Response Length Perception and Sequence Scheduling: An LLM-Empowered LLM Inference Pipeline".

Python · 71 stars · 15 forks · Updated May 23, 2023

[ICML 2024] CrossGET: Cross-Guided Ensemble of Tokens for Accelerating Vision-Language Transformers.

24 stars · Updated Oct 4, 2023

[EMNLP 2023 Industry Track] A simple prompting approach that enables the LLMs to run inference in batches.

Python · 65 stars · 5 forks · Updated Mar 8, 2024
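
The batch prompting idea is to pack several independent questions into one prompt and ask the model to answer them in a fixed, parseable format, amortizing the instruction and few-shot context over the whole batch. A toy sketch of such a prompt builder (the template is illustrative, not the paper's exact format):

    def build_batch_prompt(questions):
        # Pack N questions into one prompt so a single LLM call answers all of them.
        lines = [
            "Answer each question below.",
            "Reply with exactly one line per question, formatted as 'A[i]: <answer>'.",
            "",
        ]
        for i, q in enumerate(questions, start=1):
            lines.append(f"Q[{i}]: {q}")
        return "\n".join(lines)

    prompt = build_batch_prompt([
        "What is 17 + 25?",
        "What is the capital of France?",
        "Is 91 a prime number?",
    ])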

Code for "Lion: Adversarial Distillation of Proprietary Large Language Models (EMNLP 2023)"

Python · 198 stars · 19 forks · Updated Feb 11, 2024

Official PyTorch implementation of "Outlier Weighed Layerwise Sparsity (OWL): A Missing Secret Sauce for Pruning LLMs to High Sparsity"

Python · 49 stars · 7 forks · Updated Jun 26, 2024

A simple and effective LLM pruning approach.

Python · 623 stars · 81 forks · Updated Aug 9, 2024
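
This description matches Wanda (pruning by weights and activations), which scores each weight by its magnitude times the L2 norm of the corresponding input activation and drops the lowest-scoring weights per output row, with no retraining. The rule fits in a few lines of PyTorch (a sketch of the scoring, not the repo's code):

    import torch

    def wanda_prune(weight, act_norm, sparsity=0.5):
        # weight: (out_features, in_features) of a Linear layer
        # act_norm: (in_features,) L2 norm of each input feature over calibration data
        score = weight.abs() * act_norm.unsqueeze(0)            # |W_ij| * ||X_j||_2
        k = int(weight.shape[1] * sparsity)                     # weights to drop per row
        _, idx = torch.topk(score, k, dim=1, largest=False)     # lowest-scoring entries
        mask = torch.ones_like(weight, dtype=torch.bool)
        mask.scatter_(1, idx, False)
        return weight * mask                                    # zero out the pruned weights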

⚡ Build your chatbot within minutes on your favorite device; offer SOTA compression techniques for LLMs; run LLMs efficiently on Intel Platforms⚡

Python · 2,121 stars · 209 forks · Updated Sep 26, 2024

[ICML 2024] SqueezeLLM: Dense-and-Sparse Quantization

Python · 632 stars · 42 forks · Updated Aug 13, 2024
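
SqueezeLLM's dense-and-sparse decomposition pulls a tiny fraction of outlier weights into a sparse full-precision matrix and quantizes the remaining dense part; the paper uses sensitivity-weighted non-uniform codebooks for the dense part, for which a plain uniform quantizer stands in below. A rough sketch of the split:

    import torch

    def dense_and_sparse(weight, outlier_frac=0.005, n_bits=4):
        # Keep the largest-magnitude ~0.5% of weights in full precision (sparse part)
        # and quantize the rest (dense part).
        k = max(1, int(weight.numel() * outlier_frac))
        thresh = weight.abs().flatten().topk(k).values.min()
        sparse = torch.where(weight.abs() >= thresh, weight, torch.zeros_like(weight))
        dense = weight - sparse
        qmax = 2 ** (n_bits - 1) - 1
        step = (dense.abs().amax(dim=1, keepdim=True) / qmax).clamp(min=1e-8)
        dense_q = torch.clamp(torch.round(dense / step), -qmax - 1, qmax) * step
        return dense_q + sparse      # reconstruction = quantized dense + full-precision outliers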

Code for the ICLR 2023 paper "GPTQ: Accurate Post-training Quantization of Generative Pretrained Transformers".

Python · 1,889 stars · 151 forks · Updated Mar 27, 2024

Accessible large language models via k-bit quantization for PyTorch.

Python · 6,116 stars · 611 forks · Updated Oct 1, 2024
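
bitsandbytes is most often used through its Hugging Face transformers integration, where a BitsAndBytesConfig loads a model's linear layers in 4-bit NF4 while computing in bf16. A typical loading snippet (the model name is only a placeholder):

    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

    model_id = "meta-llama/Llama-2-7b-hf"            # placeholder; any causal LM works

    bnb_config = BitsAndBytesConfig(
        load_in_4bit=True,
        bnb_4bit_quant_type="nf4",                   # NormalFloat4 weight format
        bnb_4bit_compute_dtype=torch.bfloat16,       # matmuls run in bf16
    )

    tokenizer = AutoTokenizer.from_pretrained(model_id)
    model = AutoModelForCausalLM.from_pretrained(model_id, quantization_config=bnb_config)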

[TMLR 2024] Efficient Large Language Models: A Survey

970 stars · 82 forks · Updated Sep 28, 2024

Awesome LLM compression research papers and tools.

1,088 stars · 66 forks · Updated Sep 30, 2024