Stars
Implementation of Kangaroo: Lossless Self-Speculative Decoding via Double Early Exiting
[ACL 2024] Not All Experts are Equal: Efficient Expert Pruning and Skipping for Mixture-of-Experts Large Language Models
Awesome LLM compression research papers and tools.
A collection of AWESOME things about mixture-of-experts
PyTorch-UVM on super-large language models.
Library for faster pinned CPU <-> GPU transfer in PyTorch
PyTorch library for cost-effective, fast and easy serving of MoE models.
Official implementations for paper: DreamTalk: When Expressive Talking Head Generation Meets Diffusion Probabilistic Models
Fast Inference of MoE Models with CPU-GPU Orchestration
Run Mixtral-8x7B models in Colab or consumer desktops
A WebUI for Efficient Fine-Tuning of 100+ LLMs (ACL 2024)
Medusa: Simple Framework for Accelerating LLM Generation with Multiple Decoding Heads
Easy control for Key-Value Constrained Generative LLM Inference (https://arxiv.org/abs/2402.06262)
Evaluation Code repository for the paper "ModuLoRA: Finetuning 3-Bit LLMs on Consumer GPUs by Integrating with Modular Quantizers". (2023 TMLR Submission)
OpenAI-style API for open large language models, letting you use LLMs just like ChatGPT. Supports LLaMA, LLaMA-2, BLOOM, Falcon, Baichuan, Qwen, Xverse, SqlCoder, CodeLLaMA, ChatGLM, ChatGLM2, ChatGLM3, etc.…
Code and documents of LongLoRA and LongAlpaca (ICLR 2024 Oral)
LongQLoRA: Extend Context Length of LLMs Efficiently