A collection of available inference and serving solutions for LLMs; a minimal usage sketch with one of them (vLLM) follows the table.
Name | Org | Description |
---|---|---|
vLLM | UC Berkeley | A high-throughput and memory-efficient inference and serving engine for LLMs |
Text-Generation-Inference | Hugging Face 🤗 | Large Language Model Text Generation Inference |
llm-engine | Scale AI | Scale's open-source engine for fine-tuning and serving large language models |
DeepSpeed | Microsoft | DeepSpeed is a deep learning optimization library that makes distributed training and inference easy, efficient, and effective |
OpenLLM | BentoML | Operating LLMs in production |
LMDeploy | InternLM Team | LMDeploy is a toolkit for compressing, deploying, and serving LLMs |
FlexFlow | CMU, Stanford, UCSD | A distributed deep learning framework; FlexFlow Serve provides low-latency, high-performance LLM serving |
CTranslate2 | OpenNMT | Fast inference engine for Transformer models |
FastChat | lm-sys | An open platform for training, serving, and evaluating large language models. Release repo for Vicuna and Chatbot Arena. |
Triton-Inference-Server | Nvidia | The Triton Inference Server provides an optimized cloud and edge inferencing solution. |
Lepton.AI | lepton.ai | A Pythonic framework to simplify AI service building |
ScaleLLM | Vectorch | A high-performance inference system for large language models, designed for production environments |
LoRAX | Predibase | Serve hundreds of fine-tuned LLMs in production for the cost of one |
TensorRT-LLM | Nvidia | TensorRT-LLM provides users with an easy-to-use Python API to define Large Language Models (LLMs) and build TensorRT engines |
mistral.rs | mistral.rs | Blazingly fast LLM inference. |
NanoFlow | NanoFlow | A throughput-oriented high-performance serving framework for LLMs |
LMCache | LMCache | Fast and cost-efficient LLM inference via KV-cache reuse |
LitServe | Lightning AI | Lightning-fast serving engine for AI models. Flexible. Easy. Enterprise-scale. |
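
For orientation, here is a minimal offline batch-inference sketch using vLLM's Python API (`LLM` and `SamplingParams`). The model name `facebook/opt-125m`, the prompts, and the sampling values are placeholder choices for illustration, not recommendations; the other engines in the table each expose their own APIs.

```python
# Minimal offline batch inference with vLLM (install with: pip install vllm).
# The model and sampling settings below are illustrative placeholders.
from vllm import LLM, SamplingParams

prompts = [
    "Explain KV caching in one sentence.",
    "What is continuous batching?",
]

# Sampling configuration: temperature, top_p, and max_tokens are standard SamplingParams fields.
sampling_params = SamplingParams(temperature=0.8, top_p=0.95, max_tokens=64)

# Load a small model; any Hugging Face model supported by vLLM can be used here.
llm = LLM(model="facebook/opt-125m")

# generate() runs the prompts as a batch and returns one result per prompt.
outputs = llm.generate(prompts, sampling_params)
for output in outputs:
    print(output.prompt, "->", output.outputs[0].text)
```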