LoRAX: Dynamic loading and optimized inference of LoRA adapter models. #505

Open
irthomasthomas opened this issue Feb 4, 2024 · 0 comments
Labels
  • AI-Chatbots: Topics related to advanced chatbot platforms integrating multiple AI models
  • Algorithms: Sorting, Learning or Classifying. All algorithms go here.
  • finetuning: Tools for finetuning of LLMs e.g. SFT or RLHF
  • llm-applications: Topics related to practical applications of Large Language Models in various fields
  • llm-inference-engines: Software to run inference on large language models
  • llm-serving-optimisations: Tips, tricks and tools to speedup inference of large language models
  • MachineLearning: ML Models, Training and Inference
  • PEFT: Parameter Efficient Fine Tuning of LLMs e.g. LoRA Low Rank Adapter

@irthomasthomas (Owner)

LoRAX Docs

Multi-LoRA inference server that scales to 1000s of fine-tuned LLMs

📖 What is LoRAX?

LoRAX (LoRA eXchange) is a framework that allows users to serve thousands of fine-tuned models on a single GPU, dramatically reducing the cost of serving without compromising on throughput or latency.

🌳 Features

  • 🚅 Dynamic Adapter Loading: include any fine-tuned LoRA adapter in your request and it will be loaded just-in-time, without blocking concurrent requests (see the sketch after this list).
  • 🏋️‍♀️ Heterogeneous Continuous Batching: packs requests for different adapters together into the same batch, keeping latency and throughput nearly constant with the number of concurrent adapters.
  • 🧁 Adapter Exchange Scheduling: asynchronously prefetches and offloads adapters between GPU and CPU memory, and schedules request batching to optimize the aggregate throughput of the system.
  • 👬 Optimized Inference: high-throughput and low-latency optimizations including tensor parallelism, pre-compiled CUDA kernels (flash-attention, paged attention, SGMV), quantization, and token streaming.
  • 🚢 Ready for Production: prebuilt Docker images, Helm charts for Kubernetes, Prometheus metrics, and distributed tracing with OpenTelemetry. OpenAI-compatible API supporting multi-turn chat conversations (see the chat example below). Private adapters through per-request tenant isolation.
  • 🤯 Free for Commercial Use: Apache 2.0 License. Enough said 😎.
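
To make dynamic adapter loading concrete, here is a minimal sketch that sends two requests to LoRAX's REST `/generate` endpoint, one against the base model and one routed through a LoRA adapter via the `adapter_id` parameter. The server address (the default `127.0.0.1:8080`) and the adapter id are assumptions; substitute any adapter the server can resolve (e.g. from the HuggingFace Hub).

```python
# Minimal sketch: per-request adapter selection via LoRAX's REST API.
# Assumes a LoRAX server on 127.0.0.1:8080; the adapter id is a placeholder.
import requests

LORAX_URL = "http://127.0.0.1:8080/generate"

def generate(prompt: str, adapter_id: str | None = None) -> str:
    """Send one generation request, optionally routed through a LoRA adapter."""
    payload = {
        "inputs": prompt,
        "parameters": {"max_new_tokens": 64},
    }
    if adapter_id is not None:
        # The adapter is fetched and loaded just-in-time on first use.
        payload["parameters"]["adapter_id"] = adapter_id
    resp = requests.post(LORAX_URL, json=payload, timeout=120)
    resp.raise_for_status()
    return resp.json()["generated_text"]

# Requests naming different adapters can be in flight concurrently;
# neither blocks the other while its adapter loads.
print(generate("[INST] What is LoRAX? [/INST]"))  # base model
print(generate("[INST] What is LoRAX? [/INST]",
               adapter_id="some-org/some-lora-adapter"))  # placeholder id
```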

URL: https://predibase.github.io/lorax/?h=cpu#features
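
Since the feature list mentions an OpenAI-compatible API with multi-turn chat, here is a hedged sketch using the `openai` Python client pointed at a LoRAX server. The base URL, port, and adapter id are assumptions; the `model` field is used to select the adapter to serve the chat.

```python
# Sketch of the OpenAI-compatible chat endpoint; assumes a LoRAX server on
# 127.0.0.1:8080 and uses a placeholder adapter id as the model name.
from openai import OpenAI

client = OpenAI(
    api_key="EMPTY",  # assumption: no real key needed for a local server
    base_url="http://127.0.0.1:8080/v1",
)

# Setting `model` to an adapter id routes the conversation through that
# adapter; it is loaded on first use like any other request.
response = client.chat.completions.create(
    model="some-org/some-lora-adapter",  # placeholder adapter id
    messages=[
        {"role": "system", "content": "You are a concise assistant."},
        {"role": "user", "content": "Summarize what LoRAX does."},
    ],
    max_tokens=64,
)
print(response.choices[0].message.content)
```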

Suggested labels

{ "label-name": "LoRA Framework", "description": "A powerful framework for serving fine-tuned models on a single GPU efficiently.", "repo": "llm-inference-engines", "confidence": 98.7 }
