LoRAX (LoRA eXchange) is a framework that allows users to serve thousands of fine-tuned models on a single GPU, dramatically reducing the cost of serving without compromising on throughput or latency.
- 🚅 Dynamic Adapter Loading: include any fine-tuned LoRA adapter from HuggingFace, Predibase, or any filesystem in your request, it will be loaded just-in-time without blocking concurrent requests. Merge adapters per request to instantly create powerful ensembles.
- 🏋️♀️ Heterogeneous Continuous Batching: packs requests for different adapters together into the same batch, keeping latency and throughput nearly constant with the number of concurrent adapters.
- 🧁 Adapter Exchang