LoRAX: Dynamic loading and optimized inference of LoRA adapter models. #505

Open
irthomasthomas opened this issue Feb 4, 2024 · 0 comments
Labels
  • AI-Chatbots: Topics related to advanced chatbot platforms integrating multiple AI models
  • Algorithms: Sorting, Learning or Classifying. All algorithms go here.
  • finetuning: Tools for finetuning of LLMs e.g. SFT or RLHF
  • llm-applications: Topics related to practical applications of Large Language Models in various fields
  • llm-inference-engines: Software to run inference on large language models
  • llm-serving-optimisations: Tips, tricks and tools to speedup inference of large language models
  • MachineLearning: ML Models, Training and Inference
  • PEFT: Parameter Efficient Fine Tuning of LLMs e.g. LoRA Low Rank Adapter

@irthomasthomas (Owner)

LoRAX Docs

Multi-LoRA inference server that scales to 1000s of fine-tuned LLMs

📖 What is LoRAX?

LoRAX (LoRA eXchange) is a framework that allows users to serve thousands of fine-tuned models on a single GPU, dramatically reducing the cost of serving without compromising on throughput or latency.

🌳 Features

  • 🚅 Dynamic Adapter Loading: include any fine-tuned LoRA adapter in your request and it will be loaded just-in-time, without blocking concurrent requests (see the sketch after this list).
  • 🏋️‍♀️ Heterogeneous Continuous Batching: packs requests for different adapters together into the same batch, keeping latency and throughput nearly constant with the number of concurrent adapters.
  • 🧁 Adapter Exchange Scheduling: asynchronously prefetches and offloads adapters between GPU and CPU memory, and schedules request batching to optimize the aggregate throughput of the system.
  • 👬 Optimized Inference: high-throughput and low-latency optimizations including tensor parallelism, pre-compiled CUDA kernels (flash-attention, paged attention, SGMV), quantization, and token streaming.
  • 🚢 Ready for Production: prebuilt Docker images, Helm charts for Kubernetes, Prometheus metrics, and distributed tracing with OpenTelemetry. OpenAI-compatible API supporting multi-turn chat conversations (see the chat example below). Private adapters through per-request tenant isolation.
  • 🤯 Free for Commercial Use: Apache 2.0 License. Enough said 😎.
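
To make dynamic adapter loading concrete, here is a minimal sketch that sends two requests to LoRAX's REST `/generate` endpoint, one against the base model and one routed through a LoRA adapter via the `adapter_id` parameter. The server address (the default `127.0.0.1:8080`) and the adapter id are assumptions; substitute any adapter the server can resolve (e.g. from the HuggingFace Hub).

```python
# Minimal sketch: per-request adapter selection via LoRAX's REST API.
# Assumes a LoRAX server on 127.0.0.1:8080; the adapter id is a placeholder.
import requests

LORAX_URL = "http://127.0.0.1:8080/generate"

def generate(prompt: str, adapter_id: str | None = None) -> str:
    """Send one generation request, optionally routed through a LoRA adapter."""
    payload = {
        "inputs": prompt,
        "parameters": {"max_new_tokens": 64},
    }
    if adapter_id is not None:
        # The adapter is fetched and loaded just-in-time on first use.
        payload["parameters"]["adapter_id"] = adapter_id
    resp = requests.post(LORAX_URL, json=payload, timeout=120)
    resp.raise_for_status()
    return resp.json()["generated_text"]

# Requests naming different adapters can be in flight concurrently;
# neither blocks the other while its adapter loads.
print(generate("[INST] What is LoRAX? [/INST]"))  # base model
print(generate("[INST] What is LoRAX? [/INST]",
               adapter_id="some-org/some-lora-adapter"))  # placeholder id
```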

URL: https://predibase.github.io/lorax/?h=cpu#features
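
Since the feature list mentions an OpenAI-compatible API with multi-turn chat, here is a hedged sketch using the `openai` Python client pointed at a LoRAX server. The base URL, port, and adapter id are assumptions; the `model` field is used to select the adapter to serve the chat.

```python
# Sketch of the OpenAI-compatible chat endpoint; assumes a LoRAX server on
# 127.0.0.1:8080 and uses a placeholder adapter id as the model name.
from openai import OpenAI

client = OpenAI(
    api_key="EMPTY",  # assumption: no real key needed for a local server
    base_url="http://127.0.0.1:8080/v1",
)

# Setting `model` to an adapter id routes the conversation through that
# adapter; it is loaded on first use like any other request.
response = client.chat.completions.create(
    model="some-org/some-lora-adapter",  # placeholder adapter id
    messages=[
        {"role": "system", "content": "You are a concise assistant."},
        {"role": "user", "content": "Summarize what LoRAX does."},
    ],
    max_tokens=64,
)
print(response.choices[0].message.content)
```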

Suggested labels

{ "label-name": "LoRA Framework", "description": "A powerful framework for serving fine-tuned models on a single GPU efficiently.", "repo": "llm-inference-engines", "confidence": 98.7 }
