Infinity is a high-throughput, low-latency REST API for serving vector embeddings, supporting all sentence-transformer models and frameworks. Infinity is developed under the MIT License and powers inference behind Gradient.ai.
- Deploy any model from MTEB: deploy the model you know from SentenceTransformers
- Fast inference backends: The inference server is built on top of torch, optimum (ONNX/TensorRT) and CTranslate2, using FlashAttention to get the most out of your NVIDIA CUDA, AMD ROCm, CPU, AWS INF2 or Apple MPS accelerator.
- Dynamic batching: New embedding requests are queued while the GPU is busy with the previous ones. New requests are squeezed into your device as soon as it is ready.
- Correct and tested implementation: Unit and end-to-end tested, so embeddings served via Infinity are correct. Lets API users create embeddings till infinity and beyond.
- Easy to use: The API is built on top of FastAPI and fully documented via Swagger. It is aligned with OpenAI's Embedding specs. View the docs at https://michaelfeil.eu/infinity to get started.
In this demo, sentence-transformers/all-MiniLM-L6-v2 is deployed with batch-size=2. After initialization, three requests (payloads of 1, 1, and 5 sentences) are sent via cURL from a second terminal.
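For a Python equivalent of the demo's cURL calls, here is a minimal sketch using the `requests` library. The `/embeddings` route and port 7997 are assumptions taken from the OpenAI-aligned spec above and the Docker example below; adjust them to your deployment.

```python
# Hedged sketch: replaying the demo's three requests (payloads of 1, 1, and 5
# sentences) against a locally running Infinity server. The /embeddings path
# and port 7997 are assumptions; the sentences are placeholders.
import requests

URL = "http://localhost:7997/embeddings"
payloads = [
    ["A single sentence."],
    ["Another single sentence."],
    ["One.", "Two.", "Three.", "Four.", "Five."],
]

for sentences in payloads:
    response = requests.post(
        URL,
        json={"model": "sentence-transformers/all-MiniLM-L6-v2", "input": sentences},
        timeout=30,
    )
    response.raise_for_status()
    data = response.json()["data"]  # OpenAI-style list of embedding objects
    print(f"received {len(data)} embeddings, dim={len(data[0]['embedding'])}")
```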
- [2024/03] infinity now has experimental int8 (CPU/CUDA) and fp8 (H100/MI300) support
- [2024/03] Docs are online: https://michaelfeil.eu/infinity/latest/
- [2024/02] Community meetup at the Run:AI Infra Club
- [2024/01] TensorRT / ONNX inference
pip install infinity-emb[all]
After your pip install, with your venv active, you can run the CLI directly.
infinity_emb --model-name-or-path BAAI/bge-small-en-v1.5
Check the --help command to get a description of all parameters.
infinity_emb --help
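Because the REST API is aligned with OpenAI's Embedding specs, the official `openai` Python client can be pointed at a locally running server. The following is a hedged sketch: the base URL, port, and dummy API key are assumptions, and exact compatibility may vary by version.

```python
# Hedged sketch: using the openai client against an Infinity server started
# with the CLI command above. base_url/port and the dummy api_key are assumptions.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:7997", api_key="no-key-required")
result = client.embeddings.create(
    model="BAAI/bge-small-en-v1.5",
    input=["Embed this sentence via Infinity."],
)
print(len(result.data[0].embedding))  # dimensionality of the returned embedding
```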
Instead of installing the CLI via pip, you can also use Docker to run Infinity.
Make sure you mount your accelerator, i.e. install nvidia-docker and activate it with --gpus all.
port=7997
docker run -it --gpus all -p $port:$port michaelf34/infinity:latest --model-name-or-path BAAI/bge-small-en-v1.5 --port $port
The download path at runtime can be controlled via the environment variable HF_HOME.
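If you use the Python API described next, the same cache override can be applied in-process. A minimal sketch follows; the cache directory is illustrative, and HF_HOME should be set before the Hugging Face libraries are imported.

```python
# Hedged sketch: redirecting the Hugging Face download cache. The directory
# is illustrative; set HF_HOME before importing libraries that read it.
import os

os.environ["HF_HOME"] = "/data/hf-cache"  # illustrative cache directory

# Imported only after HF_HOME is set, so model downloads land in the custom cache.
from infinity_emb import AsyncEmbeddingEngine, EngineArgs
```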
Instead of the CLI & REST API, you can interface directly with the Python API. This gives you the most flexibility. The Python API builds on asyncio with its async/await features to allow concurrent processing of requests.
import asyncio
from infinity_emb import AsyncEmbeddingEngine, EngineArgs
sentences = ["Embed this is sentence via Infinity.", "Paris is in France."]
engine = AsyncEmbeddingEngine.from_args(
    EngineArgs(model_name_or_path="BAAI/bge-small-en-v1.5", engine="torch")
)
async def main():
    async with engine:  # engine starts with engine.astart()