Infinity ♾️


Infinity is a high-throughput, low-latency REST API for serving vector embeddings, supporting all sentence-transformer models and frameworks. Infinity is developed under the MIT License and powers inference behind Gradient.ai.

Why Infinity

  • Deploy any model from MTEB: deploy any embedding model you know from SentenceTransformers.
  • Fast inference backends: the inference server is built on top of torch, optimum (ONNX/TensorRT) and CTranslate2, using FlashAttention to get the most out of your NVIDIA CUDA, AMD ROCm, CPU, AWS INF2 or Apple MPS accelerator.
  • Dynamic batching: new embedding requests are queued while the GPU is busy with the previous ones, then squeezed into your device as soon as it is ready.
  • Correct and tested implementation: unit and end-to-end tested, so embeddings served via Infinity come out correct. Lets API users create embeddings till infinity and beyond.
  • Easy to use: the API is built on top of FastAPI and fully documented via Swagger. It is aligned with OpenAI's Embedding specs; see the sketch after this list. View the docs at https://michaelfeil.eu/infinity on how to get started.
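Because the endpoints are aligned with OpenAI's Embedding specs, the official openai Python client can be pointed at a running Infinity server. A minimal sketch; the base_url (default port 7997) and the placeholder api_key are assumptions for a local deployment, not values prescribed by this README:

from openai import OpenAI

# point the client at a local Infinity server; base_url is an assumption
client = OpenAI(base_url="http://localhost:7997", api_key="not-needed")

res = client.embeddings.create(
    model="BAAI/bge-small-en-v1.5",
    input=["Paris is in France."],
)
print(len(res.data[0].embedding))  # dimensionality of the returned vector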

Infinity demo

In this demo, sentence-transformers/all-MiniLM-L6-v2 is deployed at batch-size=2. After initialization, 3 requests (payloads of 1, 1 and 5 sentences) are sent via cURL from a second terminal.


Getting started

Launch the CLI via pip install

pip install infinity-emb[all]

After your pip install, with your venv active, you can run the CLI directly.

infinity_emb --model-name-or-path BAAI/bge-small-en-v1.5

Run the --help command to get a description of all parameters.

infinity_emb --help
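Once the server is up, you can query it from any HTTP client. A sketch using Python requests; the /embeddings route and the response shape follow the OpenAI-aligned spec, and the port assumes the default 7997 used in the docker example below:

import requests

# one embedding request against a local Infinity server (port 7997 assumed)
resp = requests.post(
    "http://localhost:7997/embeddings",
    json={"model": "BAAI/bge-small-en-v1.5", "input": ["Hello, Infinity!"]},
)
resp.raise_for_status()
embedding = resp.json()["data"][0]["embedding"]
print(len(embedding))  # length of the embedding vector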

Launch the CLI using a pre-built docker container (recommended)

Instead of installing the CLI via pip, you may also use Docker to run Infinity. Make sure your accelerator is available inside the container, e.g. install nvidia-docker and enable it with --gpus all.

port=7997
docker run -it --gpus all -p $port:$port michaelf34/infinity:latest --model-name-or-path BAAI/bge-small-en-v1.5 --port $port

The download path at runtime can be controlled via the environment variable HF_HOME.
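To persist downloaded models across container restarts, HF_HOME can be combined with a volume mount. A sketch; the /app/.cache path inside the container is an assumption, pick any directory and point HF_HOME at it:

port=7997
volume=$PWD/data
# mount a host directory and direct Hugging Face downloads into it
docker run -it --gpus all -v $volume:/app/.cache -e HF_HOME=/app/.cache \
  -p $port:$port michaelf34/infinity:latest \
  --model-name-or-path BAAI/bge-small-en-v1.5 --port $port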

Launch it via the Python API

Instead of the CLI and REST API, you can interface directly with the Python API, which gives you the most flexibility. The Python API builds on asyncio with its async/await features to allow concurrent processing of requests.

import asyncio
from infinity_emb import AsyncEmbeddingEngine, EngineArgs

sentences = ["Embed this is sentence via Infinity.", "Paris is in France."]
engine = AsyncEmbeddingEngine.from_args(
    EngineArgs(model_name_or_path="BAAI/bge-small-en-v1.5", engine="torch")
)

async def main():
    async with engine:  # engine starts with engine.astart()
        embeddings, usage = await engine.embed(sentences=sentences)
    # engine stops with engine.astop()

asyncio.run(main())
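The async engine also allows several embed calls to be awaited concurrently, which is where dynamic batching pays off. A minimal sketch, reusing the engine and imports from the snippet above (the name concurrent_main is illustrative only):

async def concurrent_main():
    async with engine:
        # both requests are in flight together; the engine batches them dynamically
        (emb_a, _), (emb_b, _) = await asyncio.gather(
            engine.embed(sentences=["First request."]),
            engine.embed(sentences=["Second request."]),
        )

asyncio.run(concurrent_main())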