Infinity is a high-throughput, low-latency REST API for serving vector embeddings, supporting all sentence-transformer models and frameworks. Infinity is developed under the MIT License and powers inference behind Gradient.ai.
- Deploy any model from MTEB: deploy the model you know from SentenceTransformers
- Fast inference backends: The inference server is built on top of torch, optimum (ONNX/TensorRT) and CTranslate2, using FlashAttention to get the most out of your NVIDIA CUDA, AMD ROCm, CPU, AWS INF2 or Apple MPS accelerator.
- Dynamic batching: New embedding requests are queued while the GPU is busy with the previous ones. New requests are squeezed into your device as soon as it is ready.
- Correct and tested implementation: Unit and end-to-end tested, so embeddings served via Infinity are correct. Lets API users create embeddings till infinity and beyond.
- Easy to use: The API is built on top of FastAPI and fully documented via Swagger. It is aligned with OpenAI's Embedding specs. View the docs at https://michaelfeil.eu/infinity to get started.
In this demo, sentence-transformers/all-MiniLM-L6-v2 is deployed with batch-size=2. After initialization, three requests (payloads of 1, 1, and 5 sentences) are sent via cURL from a second terminal.
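For a Python equivalent of the demo's cURL calls, here is a minimal sketch using the `requests` library. The `/embeddings` route and port 7997 are assumptions taken from the OpenAI-aligned spec above and the Docker example below; adjust them to your deployment.

```python
# Hedged sketch: replaying the demo's three requests (payloads of 1, 1, and 5
# sentences) against a locally running Infinity server. The /embeddings path
# and port 7997 are assumptions; the sentences are placeholders.
import requests

URL = "http://localhost:7997/embeddings"
payloads = [
    ["A single sentence."],
    ["Another single sentence."],
    ["One.", "Two.", "Three.", "Four.", "Five."],
]

for sentences in payloads:
    response = requests.post(
        URL,
        json={"model": "sentence-transformers/all-MiniLM-L6-v2", "input": sentences},
        timeout=30,
    )
    response.raise_for_status()
    data = response.json()["data"]  # OpenAI-style list of embedding objects
    print(f"received {len(data)} embeddings, dim={len(data[0]['embedding'])}")
```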
- [2024/03] infinity now has experimental int8 (CPU/CUDA) and fp8 (H100/MI300) support
- [2024/03] Docs are online: https://michaelfeil.eu/infinity/latest/
- [2024/02] Community meetup at the Run:AI Infra Club
- [2024/01] TensorRT / ONNX inference
pip install infinity-emb[all]
After your pip install, with your venv active, you can run the CLI directly.
infinity_emb --model-name-or-path BAAI/bge-small-en-v1.5
Check the --help command to get a description of all parameters.
infinity_emb --help
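Because the REST API is aligned with OpenAI's Embedding specs, the official `openai` Python client can be pointed at a locally running server. The following is a hedged sketch: the base URL, port, and dummy API key are assumptions, and exact compatibility may vary by version.

```python
# Hedged sketch: using the openai client against an Infinity server started
# with the CLI command above. base_url/port and the dummy api_key are assumptions.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:7997", api_key="no-key-required")
result = client.embeddings.create(
    model="BAAI/bge-small-en-v1.5",
    input=["Embed this sentence via Infinity."],
)
print(len(result.data[0].embedding))  # dimensionality of the returned embedding
```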
Instead of installing the CLI via pip, you can also use Docker to run Infinity.
Make sure you mount your accelerator, i.e. install nvidia-docker and activate it with --gpus all.
port=7997
docker run -it --gpus all -p $port:$port michaelf34/infinity:latest --model-name-or-path BAAI/bge-small-en-v1.5 --port $port
The download path at runtime can be controlled via the environment variable HF_HOME.
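If you use the Python API described next, the same cache override can be applied in-process. A minimal sketch follows; the cache directory is illustrative, and HF_HOME should be set before the Hugging Face libraries are imported.

```python
# Hedged sketch: redirecting the Hugging Face download cache. The directory
# is illustrative; set HF_HOME before importing libraries that read it.
import os

os.environ["HF_HOME"] = "/data/hf-cache"  # illustrative cache directory

# Imported only after HF_HOME is set, so model downloads land in the custom cache.
from infinity_emb import AsyncEmbeddingEngine, EngineArgs
```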
Instead of the CLI & REST API, you can interface directly with the Python API. This gives you the most flexibility. The Python API builds on asyncio with its async/await features to allow concurrent processing of requests.
import asyncio
from infinity_emb import AsyncEmbeddingEngine, EngineArgs
sentences = ["Embed this is sentence via Infinity.", "Paris is in France."]
engine = AsyncEmbeddingEngine.from_args(
    EngineArgs(model_name_or_path="BAAI/bge-small-en-v1.5", engine="torch")
)
async def main():
    async with engine:  # engine starts with engine.astart()