UForm

Pocket-Sized Multi-Modal AI
For content generation and understanding


Discord       LinkedIn       Twitter       Blog       GitHub


Welcome to UForm, a multi-modal AI library that's as versatile as it is efficient. Imagine encoding text, images, and soon, audio, video, and JSON documents into a shared Semantic Vector Space. With compact custom pre-trained transformer models, all of this can run anywhere—from your server farm down to your smartphone.

Key Features

  • Tiny Embeddings: With just 256 dimensions, our embeddings are lean and fast, making your inference 1.5-3x quicker compared to other CLIP-like models.

  • Quantization Magic: Our models are trained to be quantization-aware, letting you downcast embeddings from f32 to i8 without losing much accuracy (see the sketch after this list).

  • Balanced Training: Our models are cosmopolitan, trained on a uniquely balanced diet of English and other languages. This gives us an edge in languages often overlooked by other models, from Hebrew and Armenian to Hindi and Arabic.

  • Hardware Friendly: Whether it's CoreML, ONNX, or specialized AI hardware like Graphcore IPUs, we've got you covered.
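
A minimal sketch of the f32 → i8 downcast mentioned above, written in plain PyTorch rather than any UForm-specific API; the fixed scale of 127 assumes the embedding is unit-normalized, so every value lies in [-1, 1]:

import torch

# Illustrative only: downcast a unit-normalized f32 embedding to i8.
embedding_f32 = torch.randn(256)
embedding_f32 = embedding_f32 / embedding_f32.norm()          # unit-normalize
embedding_i8 = (embedding_f32 * 127).round().to(torch.int8)   # 4x smaller payload

# An approximate cosine similarity can still be computed in integer space,
# accumulating in int32 and rescaling back into the [-1, 1] range.
approx_cos = int((embedding_i8.int() * embedding_i8.int()).sum()) / (127 * 127)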

Model Cards

| Model | Description | Languages | URL |
| :--- | :--- | :---: | :--- |
| unum-cloud/uform-vl-english | 2 layers text encoder, ViT-B/16, 2 layers multimodal part | 1 | weights |
| unum-cloud/uform-vl-multilingual | 8 layers text encoder, ViT-B/16, 4 layers multimodal part | 12 | weights |
| unum-cloud/uform-vl-multilingual-v2 | 8 layers text encoder, ViT-B/16, 4 layers multimodal part | 21 | weights |

Installation

Install UForm via pip:

pip install uform

Quick Start

Encoding models

Loading a Model

import uform

model = uform.get_model('unum-cloud/uform-vl-english') # Just English
model = uform.get_model('unum-cloud/uform-vl-multilingual-v2') # 21 Languages

Encoding Data

from PIL import Image

text = 'a small red panda in a zoo'
image = Image.open('red_panda.jpg')

image_data = model.preprocess_image(image)
text_data = model.preprocess_text(text)

image_features, image_embedding = model.encode_image(image_data, return_features=True)
text_features, text_embedding = model.encode_text(text_data, return_features=True)

# Features can be used to produce joint multimodal embeddings
joint_embedding = model.encode_multimodal(
    image_features=image_features,
    text_features=text_features,
    attention_mask=text_data['attention_mask']
)
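
The unimodal embeddings can be compared directly with cosine similarity, while the joint embedding feeds the matching score; both options are covered under Additional Tooling below.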

Generative Models

import uform

model = uform.get_model('unum-cloud/uform-gen')
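
The snippet above only loads the checkpoint. As a rough sketch of caption generation, modeled on the unum-cloud/uform-gen Hugging Face model card rather than on anything shown in this README, the VLMForCausalLM / VLMProcessor import path, the prompt format, and the generation arguments below are all assumptions:

import torch
from PIL import Image
from uform.gen_model import VLMForCausalLM, VLMProcessor  # assumed import path

gen_model = VLMForCausalLM.from_pretrained('unum-cloud/uform-gen')
processor = VLMProcessor.from_pretrained('unum-cloud/uform-gen')

prompt = '[cap] Summarize the visual content of the image.'  # assumed captioning prompt
image = Image.open('red_panda.jpg')

inputs = processor(texts=[prompt], images=[image], return_tensors='pt')
with torch.inference_mode():
    output = gen_model.generate(**inputs, do_sample=False, max_new_tokens=128)

# Strip the prompt tokens and decode only the newly generated caption.
prompt_len = inputs['input_ids'].shape[1]
caption = processor.batch_decode(output[:, prompt_len:])[0]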

Multi-GPU

import uform
import torch
from torch import nn

model = uform.get_model('unum-cloud/uform-vl-english')
model_image = nn.DataParallel(model.image_encoder)

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model_image.to(device)

# `images` is a preprocessed batch of images, e.g. stacked outputs of model.preprocess_image
_, res = model_image(images, 0)
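
Note that nn.DataParallel replicates the image encoder on every visible GPU and splits the batch along the first dimension; for multi-node setups, torch.nn.parallel.DistributedDataParallel is the usual alternative.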

Models Evaluation

Speed

On an Nvidia RTX 3090, the following text-encoding throughput can be expected from UForm.

| Model | Multi-lingual | Model Size (parameters) | Speed | Speedup |
| :--- | :---: | ---: | ---: | ---: |
| bert-base-uncased | No | 109'482'240 | 1'612 seqs/s | |
| distilbert-base-uncased | No | 66'362'880 | 3'174 seqs/s | x 1.96 |
| sentence-transformers/all-MiniLM-L12-v2 | Yes | 33'360'000 | 3'604 seqs/s | x 2.24 |
| sentence-transformers/all-MiniLM-L6-v2 | No | 22'713'216 | 6'107 seqs/s | x 3.79 |
| unum-cloud/uform-vl-multilingual-v2 | Yes | 120'090'242 | 6'809 seqs/s | x 4.22 |
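
The benchmarking script itself is not part of this README; the sketch below shows how a seqs/s figure like the ones above could be measured with the Quick Start API. The batch size, repetition count, and the assumption that preprocess_text accepts a list of strings are all illustrative.

import time
import torch
import uform

model = uform.get_model('unum-cloud/uform-vl-multilingual-v2')
texts = ['a small red panda in a zoo'] * 256  # illustrative batch

# Move the model to a GPU to approximate the RTX 3090 setting.
start = time.perf_counter()
with torch.inference_mode():
    for _ in range(10):
        text_data = model.preprocess_text(texts)
        model.encode_text(text_data)
elapsed = time.perf_counter() - start
print(f'{10 * len(texts) / elapsed:.0f} seqs/s')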

Accuracy

Evaluating the unum-cloud/uform-vl-multilingual-v2 model, one can expect the following text-to-image search metrics, compared against the xlm-roberta-base-ViT-B-32 OpenCLIP model. The @ 1, @ 5, and @ 10 columns show the quality of the top-1, top-5, and top-10 search results against a human-annotated dataset. Higher is better.

| Language | OpenCLIP @ 1 | UForm @ 1 | OpenCLIP @ 5 | UForm @ 5 | OpenCLIP @ 10 | UForm @ 10 | Speakers |
| :--- | ---: | ---: | ---: | ---: | ---: | ---: | ---: |
| Arabic 🇸🇦 | 22.7 | 31.7 | 44.9 | 57.8 | 55.8 | 69.2 | 274 M |
| Armenian 🇦🇲 | 5.6 | 22.0 | 14.3 | 44.7 | 20.2 | 56.0 | 4 M |
| Chinese 🇨🇳 | 27.3 | 32.2 | 51.3 | 59.0 | 62.1 | 70.5 | 1'118 M |
| English 🇺🇸 | 37.8 | 37.7 | 63.5 | 65.0 | 73.5 | 75.9 | 1'452 M |
| French 🇫🇷 | 31.3 | 35.4 | 56.5 | 62.6 | 67.4 | 73.3 | 274 M |
| German 🇩🇪 | 31.7 | 35.1 | 56.9 | 62.2 | 67.4 | 73.3 | 134 M |
| Hebrew 🇮🇱 | 23.7 | 26.7 | 46.3 | 51.8 | 57.0 | 63.5 | 9 M |
| Hindi 🇮🇳 | 20.7 | 31.3 | 42.5 | 57.9 | 53.7 | 69.6 | 602 M |
| Indonesian 🇮🇩 | 26.9 | 30.7 | 51.4 | 57.0 | 62.7 | 68.6 | 199 M |
| Italian 🇮🇹 | 31.3 | 34.9 | 56.7 | 62.1 | 67.1 | 73.1 | 67 M |
| Japanese 🇯🇵 | 27.4 | 32.6 | 51.5 | 59.2 | 62.6 | 70.6 | 125 M |
| Korean 🇰🇷 | 24.4 | 31.5 | 48.1 | 57.8 | 59.2 | 69.2 | 81 M |
| Persian 🇮🇷 | 24.0 | 28.8 | 47.0 | 54.6 | 57.8 | 66.2 | 77 M |
| Polish 🇵🇱 | 29.2 | 33.6 | 53.9 | 60.1 | 64.7 | 71.3 | 41 M |
| Portuguese 🇵🇹 | 31.6 | 32.7 | 57.1 | 59.6 | 67.9 | 71.0 | 257 M |
| Russian 🇷🇺 | 29.9 | 33.9 | 54.8 | 60.9 | 65.8 | 72.0 | 258 M |
| Spanish 🇪🇸 | 32.6 | 35.6 | 58.0 | 62.8 | 68.8 | 73.7 | 548 M |
| Thai 🇹🇭 | 21.5 | 28.7 | 43.0 | 54.6 | 53.7 | 66.0 | 61 M |
| Turkish 🇹🇷 | 25.5 | 33.0 | 49.1 | 59.6 | 60.3 | 70.8 | 88 M |
| Ukrainian 🇺🇦 | 26.0 | 30.6 | 49.9 | 56.7 | 60.9 | 68.1 | 41 M |
| Vietnamese 🇻🇳 | 25.4 | 28.3 | 49.2 | 53.9 | 60.3 | 65.5 | 85 M |
| Mean | 26.5±6.4 | 31.8±3.5 | 49.8±9.8 | 58.1±4.5 | 60.4±10.6 | 69.4±4.3 | - |
| Google Translate | 27.4±6.3 | 31.5±3.5 | 51.1±9.5 | 57.8±4.4 | 61.7±10.3 | 69.1±4.3 | - |
| Microsoft Translator | 27.2±6.4 | 31.4±3.6 | 50.8±9.8 | 57.7±4.7 | 61.4±10.6 | 68.9±4.6 | - |
| Meta NLLB | 24.9±6.7 | 32.4±3.5 | 47.5±10.3 | 58.9±4.5 | 58.2±11.2 | 70.2±4.3 | - |

Lacking a broad enough evaluation dataset, we translated the COCO Karpathy test split with multiple public and proprietary translation services, averaged the scores across all sets, and broke them down per service in the bottom section of the table. Check out the unum-cloud/coco-sm repository for details.

🧰 Additional Tooling

There are two options to calculate semantic compatibility between an image and a text: Cosine Similarity and Matching Score.

Cosine Similarity

import torch.nn.functional as F

similarity = F.cosine_similarity(image_embedding, text_embedding)

The similarity falls in the [-1, 1] range, where 1 means a perfect match.

Matching Score

Unlike cosine similarity, the matching score cannot be computed from unimodal embeddings alone; it requires the joint multimodal embedding produced by encode_multimodal. The resulting score falls in the [0, 1] range, where 1 means a perfect match.

score = model.get_matching_scores(joint_embedding)
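
In practice the two scores complement each other: cosine similarity over the unimodal embeddings is cheap enough to shortlist candidates at scale, while the matching score, which needs a multimodal forward pass per image-text pair, is better reserved for re-ranking that shortlist.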

License

All models and code are available under the Apache-2.0 license; see the LICENSE file for details.
