Welcome to UForm, a multi-modal AI library that's as versatile as it is efficient. Imagine encoding text, images, and soon, audio, video, and JSON documents into a shared Semantic Vector Space. With compact custom pre-trained transformer models, all of this can run anywhere—from your server farm down to your smartphone.
- **Tiny Embeddings**: With just 256 dimensions, our embeddings are lean and fast, making your inference 1.5-3x quicker compared to other CLIP-like models.
- **Quantization Magic**: Our models are trained to be quantization-aware, letting you downcast embeddings from `f32` to `i8` without losing much accuracy (see the sketch after this list).
- **Balanced Training**: Our models are cosmopolitan, trained on a uniquely balanced diet of English and other languages. This gives us an edge in languages often overlooked by other models, from Hebrew and Armenian to Hindi and Arabic.
- **Hardware Friendly**: Whether it's CoreML, ONNX, or specialized AI hardware like Graphcore IPUs, we've got you covered.
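
To illustrate the `f32` to `i8` downcast, here is a minimal NumPy sketch; `quantize_embedding` is a hypothetical helper, not part of the `uform` API, and it assumes the embedding is L2-normalized so every component lies in [-1, 1]:

```python
import numpy as np

def quantize_embedding(embedding: np.ndarray) -> np.ndarray:
    # Hypothetical helper: scale a unit-normalized f32 vector into the i8 range.
    # Assumes every component lies in [-1, 1], so multiplying by 127 loses
    # only rounding error.
    return np.clip(np.round(embedding * 127), -127, 127).astype(np.int8)

# Downstream, integer dot products over i8 vectors approximate cosine
# similarity up to a constant 127 * 127 scaling factor.
```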
| Model | Description | Languages | URL |
| :--- | :--- | :---: | :--- |
| `unum-cloud/uform-vl-english` | 2-layer text encoder, ViT-B/16, 2-layer multimodal part | 1 | weights |
| `unum-cloud/uform-vl-multilingual` | 8-layer text encoder, ViT-B/16, 4-layer multimodal part | 12 | weights |
| `unum-cloud/uform-vl-multilingual-v2` | 8-layer text encoder, ViT-B/16, 4-layer multimodal part | 21 | weights |
Install UForm via pip:

```sh
pip install uform
```
Then load a model and produce embeddings:

```python
import uform
from PIL import Image

model = uform.get_model('unum-cloud/uform-vl-english')          # Just English
model = uform.get_model('unum-cloud/uform-vl-multilingual-v2')  # 21 languages

text = 'a small red panda in a zoo'
image = Image.open('red_panda.jpg')

image_data = model.preprocess_image(image)
text_data = model.preprocess_text(text)

image_features, image_embedding = model.encode_image(image_data, return_features=True)
text_features, text_embedding = model.encode_text(text_data, return_features=True)

# The intermediate features can be used to produce a joint multimodal embedding
joint_embedding = model.encode_multimodal(
    image_features=image_features,
    text_features=text_features,
    attention_mask=text_data['attention_mask']
)
```
UForm also ships `unum-cloud/uform-gen`, a generative counterpart for tasks like image captioning, which loads the same way:

```python
import uform

model = uform.get_model('unum-cloud/uform-gen')
```
To run the image encoder on multiple GPUs, you can wrap it in `nn.DataParallel`:

```python
import torch
from torch import nn
import uform

model = uform.get_model('unum-cloud/uform-vl-english')
model_image = nn.DataParallel(model.image_encoder)

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model_image.to(device)

# `images` is a batch of preprocessed image tensors (see the sketch below)
_, res = model_image(images, 0)
```
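
For completeness, one hypothetical way to assemble the `images` batch above, assuming `preprocess_image` returns a tensor with a leading batch dimension and `paths` is your own list of image files:

```python
from PIL import Image

paths = ['red_panda.jpg', 'black_bear.jpg']  # hypothetical file list
images = torch.cat([model.preprocess_image(Image.open(p)) for p in paths]).to(device)
```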
On an RTX 3090, the following performance is expected from `uform` on text encoding.
| Model | Multilingual | Parameters | Speed | Speedup |
| :--- | :---: | ---: | ---: | ---: |
| `bert-base-uncased` | No | 109'482'240 | 1'612 seqs/s | |
| `distilbert-base-uncased` | No | 66'362'880 | 3'174 seqs/s | x 1.96 |
| `sentence-transformers/all-MiniLM-L12-v2` | Yes | 33'360'000 | 3'604 seqs/s | x 2.24 |
| `sentence-transformers/all-MiniLM-L6-v2` | No | 22'713'216 | 6'107 seqs/s | x 3.79 |
| `unum-cloud/uform-vl-multilingual-v2` | Yes | 120'090'242 | 6'809 seqs/s | x 4.22 |
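
As a rough sketch of how such throughput can be measured on your own hardware (real benchmarks would use larger batches and GPU warm-up; the loop below is illustrative only):

```python
import time
import uform

model = uform.get_model('unum-cloud/uform-vl-multilingual-v2')
texts = ['a small red panda in a zoo'] * 1024  # repeated dummy workload

start = time.perf_counter()
for text in texts:
    text_data = model.preprocess_text(text)
    model.encode_text(text_data)
elapsed = time.perf_counter() - start

print(f'{len(texts) / elapsed:.0f} seqs/s')
```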
Evaluating the `unum-cloud/uform-vl-multilingual-v2` model, one can expect the following metrics for text-to-image search, compared against the `xlm-roberta-base-ViT-B-32` OpenCLIP model.
The `@ 1`, `@ 5`, and `@ 10` columns report the quality of the top-1, top-5, and top-10 search results against a human-annotated dataset.
Higher is better.
Language | OpenCLIP @ 1 | UForm @ 1 | OpenCLIP @ 5 | UForm @ 5 | OpenCLIP @ 10 | UForm @ 10 | Speakers |
---|---|---|---|---|---|---|---|
Arabic 🇸🇦 | 22.7 | 31.7 | 44.9 | 57.8 | 55.8 | 69.2 | 274 M |
Armenian 🇦🇲 | 5.6 | 22.0 | 14.3 | 44.7 | 20.2 | 56.0 | 4 M |
Chinese 🇨🇳 | 27.3 | 32.2 | 51.3 | 59.0 | 62.1 | 70.5 | 1'118 M |
English 🇺🇸 | 37.8 | 37.7 | 63.5 | 65.0 | 73.5 | 75.9 | 1'452 M |
French 🇫🇷 | 31.3 | 35.4 | 56.5 | 62.6 | 67.4 | 73.3 | 274 M |
German 🇩🇪 | 31.7 | 35.1 | 56.9 | 62.2 | 67.4 | 73.3 | 134 M |
Hebrew 🇮🇱 | 23.7 | 26.7 | 46.3 | 51.8 | 57.0 | 63.5 | 9 M |
Hindi 🇮🇳 | 20.7 | 31.3 | 42.5 | 57.9 | 53.7 | 69.6 | 602 M |
Indonesian 🇮🇩 | 26.9 | 30.7 | 51.4 | 57.0 | 62.7 | 68.6 | 199 M |
Italian 🇮🇹 | 31.3 | 34.9 | 56.7 | 62.1 | 67.1 | 73.1 | 67 M |
Japanese 🇯🇵 | 27.4 | 32.6 | 51.5 | 59.2 | 62.6 | 70.6 | 125 M |
Korean 🇰🇷 | 24.4 | 31.5 | 48.1 | 57.8 | 59.2 | 69.2 | 81 M |
Persian 🇮🇷 | 24.0 | 28.8 | 47.0 | 54.6 | 57.8 | 66.2 | 77 M |
Polish 🇵🇱 | 29.2 | 33.6 | 53.9 | 60.1 | 64.7 | 71.3 | 41 M |
Portuguese 🇵🇹 | 31.6 | 32.7 | 57.1 | 59.6 | 67.9 | 71.0 | 257 M |
Russian 🇷🇺 | 29.9 | 33.9 | 54.8 | 60.9 | 65.8 | 72.0 | 258 M |
Spanish 🇪🇸 | 32.6 | 35.6 | 58.0 | 62.8 | 68.8 | 73.7 | 548 M |
Thai 🇹🇭 | 21.5 | 28.7 | 43.0 | 54.6 | 53.7 | 66.0 | 61 M |
Turkish 🇹🇷 | 25.5 | 33.0 | 49.1 | 59.6 | 60.3 | 70.8 | 88 M |
Ukrainian 🇺🇦 | 26.0 | 30.6 | 49.9 | 56.7 | 60.9 | 68.1 | 41 M |
Vietnamese 🇻🇳 | 25.4 | 28.3 | 49.2 | 53.9 | 60.3 | 65.5 | 85 M |
Mean | 26.5±6.4 | 31.8±3.5 | 49.8±9.8 | 58.1±4.5 | 60.4±10.6 | 69.4±4.3 | - |
Google Translate | 27.4±6.3 | 31.5±3.5 | 51.1±9.5 | 57.8±4.4 | 61.7±10.3 | 69.1±4.3 | - |
Microsoft Translator | 27.2±6.4 | 31.4±3.6 | 50.8±9.8 | 57.7±4.7 | 61.4±10.6 | 68.9±4.6 | - |
Meta NLLB | 24.9±6.7 | 32.4±3.5 | 47.5±10.3 | 58.9±4.5 | 58.2±11.2 | 70.2±4.3 | - |
Lacking a broad enough evaluation dataset, we translated the COCO Karpathy test split with multiple public and proprietary translation services, averaged the scores across all translations, and broke them down per service in the bottom rows of the table. Check out the `unum-cloud/coco-sm` repository for details.
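
For reference, a minimal NumPy sketch of how a Recall `@ K` metric of this kind is typically computed for text-to-image search; it assumes one relevant image per query (matching row indices) and is not the exact evaluation code from `coco-sm`:

```python
import numpy as np

def recall_at_k(text_embeddings: np.ndarray, image_embeddings: np.ndarray, k: int) -> float:
    # L2-normalize so dot products equal cosine similarities.
    texts = text_embeddings / np.linalg.norm(text_embeddings, axis=1, keepdims=True)
    images = image_embeddings / np.linalg.norm(image_embeddings, axis=1, keepdims=True)
    scores = texts @ images.T                 # (queries x gallery) similarity matrix
    relevant = np.diag(scores)[:, None]       # score of each query's ground-truth image
    ranks = (scores >= relevant).sum(axis=1)  # 1 = ground truth ranked first
    return float((ranks <= k).mean())
```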
There are two options to calculate semantic compatibility between an image and a text: Cosine Similarity and Matching Score.
```python
import torch.nn.functional as F

similarity = F.cosine_similarity(image_embedding, text_embedding)
```

The `similarity` will belong to the `[-1, 1]` range, with `1` meaning a perfect match.
Unlike cosine similarity, the matching score cannot be computed from unimodal embeddings alone; it needs the joint embedding produced by `encode_multimodal`. The resulting `score` will belong to the `[0, 1]` range, with `1` meaning a perfect match.

```python
score = model.get_matching_scores(joint_embedding)
```
All models and code are available under the Apache-2.0 license; see the LICENSE file for details.