Welcome to UForm, a multi-modal AI library that's as versatile as it is efficient. Imagine encoding text, images, and soon, audio, video, and JSON documents into a shared Semantic Vector Space. With compact custom pre-trained transformer models, all of this can run anywhere—from your server farm down to your smartphone.
- **Tiny Embeddings**: With just 256 dimensions, our embeddings are lean and fast, making your inference 1.5-3x quicker compared to other CLIP-like models.
- **Quantization Magic**: Our models are trained to be quantization-aware, letting you downcast embeddings from `f32` to `i8` without losing much accuracy (see the sketch after this list).
- **Balanced Training**: Our models are cosmopolitan, trained on a uniquely balanced diet of English and other languages. This gives us an edge in languages often overlooked by other models, from Hebrew and Armenian to Hindi and Arabic.
- **Hardware Friendly**: Whether it's CoreML, ONNX, or specialized AI hardware like Graphcore IPUs, we've got you covered.
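
To illustrate the `f32` to `i8` downcast, here is a minimal NumPy sketch; `quantize_embedding` is a hypothetical helper, not part of the `uform` API, and it assumes the embedding is L2-normalized so every component lies in [-1, 1]:

```python
import numpy as np

def quantize_embedding(embedding: np.ndarray) -> np.ndarray:
    # Hypothetical helper: scale a unit-normalized f32 vector into the i8 range.
    # Assumes every component lies in [-1, 1], so multiplying by 127 loses
    # only rounding error.
    return np.clip(np.round(embedding * 127), -127, 127).astype(np.int8)

# Downstream, integer dot products over i8 vectors approximate cosine
# similarity up to a constant 127 * 127 scaling factor.
```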
| Model | Description | Languages | URL |
| :--- | :--- | :---: | :--- |
| `unum-cloud/uform-vl-english` | 2-layer text encoder, ViT-B/16, 2-layer multimodal part | 1 | weights |
| `unum-cloud/uform-vl-multilingual` | 8-layer text encoder, ViT-B/16, 4-layer multimodal part | 12 | weights |
| `unum-cloud/uform-vl-multilingual-v2` | 8-layer text encoder, ViT-B/16, 4-layer multimodal part | 21 | weights |
Install UForm via pip:

```sh
pip install uform
```
Then load a model and produce embeddings:

```python
import uform
from PIL import Image

model = uform.get_model('unum-cloud/uform-vl-english')          # Just English
model = uform.get_model('unum-cloud/uform-vl-multilingual-v2')  # 21 languages

text = 'a small red panda in a zoo'
image = Image.open('red_panda.jpg')

image_data = model.preprocess_image(image)
text_data = model.preprocess_text(text)

image_features, image_embedding = model.encode_image(image_data, return_features=True)
text_features, text_embedding = model.encode_text(text_data, return_features=True)

# The intermediate features can be used to produce a joint multimodal embedding
joint_embedding = model.encode_multimodal(
    image_features=image_features,
    text_features=text_features,
    attention_mask=text_data['attention_mask']
)
```
UForm also ships `unum-cloud/uform-gen`, a generative counterpart for tasks like image captioning, which loads the same way:

```python
import uform

model = uform.get_model('unum-cloud/uform-gen')
```
To run the image encoder on multiple GPUs, you can wrap it in `nn.DataParallel`:

```python
import torch
from torch import nn
import uform

model = uform.get_model('unum-cloud/uform-vl-english')
model_image = nn.DataParallel(model.image_encoder)

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model_image.to(device)

# `images` is a batch of preprocessed image tensors (see the sketch below)
_, res = model_image(images, 0)
```
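
For completeness, one hypothetical way to assemble the `images` batch above, assuming `preprocess_image` returns a tensor with a leading batch dimension and `paths` is your own list of image files:

```python
from PIL import Image

paths = ['red_panda.jpg', 'black_bear.jpg']  # hypothetical file list
images = torch.cat([model.preprocess_image(Image.open(p)) for p in paths]).to(device)
```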
On an RTX 3090, the following performance is expected from `uform` on text encoding.
| Model | Multilingual | Parameters | Speed | Speedup |
| :--- | :---: | ---: | ---: | ---: |
| `bert-base-uncased` | No | 109'482'240 | 1'612 seqs/s | |
| `distilbert-base-uncased` | No | 66'362'880 | 3'174 seqs/s | x 1.96 |
| `sentence-transformers/all-MiniLM-L12-v2` | Yes | 33'360'000 | 3'604 seqs/s | x 2.24 |
| `sentence-transformers/all-MiniLM-L6-v2` | No | 22'713'216 | 6'107 seqs/s | x 3.79 |
| `unum-cloud/uform-vl-multilingual-v2` | Yes | 120'090'242 | 6'809 seqs/s | x 4.22 |
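
As a rough sketch of how such throughput can be measured on your own hardware (real benchmarks would use larger batches and GPU warm-up; the loop below is illustrative only):

```python
import time
import uform

model = uform.get_model('unum-cloud/uform-vl-multilingual-v2')
texts = ['a small red panda in a zoo'] * 1024  # repeated dummy workload

start = time.perf_counter()
for text in texts:
    text_data = model.preprocess_text(text)
    model.encode_text(text_data)
elapsed = time.perf_counter() - start

print(f'{len(texts) / elapsed:.0f} seqs/s')
```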
Evaluating the `unum-cloud/uform-vl-multilingual-v2` model, one can expect the following metrics for text-to-image search, compared against the `xlm-roberta-base-ViT-B-32` OpenCLIP model.
The `@ 1`, `@ 5`, and `@ 10` columns report the quality of the top-1, top-5, and top-10 search results against a human-annotated dataset.
Higher is better.
Language | OpenCLIP @ 1 | UForm @ 1 | OpenCLIP @ 5 | UForm @ 5 | OpenCLIP @ 10 | UForm @ 10 | Speakers |
---|---|---|---|---|---|---|---|
Arabic 🇸🇦 | 22.7 | 31.7 | 44.9 | 57.8 | 55.8 | 69.2 | 274 M |
Armenian 🇦🇲 | 5.6 | 22.0 | 14.3 | 44.7 | 20.2 | 56.0 | 4 M |
Chinese 🇨🇳 | 27.3 | 32.2 | 51.3 | 59.0 | 62.1 | 70.5 | 1'118 M |
English 🇺🇸 | 37.8 | 37.7 | 63.5 | 65.0 | 73.5 | 75.9 | 1'452 M |
French 🇫🇷 | 31.3 | 35.4 | 56.5 | 62.6 | 67.4 | 73.3 | 274 M |
German 🇩🇪 | 31.7 | 35.1 | 56.9 | 62.2 | 67.4 | 73.3 | 134 M |
Hebrew 🇮🇱 | 23.7 | 26.7 | 46.3 | 51.8 | 57.0 | 63.5 | 9 M |
Hindi 🇮🇳 | 20.7 | 31.3 | 42.5 | 57.9 | 53.7 | 69.6 | 602 M |
Indonesian 🇮🇩 | 26.9 | 30.7 | 51.4 | 57.0 | 62.7 | 68.6 | 199 M |
Italian 🇮🇹 | 31.3 | 34.9 | 56.7 | 62.1 | 67.1 | 73.1 | 67 M |
Japanese 🇯🇵 | 27.4 | 32.6 | 51.5 | 59.2 | 62.6 | 70.6 | 125 M |
Korean 🇰🇷 | 24.4 | 31.5 | 48.1 | 57.8 | 59.2 | 69.2 | 81 M |
Persian 🇮🇷 | 24.0 | 28.8 | 47.0 | 54.6 | 57.8 | 66.2 | 77 M |
Polish 🇵🇱 | 29.2 | 33.6 | 53.9 | 60.1 | 64.7 | 71.3 | 41 M |
Portuguese 🇵🇹 | 31.6 | 32.7 | 57.1 | 59.6 | 67.9 | 71.0 | 257 M |
Russian 🇷🇺 | 29.9 | 33.9 | 54.8 | 60.9 | 65.8 | 72.0 | 258 M |
Spanish 🇪🇸 | 32.6 | 35.6 | 58.0 | 62.8 | 68.8 | 73.7 | 548 M |
Thai 🇹🇭 | 21.5 | 28.7 | 43.0 | 54.6 | 53.7 | 66.0 | 61 M |
Turkish 🇹🇷 | 25.5 | 33.0 | 49.1 | 59.6 | 60.3 | 70.8 | 88 M |
Ukrainian 🇺🇦 | 26.0 | 30.6 | 49.9 | 56.7 | 60.9 | 68.1 | 41 M |
Vietnamese 🇻🇳 | 25.4 | 28.3 | 49.2 | 53.9 | 60.3 | 65.5 | 85 M |
Mean | 26.5±6.4 | 31.8±3.5 | 49.8±9.8 | 58.1±4.5 | 60.4±10.6 | 69.4±4.3 | - |
Google Translate | 27.4±6.3 | 31.5±3.5 | 51.1±9.5 | 57.8±4.4 | 61.7±10.3 | 69.1±4.3 | - |
Microsoft Translator | 27.2±6.4 | 31.4±3.6 | 50.8±9.8 | 57.7±4.7 | 61.4±10.6 | 68.9±4.6 | - |
Meta NLLB | 24.9±6.7 | 32.4±3.5 | 47.5±10.3 | 58.9±4.5 | 58.2±11.2 | 70.2±4.3 | - |
Lacking a broad enough evaluation dataset, we translated the COCO Karpathy test split with multiple public and proprietary translation services, averaged the scores across all translations, and broke them down per service in the bottom rows of the table. Check out the `unum-cloud/coco-sm` repository for details.
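
For reference, a minimal NumPy sketch of how a Recall `@ K` metric of this kind is typically computed for text-to-image search; it assumes one relevant image per query (matching row indices) and is not the exact evaluation code from `coco-sm`:

```python
import numpy as np

def recall_at_k(text_embeddings: np.ndarray, image_embeddings: np.ndarray, k: int) -> float:
    # L2-normalize so dot products equal cosine similarities.
    texts = text_embeddings / np.linalg.norm(text_embeddings, axis=1, keepdims=True)
    images = image_embeddings / np.linalg.norm(image_embeddings, axis=1, keepdims=True)
    scores = texts @ images.T                 # (queries x gallery) similarity matrix
    relevant = np.diag(scores)[:, None]       # score of each query's ground-truth image
    ranks = (scores >= relevant).sum(axis=1)  # 1 = ground truth ranked first
    return float((ranks <= k).mean())
```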
There are two options to calculate semantic compatibility between an image and a text: Cosine Similarity and Matching Score.
```python
import torch.nn.functional as F

similarity = F.cosine_similarity(image_embedding, text_embedding)
```

The `similarity` will belong to the `[-1, 1]` range, with `1` meaning a perfect match.
Unlike cosine similarity, the matching score cannot be computed from unimodal embeddings alone; it needs the joint embedding produced by `encode_multimodal`. The resulting `score` will belong to the `[0, 1]` range, with `1` meaning a perfect match.

```python
score = model.get_matching_scores(joint_embedding)
```
All models and code are available under the Apache-2.0 license; see the LICENSE file for details.