model name | #param. | precision | data | batch size | IN-1K zero-shot top-1 | weight |
---|---|---|---|---|---|---|
eva_clip_psz14 | 1.1B | fp16 | LAION-400M | 41K | 78.5 | 🤗 HF link (2GB) |
We choose to train a 1.1B CLIP model, not because it is easy, but because it is hard. Please refer to this note for a glance at the challenges of training a very large CLIP model.
To our knowledge, EVA-CLIP is the largest performant open-sourced CLIP model, as measured by zero-shot classification performance on mainstream benchmarks such as ImageNet-1K and its variants. For more details about EVA-CLIP, please refer to Section 2.3.5 of our paper.
We hope open-sourcing EVA-CLIP can facilitate future research in multi-modal learning, representation learning, AIGC, etc., and we hope our solution for scaling up CLIP models can provide insight for practitioners studying large foundation models.
model | zero-shot @ 224px | linear probing @ 224px | linear probing @ 336px | fine-tuning @ 224px | fine-tuning @ 336px |
---|---|---|---|---|---|
EVA-CLIP | 78.5 (weight \| log) | 86.5 (weight \| log) | 86.5 (weight \| log) | 89.1 (weight \| log) | 89.4 (weight \| log) |
EVA-CLIP achieves state-of-the-art top-1 accuracy on ImageNet-1K among all self-supervised learning approaches. We will provide instructions for reproducing these results soon.
Zero-shot top-1 accuracy on ImageNet-1K, its variants, and ObjectNet.
model | IN-1K | IN-V2 | IN-Adv. | IN-Ren. | IN-Ske. | ObjectNet |
---|---|---|---|---|---|---|
OpenAI CLIP-L | 75.55 | 69.86 | 70.76 | 87.83 | 59.58 | 68.98 |
Open CLIP-H | 77.96 | 70.87 | 59.33 | 89.33 | 66.58 | 69.71 |
Open CLIP-g | 76.65 | 69.56 | 57.19 | 88.69 | 65.17 | 67.53 |
EVA CLIP-g | 78.53 | 71.52 | 73.59 | 92.5 | 67.31 | 72.33 |
Zero-shot performance on video action recognition benchmarks.
model | UCF-101 | Kinetics-400 | Kinetics-600 | Kinetics-700 |
---|---|---|---|---|
OpenAI CLIP-L | 76.39 | 64.47 | 64.21 | 57.68 |
Open CLIP-H | 78.16 | 63.06 | 63.58 | 56.09 |
Open CLIP-g | 77.73 | 61.69 | 62.16 | 54.99 |
EVA CLIP-g | 76.05 | 65.23 | 64.38 | 58.4 |
For video action recognition, we sample only a single center frame from each video, turning the task into image classification. Following conventional settings, we report top-1 accuracy for UCF-101 and the mean of top-1 and top-5 accuracy for Kinetics-400/600/700.
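To make the protocol concrete, below is a minimal sketch of the center-frame evaluation, assuming `model`, `preprocess`, and `tokenize` are obtained as in the usage example later in this document; the video-reading utility and the prompt template are illustrative assumptions, not necessarily the exact setup behind the reported numbers.

```python
import torch
import torchvision
from PIL import Image
from clip import tokenize

def center_frame(video_path):
    """Read a video and return its center frame as a PIL image."""
    frames, _, _ = torchvision.io.read_video(video_path, pts_unit="sec")  # (T, H, W, C), uint8
    return Image.fromarray(frames[frames.shape[0] // 2].numpy())

@torch.no_grad()
def classify_video(model, preprocess, video_path, class_names, device="cuda"):
    """Zero-shot action recognition from a single center frame."""
    image = preprocess(center_frame(video_path)).unsqueeze(0).to(device)
    # The prompt template below is a placeholder, not the exact one used in our evaluation.
    text = tokenize([f"a photo of a person {c}." for c in class_names]).to(device)
    image_feat = model.encode_image(image)
    text_feat = model.encode_text(text)
    image_feat /= image_feat.norm(dim=-1, keepdim=True)
    text_feat /= text_feat.norm(dim=-1, keepdim=True)
    return (100.0 * image_feat @ text_feat.T).softmax(dim=-1)  # (1, num_classes)
```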
Zero-shot retrieval performance on Flickr30k and MSCOCO.
Dataset | Model | Text-to-Image R@1 | Text-to-Image R@5 | Text-to-Image R@10 | Image-to-Text R@1 | Image-to-Text R@5 | Image-to-Text R@10 |
---|---|---|---|---|---|---|---|
Flickr30k | OpenAI CLIP-L | 65.18 | 87.28 | 92 | 85.2 | 97.3 | 99 |
Flickr30k | Open CLIP-H | 77.78 | 94.14 | 96.62 | 90.8 | 99.3 | 99.7 |
Flickr30k | Open CLIP-g | 76.52 | 93.62 | 96.28 | 90.8 | 99.1 | 99.8 |
Flickr30k | EVA CLIP-g | 72.64 | 91.6 | 95.12 | 88.3 | 98.3 | 99.3 |
MSCOCO | OpenAI CLIP-L | 36.51 | 61.01 | 71.11 | 56.34 | 79.32 | 86.66 |
MSCOCO | Open CLIP-H | 49.47 | 73.4 | 81.53 | 65.96 | 86.06 | 91.9 |
MSCOCO | Open CLIP-g | 47.99 | 72.37 | 80.75 | 64.96 | 85.3 | 91.46 |
MSCOCO | EVA CLIP-g | 44.07 | 68.5 | 77.33 | 61.76 | 83.28 | 89.96 |
The zero-shot retrieval performance of EVA-CLIP is relatively inferior to its Open CLIP-H / -g counterparts. We speculate there are two main reasons:
- The language tower of EVA-CLIP is much smaller and weaker than that of Open CLIP-H and Open CLIP-g (124M vs. 354M parameters), and is only ~1/8 the size of the vision tower. Meanwhile, retrieval tasks depend more on the capacity of the language branch than classification tasks do.
- Retrieval tasks seem to benefit more from a larger training dataset (LAION-2B used by Open CLIP), while we only leverage LAION-400M for EVA-CLIP training.

Nevertheless, it is hard to make a head-to-head comparison between different CLIP models. In the future, we will further scale up the language encoder and training data to improve retrieval performance.
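For reference, the R@K numbers above can be computed from L2-normalized image and text features as in the following sketch; the variable names and the caption-to-image index mapping are illustrative assumptions, and the reported results are produced with CLIP Benchmark rather than this snippet.

```python
import torch

def text_to_image_recall(image_feats, text_feats, txt2img, ks=(1, 5, 10)):
    """Recall@K for text-to-image retrieval.

    image_feats: (N_img, D) and text_feats: (N_txt, D), both L2-normalized;
    txt2img[i] is the index of the image paired with caption i.
    """
    sims = text_feats @ image_feats.T                    # (N_txt, N_img) cosine similarities
    ranks = sims.argsort(dim=-1, descending=True)        # image indices, best match first
    gt = torch.as_tensor(txt2img).unsqueeze(-1)          # (N_txt, 1) ground-truth image ids
    recalls = {}
    for k in ks:
        hits = (ranks[:, :k] == gt).any(dim=-1).float()  # is the paired image in the top-k?
        recalls[f"R@{k}"] = 100.0 * hits.mean().item()
    return recalls
```

Image-to-text recall is computed symmetrically, except that a query image counts as a hit if any of its paired captions appears in the top-K.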
The use of EVA-CLIP is similar to OpenAI CLIP and Open CLIP. Here we provide a showcase of zero-shot image classification.
First, install PyTorch 1.7.1 (or later) and torchvision, as well as small additional dependencies, and then install this repo as a Python package. On a CUDA GPU machine, the following will do the trick:
$ conda install --yes -c pytorch pytorch=1.7.1 torchvision cudatoolkit=11.0
$ pip install ftfy regex tqdm
The training code of our 1.1B EVA-CLIP will be available at FlagAI. Please stay tuned.
An example:
import torch
from eva_clip import build_eva_model_and_transforms
from clip import tokenize
from PIL import Image
eva_clip_path = "/path/to/eva_clip_psz14.pt" # https://huggingface.co/BAAI/EVA/blob/main/eva_clip_psz14.pt
model_name = "EVA_CLIP_g_14"
image_path = "CLIP.png"
caption = ["a diagram", "a dog", "a cat"]
device = "cuda" if torch.cuda.is_available() else "cpu"
# Build the model and its preprocessing transform from the pretrained weights.
model, preprocess = build_eva_model_and_transforms(model_name, pretrained=eva_clip_path)
model = model.to(device)

image = preprocess(Image.open(image_path)).unsqueeze(0).to(device)
text = tokenize(caption).to(device)

with torch.no_grad():
    # Encode and L2-normalize the image and text features.
    image_features = model.encode_image(image)
    text_features = model.encode_text(text)
    image_features /= image_features.norm(dim=-1, keepdim=True)
    text_features /= text_features.norm(dim=-1, keepdim=True)

# Cosine similarities scaled by 100, softmax over the candidate captions.
text_probs = (100.0 * image_features @ text_features.T).softmax(dim=-1)

print("Label probs:", text_probs)  # prints: [1.0000e+00, 2.0857e-10, 4.8534e-12]
EVA-CLIP is built with OpenAI CLIP, Open CLIP and CLIP Benchmark. Thanks for their awesome work!