model name | #param. | precision | data | batch size | IN-1K zero-shot top-1 | weight |
---|---|---|---|---|---|---|
eva_clip_psz14 | 1.1B | fp16 | LAION-400M | 41K | 78.5 | 🤗 HF link (2GB) |
We choose to train a 1.1B CLIP model, not because it is easy, but because it is hard. Please refer to this note for a glance at the challenges of training a very large CLIP model.
To our knowledge, EVA-CLIP is the largest performant open-sourced CLIP model, as measured by zero-shot classification performance on mainstream benchmarks such as ImageNet-1K and its variants. For more details about EVA-CLIP, please refer to Section 2.3.5 of our paper.
We hope open-sourcing EVA-CLIP can facilitate future research in multi-modal learning, representation learning, AIGC, etc., and we hope our solution for scaling up CLIP models can provide insight for practitioners studying large foundation models.
model | zero-shot @ 224px | linear probing @ 224px | linear probing @ 336px | fine-tuning @ 224px | fine-tuning @ 336px |
---|---|---|---|---|---|
EVA-CLIP | 78.5 (weight \| log) | 86.5 (weight \| log) | 86.5 (weight \| log) | 89.1 (weight \| log) | 89.4 (weight \| log) |
EVA-CLIP achieves state-of-the-art top-1 accuracy on ImageNet-1K among all self-supervised learning approaches. We will provide instructions for reproducing these results soon.
Zero-shot top-1 accuracy on ImageNet-1K, its variants, and ObjectNet.
model | IN-1K | IN-V2 | IN-Adv. | IN-Ren. | IN-Ske. | ObjectNet |
---|---|---|---|---|---|---|
OpenAI CLIP-L | 75.55 | 69.86 | 70.76 | 87.83 | 59.58 | 68.98 |
Open CLIP-H | 77.96 | 70.87 | 59.33 | 89.33 | 66.58 | 69.71 |
Open CLIP-g | 76.65 | 69.56 | 57.19 | 88.69 | 65.17 | 67.53 |
EVA CLIP-g | 78.53 | 71.52 | 73.59 | 92.5 | 67.31 | 72.33 |
Zero-shot performance on video action recognition benchmarks.
model | UCF-101 | Kinetics-400 | Kinetics-600 | Kinetics-700 |
---|---|---|---|---|
OpenAI CLIP-L | 76.39 | 64.47 | 64.21 | 57.68 |
Open CLIP-H | 78.16 | 63.06 | 63.58 | 56.09 |
Open CLIP-g | 77.73 | 61.69 | 62.16 | 54.99 |
EVA CLIP-g | 76.05 | 65.23 | 64.38 | 58.4 |
For video action recognition, we sample only a single center frame from each video, turning the task into image classification. Following conventional settings, we report top-1 accuracy for UCF-101 and the mean of top-1 and top-5 accuracy for Kinetics-400/600/700.
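To make the protocol concrete, below is a minimal sketch of the center-frame evaluation, assuming `model`, `preprocess`, and `tokenize` are obtained as in the usage example later in this document; the video-reading utility and the prompt template are illustrative assumptions, not necessarily the exact setup behind the reported numbers.

```python
import torch
import torchvision
from PIL import Image
from clip import tokenize

def center_frame(video_path):
    """Read a video and return its center frame as a PIL image."""
    frames, _, _ = torchvision.io.read_video(video_path, pts_unit="sec")  # (T, H, W, C), uint8
    return Image.fromarray(frames[frames.shape[0] // 2].numpy())

@torch.no_grad()
def classify_video(model, preprocess, video_path, class_names, device="cuda"):
    """Zero-shot action recognition from a single center frame."""
    image = preprocess(center_frame(video_path)).unsqueeze(0).to(device)
    # The prompt template below is a placeholder, not the exact one used in our evaluation.
    text = tokenize([f"a photo of a person {c}." for c in class_names]).to(device)
    image_feat = model.encode_image(image)
    text_feat = model.encode_text(text)
    image_feat /= image_feat.norm(dim=-1, keepdim=True)
    text_feat /= text_feat.norm(dim=-1, keepdim=True)
    return (100.0 * image_feat @ text_feat.T).softmax(dim=-1)  # (1, num_classes)
```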
Zero-shot retrieval performance on Flickr30k and MSCOCO.
Dataset | Model | Text-to-Image R@1 | Text-to-Image R@5 | Text-to-Image R@10 | Image-to-Text R@1 | Image-to-Text R@5 | Image-to-Text R@10 |
---|---|---|---|---|---|---|---|
Flickr30k | OpenAI CLIP-L | 65.18 | 87.28 | 92 | 85.2 | 97.3 | 99 |
Flickr30k | Open CLIP-H | 77.78 | 94.14 | 96.62 | 90.8 | 99.3 | 99.7 |
Flickr30k | Open CLIP-g | 76.52 | 93.62 | 96.28 | 90.8 | 99.1 | 99.8 |
Flickr30k | EVA CLIP-g | 72.64 | 91.6 | 95.12 | 88.3 | 98.3 | 99.3 |
MSCOCO | OpenAI CLIP-L | 36.51 | 61.01 | 71.11 | 56.34 | 79.32 | 86.66 |
MSCOCO | Open CLIP-H | 49.47 | 73.4 | 81.53 | 65.96 | 86.06 | 91.9 |
MSCOCO | Open CLIP-g | 47.99 | 72.37 | 80.75 | 64.96 | 85.3 | 91.46 |
MSCOCO | EVA CLIP-g | 44.07 | 68.5 | 77.33 | 61.76 | 83.28 | 89.96 |
The zero-shot retrieval performance of EVA-CLIP is relatively inferior to its Open CLIP-H / -g counterparts. We speculate there are two main reasons:
- The language tower of EVA-CLIP is much smaller and weaker than that of Open CLIP-H and Open CLIP-g (124M vs. 354M parameters), and is only ~1/8 the size of the vision tower. Meanwhile, retrieval tasks depend more on the capacity of the language branch than classification tasks do.
- Retrieval tasks seem to benefit more from a larger training dataset (LAION-2B used by Open CLIP), while we only leverage LAION-400M for EVA-CLIP training.

Nevertheless, it is hard to make a head-to-head comparison between different CLIP models. In the future, we will further scale up the language encoder and training data to improve retrieval performance.
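For reference, the R@K numbers above can be computed from L2-normalized image and text features as in the following sketch; the variable names and the caption-to-image index mapping are illustrative assumptions, and the reported results are produced with CLIP Benchmark rather than this snippet.

```python
import torch

def text_to_image_recall(image_feats, text_feats, txt2img, ks=(1, 5, 10)):
    """Recall@K for text-to-image retrieval.

    image_feats: (N_img, D) and text_feats: (N_txt, D), both L2-normalized;
    txt2img[i] is the index of the image paired with caption i.
    """
    sims = text_feats @ image_feats.T                    # (N_txt, N_img) cosine similarities
    ranks = sims.argsort(dim=-1, descending=True)        # image indices, best match first
    gt = torch.as_tensor(txt2img).unsqueeze(-1)          # (N_txt, 1) ground-truth image ids
    recalls = {}
    for k in ks:
        hits = (ranks[:, :k] == gt).any(dim=-1).float()  # is the paired image in the top-k?
        recalls[f"R@{k}"] = 100.0 * hits.mean().item()
    return recalls
```

Image-to-text recall is computed symmetrically, except that a query image counts as a hit if any of its paired captions appears in the top-K.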
The use of EVA-CLIP is similar to OpenAI CLIP and Open CLIP. Here we provide a showcase of zero-shot image classification.
First, install PyTorch 1.7.1 (or later) and torchvision, as well as small additional dependencies, and then install this repo as a Python package. On a CUDA GPU machine, the following will do the trick:
$ conda install --yes -c pytorch pytorch=1.7.1 torchvision cudatoolkit=11.0
$ pip install ftfy regex tqdm
The training code of our 1.1B EVA-CLIP will be available at FlagAI. Please stay tuned.
An example:
import torch
from eva_clip import build_eva_model_and_transforms
from clip import tokenize
from PIL import Image
eva_clip_path = "/path/to/eva_clip_psz14.pt" # https://huggingface.co/BAAI/EVA/blob/main/eva_clip_psz14.pt
model_name = "EVA_CLIP_g_14"
image_path = "CLIP.png"
caption = ["a diagram", "a dog", "a cat"]
device = "cuda" if torch.cuda.is_available() else "cpu"
# Build the model and its preprocessing transform from the pretrained weights.
model, preprocess = build_eva_model_and_transforms(model_name, pretrained=eva_clip_path)
model = model.to(device)

image = preprocess(Image.open(image_path)).unsqueeze(0).to(device)
text = tokenize(caption).to(device)

with torch.no_grad():
    # Encode and L2-normalize the image and text features.
    image_features = model.encode_image(image)
    text_features = model.encode_text(text)
    image_features /= image_features.norm(dim=-1, keepdim=True)
    text_features /= text_features.norm(dim=-1, keepdim=True)

# Cosine similarities scaled by 100, softmax over the candidate captions.
text_probs = (100.0 * image_features @ text_features.T).softmax(dim=-1)

print("Label probs:", text_probs)  # prints: [1.0000e+00, 2.0857e-10, 4.8534e-12]
EVA-CLIP is built with OpenAI CLIP, Open CLIP and CLIP Benchmark. Thanks for their awesome work!