# CLOOB: Modern Hopfield Networks with InfoLOOB Outperform CLIP
_Andreas Fürst* 1, Elisabeth Rumetshofer* 1, Johannes Lehner1, Viet Tran1, Fei Tang3, Hubert Ramsauer1, David Kreil2, Michael Kopp2, Günter Klambauer1, Angela Bitto-Nemling1, Sepp Hochreiter1 2_

1 ELLIS Unit Linz and LIT AI Lab, Institute for Machine Learning, Johannes Kepler University Linz, Austria
2 Institute of Advanced Research in Artificial Intelligence (IARAI)
3 HERE Technologies
* Equal contribution

---

#### Detailed blog post on this paper at [this link](https://ml-jku.github.io/cloob).

#### The full paper is available [here](https://arxiv.org/abs/2110.11316).

---

# Implementation of CLOOB

This repository contains the implementation of CLOOB used to obtain the results reported in the paper.
The implementation is based on [OpenCLIP](https://github.com/mlfoundations/open_clip), an open source implementation of OpenAI's [CLIP](https://arxiv.org/abs/2103.00020).

## Setup
We provide an `environment.yml` file to set up a conda environment with all required packages.
Run the following commands to clone the repository and create the environment.

```bash
# Clone the repository and switch into the directory
git clone https://github.com/ml-jku/cloob
cd cloob

# Create the environment and activate it
conda env create --file environment.yml
conda activate cloob

# Add the directory to the PYTHONPATH environment variable
export PYTHONPATH="$PYTHONPATH:$PWD/src"
```

## Data
For pre-training we use the two datasets supported by OpenCLIP, namely Conceptual Captions and YFCC.

### Conceptual Captions
OpenCLIP already provides a script to download and prepare the Conceptual Captions dataset, which contains 2.89M training images and 13k validation images.
First, download the [Conceptual Captions URLs](https://ai.google.com/research/ConceptualCaptions/download), then run the script `gather_cc.py`.

```bash
python3 src/data/gather_cc.py path/to/Train_GCC-training.tsv path/to/Validation_GCC-1.1.0-Validation.tsv
```

### YFCC
We use the same subset of ~15M images from the [YFCC100M](http://mmcommons.org) dataset as CLIP.
OpenAI provides a list of (line number, photo identifier, photo hash) entries, one for each image contained in this subset, [here](https://openaipublic.azureedge.net/clip/data/yfcc100m_subset_data.tsv.bz2).
For more information see [YFCC100m Subset](https://github.com/openai/CLIP/blob/main/data/yfcc100m.md) on OpenAI's GitHub.
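For a quick sanity check, the subset list can be parsed with the Python standard library alone. The following is a minimal sketch (not part of this repository); it assumes the `.tsv.bz2` file from the link above has been downloaded to the working directory and uses the three-column layout described there.

```python
# Minimal sketch for reading OpenAI's YFCC100M subset list.
# Assumption: yfcc100m_subset_data.tsv.bz2 lies in the working directory
# and each row holds (line number, photo identifier, photo hash).
import bz2
import csv

photo_ids = []
with bz2.open("yfcc100m_subset_data.tsv.bz2", mode="rt") as f:
    for line_number, photo_id, photo_hash in csv.reader(f, delimiter="\t"):
        photo_ids.append(photo_id)

print(f"{len(photo_ids)} images in the CLIP subset")  # roughly 15M expected
```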
### Downstream Tasks
In the paper we report results on several downstream tasks. Except for ImageNet we provide links to already pre-processed versions (where necessary) of the respective test sets.

| Dataset | Description | Official | Processed |
|---------------|-------------|----------|-----------|
| Birdsnap | This dataset contains images of North American bird species; however, our dataset is smaller than reported in CLIP as some samples are no longer available. | [Link](http://thomasberg.org/datasets/birdsnap/1.1/birdsnap.tgz) | [Link](https://ml.jku.at/research/CLOOB/downloads/zeroshot_datasets/birdsnap.zip) |
| Country211 | This dataset was published in CLIP and is a small subset of the YFCC100m dataset. It consists of photos that can be assigned to 211 countries via GPS coordinates. For each country, 200 photos are sampled for the training set and 100 for testing. | [Link](https://github.com/openai/CLIP/blob/main/data/country211.md) | [Link](https://ml.jku.at/research/CLOOB/downloads/zeroshot_datasets/country211.zip) |
| Flowers102 | Images of 102 flower categories commonly occurring in the United Kingdom were collected. Several classes are very similar, and there is large variation in scale, pose and lighting. | [Link](https://www.robots.ox.ac.uk/~vgg/data/flowers/102/) | [Link](https://ml.jku.at/research/CLOOB/downloads/zeroshot_datasets/flowers102.zip) |
| GTSRB | This dataset was released for a challenge held at IJCNN 2011. It contains images of German traffic signs from more than 40 classes. | [Link](https://benchmark.ini.rub.de/gtsrb_news.html) | [Link](https://ml.jku.at/research/CLOOB/downloads/zeroshot_datasets/gtsrb.zip) |
| Stanford Cars | This dataset contains images of 196 car models at the level of make, model and year (e.g. Tesla Model S Sedan 2012). | [Link](http://ai.stanford.edu/~jkrause/cars/car_dataset.html) | [Link](https://ml.jku.at/research/CLOOB/downloads/zeroshot_datasets/stanford_cars.zip) |
| UCF101 | This action recognition dataset has been converted to images by extracting the middle frame of each video. | [Link](https://www.crcv.ucf.edu/data/UCF101.php) | [Link](https://ml.jku.at/research/CLOOB/downloads/zeroshot_datasets/ucf101.zip) |
| ImageNet | This dataset spans 1000 object classes and contains 1,281,167 training images, 50,000 validation images and 100,000 test images. | [Link](https://image-net.org/download.php) | - |
| ImageNet v2 | The ImageNetV2 dataset contains new test data for the ImageNet benchmark. | [Link](https://github.com/modestyachts/ImageNetV2) | - |
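The processed archives are plain zip files, so they can be fetched and unpacked with the Python standard library. Below is a minimal convenience sketch (not part of this repository); the URL comes from the table above, while the target directory is an arbitrary choice.

```python
# Download and extract one of the pre-processed test sets listed above.
# The URL is taken from the table; the target directory is illustrative.
import urllib.request
import zipfile
from pathlib import Path

url = "https://ml.jku.at/research/CLOOB/downloads/zeroshot_datasets/flowers102.zip"
target = Path("data/zeroshot_datasets")
target.mkdir(parents=True, exist_ok=True)

archive = target / "flowers102.zip"
urllib.request.urlretrieve(url, archive)
with zipfile.ZipFile(archive) as zf:
    zf.extractall(target)
```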
## Usage
Below is an example command for pre-training on Conceptual Captions on 4 GPUs; with a per-GPU batch size of 128, this yields an effective batch size of 4 × 128 = 512.

```bash
python -u src/training/main.py \
--train-data="/conceptual_captions/Train-GCC-training_output.csv" \
--val-data="/conceptual_captions/Validation_GCC-1.1.0-Validation_output.csv" \
--path-data="/conceptual_captions" \
--imagenet-val="/imagenet/val" \
--warmup 20000 \
--batch-size=128 \
--lr=1e-3 \
--wd=0.1 \
--lr-scheduler="cosine-restarts" \
--restart-cycles=10 \
--epochs=70 \
--method="cloob" \
--init-inv-tau=30 \
--scale-hopfield=8 \
--workers=8 \
--model="RN50" \
--dist-url="tcp://127.0.0.1:6100" \
--batch-size-eval=512
```

### Zeroshot evaluation of downstream tasks
We provide a [Jupyter notebook](src/notebooks/zeroshot.ipynb) to perform zeroshot evaluation with a trained model.

## LICENSE
MIT LICENSE