This repository provides the code for training our cross-modal video retrieval architecture. Our approach is described in the paper "Multi-modal Transformer for Video Retrieval" [arXiv, webpage].
Our proposed Multi-Modal Transformer (MMT) aggregates sequences of multi-modal features (e.g. appearance, motion, audio, OCR) extracted from a video. It then embeds the aggregated multi-modal feature into a space shared with text, so that videos can be retrieved from text queries. At the time of publication (ECCV 2020), it achieved state-of-the-art performance on the MSRVTT, ActivityNet and LSMDC datasets.
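As a rough illustration of the retrieval step only (not of the MMT model itself), the sketch below ranks videos by cosine similarity between a text embedding and video embeddings that live in the same space. The toy vectors, the video ids, and the `cosine_similarity` and `rank_videos` helpers are hypothetical and are not part of this repository; using plain cosine similarity is also an assumption made for the example.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def rank_videos(text_embedding, video_embeddings):
    """Return video ids sorted from most to least similar to the text query."""
    scores = {
        video_id: cosine_similarity(text_embedding, embedding)
        for video_id, embedding in video_embeddings.items()
    }
    return sorted(scores, key=scores.get, reverse=True)


# Toy 3-dimensional embeddings (hypothetical values, for illustration only).
videos = {
    "video_a": [1.0, 0.0, 0.0],
    "video_b": [0.0, 1.0, 0.0],
    "video_c": [0.7, 0.7, 0.0],
}
query = [0.9, 0.1, 0.0]

print(rank_videos(query, videos))
```

In a real setup, the embeddings would come from the trained video and text encoders rather than being written by hand.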
git clone https://github.com/gabeur/mmt.git
- Python 3.7
- PyTorch 1.4.0
- Transformers 3.1.0
- NumPy 1.18.1
cd mmt
# Install the requirements
pip install -r requirements.txt
To reproduce the results of our ECCV 2020 Spotlight paper, first download the video features by running the following commands:
# Create and move to mmt/data directory
mkdir data
cd data
# Download the video features
wget https://pascal.inrialpes.fr/data2/vgabeur/video-features/MSRVTT.tar.gz
wget https://pascal.inrialpes.fr/data2/vgabeur/video-features/activity-net.tar.gz
wget https://pascal.inrialpes.fr/data2/vgabeur/video-features/LSMDC.tar.gz
# Extract the video features
tar -xvf MSRVTT.tar.gz
tar -xvf activity-net.tar.gz
tar -xvf LSMDC.tar.gz
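To check what each archive unpacks to, a small helper can list the top-level entries stored inside it. This is a minimal sketch, assuming it is run from the `data` directory where the archives were downloaded; `top_level_entries` is a hypothetical helper, not part of this repository. The names of the extracted directories are not assumed here, they are read from the archives themselves.

```python
import os
import tarfile
from pathlib import PurePosixPath

# Archive names taken from the download commands above.
ARCHIVES = ["MSRVTT.tar.gz", "activity-net.tar.gz", "LSMDC.tar.gz"]


def top_level_entries(archive_path):
    """Return the sorted top-level names stored in a tar archive."""
    top_levels = set()
    with tarfile.open(archive_path, "r:*") as archive:
        for name in archive.getnames():
            parts = PurePosixPath(name).parts
            if parts:  # skip a bare "." entry
                top_levels.add(parts[0])
    return sorted(top_levels)


if __name__ == "__main__":
    for archive_name in ARCHIVES:
        if os.path.exists(archive_name):
            print(archive_name, "->", top_level_entries(archive_name))
        else:
            print(archive_name, "not found in the current directory")
```

Listing a large compressed archive requires reading through it, so this can take a while on the full feature files.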
You can then run the following scripts from the root of the repository (the mmt directory, one level above data):
Training from scratch on MSRVTT
python -m train --config configs_pub/eccv20/MSRVTT_jsfusion_trainval.json
Training from scratch on ActivityNet
python -m train --config configs_pub/eccv20/ActivityNet_val1_trainval.json
Training from scratch on LSMDC
python -m train --config configs_pub/eccv20/LSMDC_full_trainval.json
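If you want to launch the three training runs one after another, a small wrapper script is one option. The sketch below is an assumption-laden convenience, not part of this repository: `build_command` and `run_all` are hypothetical helpers, the script is assumed to be run from the repository root, and it defaults to a dry run that only prints the commands.

```python
import subprocess
import sys

# Config paths taken from the training commands above.
CONFIGS = [
    "configs_pub/eccv20/MSRVTT_jsfusion_trainval.json",
    "configs_pub/eccv20/ActivityNet_val1_trainval.json",
    "configs_pub/eccv20/LSMDC_full_trainval.json",
]


def build_command(config_path, python=sys.executable):
    """Build the training command for a single config file."""
    return [python, "-m", "train", "--config", config_path]


def run_all(configs=CONFIGS, dry_run=True):
    """Run each training job in order; with dry_run=True only print the commands."""
    for config_path in configs:
        command = build_command(config_path)
        print(" ".join(command))
        if not dry_run:
            # check=True stops the sequence if a training run fails.
            subprocess.run(command, check=True)


if __name__ == "__main__":
    run_all(dry_run=True)
```

Switch to `run_all(dry_run=False)` once the printed commands look right.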
If you find this code useful or use the "s3d" (motion) video features, please consider citing:
@inproceedings{gabeur2020mmt,
TITLE = {{Multi-modal Transformer for Video Retrieval}},
AUTHOR = {Gabeur, Valentin and Sun, Chen and Alahari, Karteek and Schmid, Cordelia},
BOOKTITLE = {{European Conference on Computer Vision (ECCV)}},
YEAR = {2020}
}
The features "face", "ocr", "rgb" (appearance), "scene" and "speech" were extracted by the authors of Collaborative Experts. If you use those features, please consider citing:
@inproceedings{Liu2019a,
author = {Liu, Y. and Albanie, S. and Nagrani, A. and Zisserman, A.},
booktitle = {British Machine Vision Conference},
title = {Use What You Have: Video retrieval using representations from collaborative experts},
date = {2019}
}
Our code follows the project template proposed by @victoresque and builds on the implementations of Collaborative Experts, Transformers and Mixture of Embedding Experts.