
Official repository for Vita-CLIP: Video and text adaptive CLIP via Multimodal Prompting [CVPR 2023]

Syed Talal Wasim, Muzammal Naseer, Salman Khan, Fahad Shahbaz Khan, Mubarak Shah

Paper Link

Abstract: Adopting contrastive image-text pretrained models like CLIP towards video classification has gained attention due to its cost-effectiveness and competitive performance. However, recent works in this area face a trade-off. Finetuning the pretrained model to achieve strong supervised performance results in low zero-shot generalization. Similarly, freezing the backbone to retain zero-shot capability causes a significant drop in supervised accuracy. Because of this, recent works in the literature typically train separate models for supervised and zero-shot action recognition. In this work, we propose a multimodal prompt learning scheme that balances the supervised and zero-shot performance under a single unified training. Our prompting approach on the vision side caters to three aspects: 1) global video-level prompts to model the data distribution; 2) local frame-level prompts to provide per-frame discriminative conditioning; and 3) a summary prompt to extract a condensed video representation. Additionally, we define a prompting scheme on the text side to augment the textual context. Through this prompting scheme, we achieve state-of-the-art zero-shot performance on Kinetics-600, HMDB51 and UCF101 while remaining competitive in the supervised setting. By keeping the pretrained backbone frozen, we optimize a much smaller number of parameters and retain the existing general representation, which helps achieve the strong zero-shot performance.

(Figure: Vita-CLIP overview)

Updates:

April 24, 2023: Released supervised training code for Vita-CLIP (stay tuned for zero-shot evaluation scripts and pretrained models)

Environment Setup

Refer to requirements.txt to install all Python dependencies. We use Python 3.8.13 with PyTorch 1.14.0.
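
A quick way to verify that the installed environment matches the versions above is to print them from Python (an illustrative check, not part of the repository):

# Environment sanity check (illustrative only).
import sys
import torch

print("Python:", sys.version.split()[0])        # expected: 3.8.x
print("PyTorch:", torch.__version__)            # expected: the version pinned in requirements.txt
print("CUDA available:", torch.cuda.is_available())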

Supervised Training

Dataset Preparation

We download the official version of Kinetics-400 from here and resize the videos using the code here.

We expect the --train_list_path and --val_list_path command-line arguments to each point to a data list file of the following format:

<path_1>,<label_1>
<path_2>,<label_2>
...
<path_n>,<label_n>

where <path_i> points to a video file, and <label_i> is an integer between 0 and num_classes - 1. --num_classes should also be specified as a command-line argument.

Additionally, <path_i> may be a relative path when --data_root is specified; the actual path is then resolved relative to the directory passed as --data_root.

The class mappings used in the open-source weights are provided at Kinetics-400 class mappings.
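
For illustration, a data list in the format above can be read as in the following minimal sketch; the file and directory names are placeholders, and the repository's own data loader may differ:

import os

def read_data_list(list_path, data_root=None, num_classes=400):
    """Parse a `<path>,<label>` list file into (video_path, label) pairs."""
    samples = []
    with open(list_path) as f:
        for line in f:
            line = line.strip()
            if not line:
                continue
            path, label = line.rsplit(",", 1)   # split on the last comma
            label = int(label)
            assert 0 <= label < num_classes, f"label {label} out of range"
            if data_root is not None:
                # Resolve relative paths against --data_root
                path = os.path.join(data_root, path)
            samples.append((path, label))
    return samples

# Example usage (paths are placeholders):
# samples = read_data_list("k400_train_list.csv", data_root="/data/kinetics400", num_classes=400)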

Download Pretrained CLIP Checkpoint

Download the pretrained CLIP checkpoint and place it under the pretrained directory.
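
If the checkpoint is an OpenAI CLIP model, one way to fetch it into the pretrained directory is via the official clip package; the ViT-B/16 variant below is an assumption, so use whichever backbone the training script expects:

# Downloads an OpenAI CLIP checkpoint into ./pretrained (the variant is an assumption).
import clip  # pip install git+https://github.com/openai/CLIP.git

model, preprocess = clip.load("ViT-B/16", device="cpu", download_root="pretrained")
# The .pt file saved under ./pretrained is what --backbone_path should point to.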

Training Instruction

For supervised training on the Kinetics-400 dataset, use the train script in the train_scripts directory. Modify --train_list_path and --val_list_path according to the data location, and modify --backbone_path according to where the pretrained checkpoint was downloaded and stored.
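
As an illustration of how these arguments fit together, a training run can be launched along the following lines; the entry-point name and all paths are placeholders, so check the script in train_scripts for the exact invocation and the remaining flags:

# Illustrative launch only; the entry point and paths below are placeholders.
import subprocess

subprocess.run([
    "python", "main.py",                        # entry-point name is an assumption
    "--train_list_path", "lists/k400_train.csv",
    "--val_list_path", "lists/k400_val.csv",
    "--data_root", "/data/kinetics400",
    "--num_classes", "400",
    "--backbone_path", "pretrained/ViT-B-16.pt",
], check=True)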

Zero-shot Evaluation

Scripts for zero-shot evaluation will be released soon.

Pretrained Models

The Vita-CLIP-B checkpoint can be downloaded here.

Citation

If you find our work, this repository, or the pretrained models useful, please consider giving a star ⭐ and a citation.

@inproceedings{wasim2023vitaclip,
    title={Vita-CLIP: Video and text adaptive CLIP via Multimodal Prompting},
    author={Syed Talal Wasim and Muzammal Naseer and Salman Khan and Fahad Shahbaz Khan and Mubarak Shah},
    booktitle={CVPR},
    year={2023}
}

Acknowledgements

Our code is based on the EVL and XCLIP repositories. We thank the authors for releasing their code. If you use our model, please consider citing these works as well.
