Official PyTorch Implementation of VideoMAE.

[Figure: VideoMAE framework overview]


VideoMAE: Masked Autoencoders are Data-Efficient Learners for Self-Supervised Video Pre-Training
Zhan Tong, Yibing Song, Jue Wang, Limin Wang
Nanjing University, Tencent AI Lab

📰 News

* [2022.10.20] The pre-trained models and scripts for ViT-S and ViT-H are available!
* [2022.10.19] The pre-trained models and scripts for UCF101 are available!
* [2022.9.15] VideoMAE is accepted by NeurIPS 2022! 🎉
* [2022.8.8] 👀 VideoMAE is now integrated into 🤗 HuggingFace Transformers! A minimal usage sketch follows this list.
* [2022.7.7] We have updated the results on the downstream AVA 2.2 benchmark. Please refer to our paper for details.
* [2022.4.24] Code and pre-trained models are available now!
* [2022.4.15] The LICENSE of this project has been upgraded to CC-BY-NC 4.0.
* [2022.3.24] Code and pre-trained models will be released here. Feel free to watch this repository for the latest updates.
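
Because VideoMAE is available through 🤗 HuggingFace Transformers, a fine-tuned checkpoint can be used for classification in a few lines. The snippet below is a minimal sketch, assuming a recent `transformers` release and the public `MCG-NJU/videomae-base-finetuned-kinetics` checkpoint; it classifies a dummy 16-frame clip and is not part of this repository's own training or evaluation code.

```python
import numpy as np
import torch
from transformers import VideoMAEImageProcessor, VideoMAEForVideoClassification

# A dummy clip: 16 RGB frames of 224x224 (replace with real decoded video frames).
video = list(np.random.randint(0, 256, (16, 224, 224, 3), dtype=np.uint8))

# Assumed checkpoint: a ViT-B model fine-tuned on Kinetics-400.
ckpt = "MCG-NJU/videomae-base-finetuned-kinetics"
processor = VideoMAEImageProcessor.from_pretrained(ckpt)
model = VideoMAEForVideoClassification.from_pretrained(ckpt)

inputs = processor(video, return_tensors="pt")   # pixel_values: (1, 16, 3, 224, 224)
with torch.no_grad():
    logits = model(**inputs).logits              # (1, num_classes) class logits

print(model.config.id2label[logits.argmax(-1).item()])
```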

✨ Highlights

🔥 Masked Video Modeling for Video Pre-Training

VideoMAE performs masked video modeling for video pre-training. We propose an extremely high masking ratio (90%-95%) and a tube masking strategy to create a challenging task for self-supervised video pre-training.
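
As a concrete illustration, the sketch below shows the core of tube masking: one random spatial mask is sampled per video and shared across all temporal token positions, so the same ~90% of patches are hidden in every frame. This is an illustrative reimplementation, not necessarily identical to the mask generator shipped in this repository.

```python
import numpy as np

def tube_mask(num_token_frames: int, num_patches_per_frame: int, mask_ratio: float = 0.9):
    """Sample one spatial mask and repeat it over time ("tube" masking).

    Returns a boolean array of shape (num_token_frames * num_patches_per_frame,)
    where True marks a masked space-time patch.
    """
    num_masked = int(mask_ratio * num_patches_per_frame)
    # Per-frame mask: the same spatial patches are masked in every frame.
    frame_mask = np.hstack([
        np.zeros(num_patches_per_frame - num_masked, dtype=bool),
        np.ones(num_masked, dtype=bool),
    ])
    np.random.shuffle(frame_mask)
    # Repeating the spatial pattern along the temporal axis yields "tubes".
    return np.tile(frame_mask, (num_token_frames,))

# Example: 16 frames with tubelet size 2 -> 8 token frames; 224/16 = 14, so 196 patches per frame.
mask = tube_mask(num_token_frames=8, num_patches_per_frame=196, mask_ratio=0.9)
print(mask.shape, mask.mean())  # (1568,) ~0.9
```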

⚡️ A Simple, Efficient and Strong Baseline in SSVP

VideoMAE uses a simple masked autoencoder with a plain ViT backbone for self-supervised video learning. Thanks to the extremely high masking ratio, the pre-training time of VideoMAE is much shorter than that of contrastive learning methods (3.2x speedup). VideoMAE can serve as a simple yet strong baseline for future research in self-supervised video pre-training.
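
The efficiency gain comes from the asymmetric encoder-decoder design: the ViT encoder attends only over the ~10% of visible tokens, while a narrow decoder reconstructs the masked patches. Below is a rough sketch of the gather step; tensor names and shapes are illustrative, not the repository's exact code.

```python
import torch

B, N, D = 2, 1568, 768                 # batch, space-time tokens, embedding dim
keep = N - int(0.9 * N)                # ~10% of tokens stay visible
tokens = torch.randn(B, N, D)          # patch embeddings for the full video

# Sample the same number of visible tokens per sample (tube masking keeps this
# count fixed), then gather them; the encoder never sees the masked 90%, so its
# attention cost shrinks roughly quadratically with the kept fraction.
ids = torch.rand(B, N).argsort(dim=1)[:, :keep]                   # (B, keep) visible indices
visible = torch.gather(tokens, 1, ids.unsqueeze(-1).expand(B, keep, D))
print(visible.shape)                   # torch.Size([2, 157, 768])
```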

😮 High performance, but NO extra data required

VideoMAE works well on video datasets of different scales and achieves 87.4% on Kinetics-400, 75.4% on Something-Something V2, 91.3% on UCF101, and 62.6% on HMDB51. To the best of our knowledge, VideoMAE is the first to achieve state-of-the-art performance on these four popular benchmarks with a vanilla ViT backbone, without requiring any extra data or pre-trained models.

🚀 Main Results

✨ Something-Something V2

| Method | Extra Data | Backbone | Resolution | #Frames x Clips x Crops | Top-1 | Top-5 |
| :------: | :--------: | :------: | :--------: | :---------------------: | :---: | :---: |
| VideoMAE | no | ViT-S | 224x224 | 16x2x3 | 66.8 | 90.3 |
| VideoMAE | no | ViT-B | 224x224 | 16x2x3 | 70.8 | 92.4 |
| VideoMAE | no | ViT-L | 224x224 | 16x2x3 | 74.3 | 94.6 |
| VideoMAE | no | ViT-L | 224x224 | 32x1x3 | 75.4 | 95.2 |

✨ Kinetics-400

| Method | Extra Data | Backbone | Resolution | #Frames x Clips x Crops | Top-1 | Top-5 |
| :------: | :--------: | :------: | :--------: | :---------------------: | :---: | :---: |
| VideoMAE | no | ViT-S | 224x224 | 16x5x3 | 79.0 | 93.8 |
| VideoMAE | no | ViT-B | 224x224 | 16x5x3 | 81.5 | 95.1 |
| VideoMAE | no | ViT-L | 224x224 | 16x5x3 | 85.2 | 96.8 |
| VideoMAE | no | ViT-H | 224x224 | 16x5x3 | 86.6 | 97.1 |
| VideoMAE | no | ViT-L | 320x320 | 32x4x3 | 86.1 | 97.3 |
| VideoMAE | no | ViT-H | 320x320 | 32x4x3 | 87.4 | 97.6 |

✨ AVA 2.2

| Method | Extra Data | Extra Label | Backbone | #Frame x Sample Rate | mAP |
| :------: | :----------: | :---------: | :------: | :------------------: | :--: |
| VideoMAE | Kinetics-400 | ✗ | ViT-S | 16x4 | 22.5 |
| VideoMAE | Kinetics-400 | ✓ | ViT-S | 16x4 | 28.4 |
| VideoMAE | Kinetics-400 | ✗ | ViT-B | 16x4 | 26.7 |
| VideoMAE | Kinetics-400 | ✓ | ViT-B | 16x4 | 31.8 |
| VideoMAE | Kinetics-400 | ✗ | ViT-L | 16x4 | 34.3 |
| VideoMAE | Kinetics-400 | ✓ | ViT-L | 16x4 | 37.0 |
| VideoMAE | Kinetics-400 | ✗ | ViT-H | 16x4 | 36.5 |
| VideoMAE | Kinetics-400 | ✓ | ViT-H | 16x4 | 39.5 |
| VideoMAE | Kinetics-700 | ✗ | ViT-L | 16x4 | 36.1 |
| VideoMAE | Kinetics-700 | ✓ | ViT-L | 16x4 | 39.3 |

✨ UCF101 & HMDB51

| Method | Extra Data | Backbone | UCF101 | HMDB51 |
| :------: | :----------: | :------: | :----: | :----: |
| VideoMAE | no | ViT-B | 91.3 | 62.6 |
| VideoMAE | Kinetics-400 | ViT-B | 96.1 | 73.3 |

🔨 Installation

Please follow the instructions in INSTALL.md.

➡️ Data Preparation

Please follow the instructions in DATASET.md for data preparation.

🔄 Pre-training

The pre-training instruction is in PRETRAIN.md.

⤴️ Fine-tuning with pre-trained models

The fine-tuning instruction is in FINETUNE.md.

📍Model Zoo

We provide pre-trained and fine-tuned models in MODEL_ZOO.md.

👀 Visualization

We provide the script for visualization in vis.sh. A Colab notebook for better visualization is coming soon.

☎️ Contact

Zhan Tong: [email protected]

👍 Acknowledgements

Thanks to Ziteng Gao, Lei Chen, Chongjian Ge, and Zhiyu Zhao for their kind support.
This project is built upon MAE-pytorch and BEiT. Thanks to the contributors of these great codebases.

🔒 License

The majority of this project is released under the CC-BY-NC 4.0 license as found in the LICENSE file. Portions of the project are available under separate license terms: SlowFast and pytorch-image-models are licensed under the Apache 2.0 license. BEiT is licensed under the MIT license.

✏️ Citation

If you find this project helpful, please feel free to leave a star ⭐️ and cite our paper:

```bibtex
@inproceedings{tong2022videomae,
  title={Video{MAE}: Masked Autoencoders are Data-Efficient Learners for Self-Supervised Video Pre-Training},
  author={Zhan Tong and Yibing Song and Jue Wang and Limin Wang},
  booktitle={Advances in Neural Information Processing Systems},
  year={2022}
}

@article{videomae,
  title={VideoMAE: Masked Autoencoders are Data-Efficient Learners for Self-Supervised Video Pre-Training},
  author={Tong, Zhan and Song, Yibing and Wang, Jue and Wang, Limin},
  journal={arXiv preprint arXiv:2203.12602},
  year={2022}
}
```
