Concatenated Masked Autoencoders as Spatial-Temporal Learner: A PyTorch Implementation

This is a PyTorch re-implementation of the paper Concatenated Masked Autoencoders as Spatial-Temporal Learner.

Requirements

Data Preparation

We use two datasets in total: Kinetics-400 and DAVIS-2017, covering pre-training and the downstream tasks.

  • Kinetics-400 used in our experiments comes from here.
  • DAVIS-2017 used in our experiments comes from here.
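CatMAE pre-trains on frames sampled from video clips, which is presumably why the effective epoch count below scales with repeated_sampling. As a rough illustration of what the Kinetics-400 loading step involves, here is a minimal frame-pair loader; it assumes each video has already been decoded into a folder of ordered frame images, and the class name, directory layout, and frame-gap range are all illustrative assumptions rather than this repo's actual pipeline:

import os
import random
from PIL import Image
from torch.utils.data import Dataset

class FramePairDataset(Dataset):
    # Illustrative sketch only: samples two nearby frames from each video,
    # assuming root/<video_id>/<frame>.jpg after offline decoding.
    def __init__(self, root, max_gap=4, transform=None):
        self.videos = [os.path.join(root, d) for d in sorted(os.listdir(root))]
        self.max_gap = max_gap
        self.transform = transform

    def __len__(self):
        return len(self.videos)

    def __getitem__(self, idx):
        frames = sorted(os.listdir(self.videos[idx]))
        gap = random.randint(1, self.max_gap)
        start = random.randint(0, len(frames) - gap - 1)
        pair = []
        for i in (start, start + gap):
            img = Image.open(os.path.join(self.videos[idx], frames[i])).convert("RGB")
            pair.append(self.transform(img) if self.transform else img)
        return pair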

Pre-training

To pre-train Cat-ViT-Small (the recommended default), run the following command:

bash pretrain.sh configs/pretrain.json

pretrain.json (which contains only a subset of the arguments):

  • The effective batch size is batch_size (512) * gpus (2) * accum_iter (2) = 2048.
  • The effective number of epochs is epochs (800) * repeated_sampling (2) = 1600.
  • The default model is catmae_vit_small (with the default patch_size and decoder_dim_dep_head); to train ViT-B, change it to catmae_vit_base.
  • Here we use --norm_pix_loss (per-patch normalized pixels) as the reconstruction target for better representation learning; a sketch of this normalization follows the list.
  • blr is the base learning rate. The actual lr is computed by the linear scaling rule: lr = blr * effective batch size / 256 (a worked example follows the list).
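To make the scaling rule concrete, here is a hypothetical pretrain.json built from the argument names above; the shipped config's exact schema and defaults may differ, and the blr value (1.5e-4, MAE's default) is an assumption:

{
  "model": "catmae_vit_small",
  "batch_size": 512,
  "accum_iter": 2,
  "epochs": 800,
  "repeated_sampling": 2,
  "norm_pix_loss": true,
  "blr": 1.5e-4
}

With these values, lr = 1.5e-4 * (512 * 2 * 2) / 256 = 1.2e-3.

The --norm_pix_loss target follows MAE: each patch's pixels are normalized by that patch's own mean and standard deviation before the reconstruction loss is computed. A minimal sketch of that normalization, with tensor shapes assumed:

import torch

def normalize_patch_targets(target: torch.Tensor) -> torch.Tensor:
    # target: (batch, num_patches, pixels_per_patch), already patchified
    mean = target.mean(dim=-1, keepdim=True)
    var = target.var(dim=-1, keepdim=True)
    # the small epsilon avoids division by zero on constant patches
    return (target - mean) / (var + 1e-6) ** 0.5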

Pre-trained checkpoints

Coming Soon

Video segmentation in DAVIS-2017

Coming Soon

Action recognition in Kinetics-400

Coming Soon
