This is a PyTorch re-implementation of the paper Concatenated Masked Autoencoders as Spatial-Temporal Learner:
- pytorch (2.0.1)
- timm (0.4.12)
- decord
We use two datasets, Kinetics-400 and DAVIS-2017, for training and downstream tasks in total.
- Kinetics-400 used in our experiment comes from here.
- DAVIS-2017 used in our experiment comes from here.
Arguments set in the `config_file` take precedence over the command-line defaults.
To pre-train CatMAE-ViT-Small, run the following command:
```shell
python main_pretrain.py --config_file configs/pretrain_catmae_vit-s-16.json
```
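Since values from the `config_file` take precedence over the command line, the launcher presumably parses its CLI arguments first and then overwrites them with the JSON contents. A minimal sketch of that merging logic is below; the argument names `batch_size` and `blr` come from this README, while the defaults and the `load_args` helper itself are assumptions for illustration:

```python
import argparse
import json

def load_args(config_path, cli_args=None):
    # Hypothetical helper: parse CLI arguments, then let the JSON
    # config_file override them, matching the precedence stated above.
    parser = argparse.ArgumentParser()
    parser.add_argument("--batch_size", type=int, default=64)   # assumed default
    parser.add_argument("--blr", type=float, default=1.5e-4)    # assumed default
    args = parser.parse_args(cli_args or [])
    with open(config_path) as f:
        for key, value in json.load(f).items():
            setattr(args, key, value)  # config_file values win
    return args
```

The actual `main_pretrain.py` may merge in a different order; this only illustrates the stated precedence.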
Some important arguments:
- The `data_path` is `/path/to/Kinetics-400/videos_train/`.
- The effective batch size is `batch_size` (256) * number of GPUs (4) * `accum_iter` (2) = 2048.
- The effective number of epochs is `epochs` (150) * `repeated_sampling` (2) = 300.
- The default `model` is `catmae_vit_small` (with default `patch_size` and `decoder_dim_dep_head`); to train ViT-B, you can also change it to `catmae_vit_base`.
- Here we use `--norm_pix_loss` as the target for better representation learning.
- `blr` is the base learning rate. The actual `lr` is computed by the linear scaling rule: `lr` = `blr` * effective_batch_size / 256.
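The arithmetic above can be checked directly. The batch-size, GPU-count, `accum_iter`, `epochs`, and `repeated_sampling` values are the ones quoted in this README; the `blr` value is a hypothetical example, not taken from the configs:

```python
# Effective training scale, using the values quoted above.
batch_size = 256        # per-GPU batch size
num_gpus = 4
accum_iter = 2          # gradient accumulation steps
effective_batch_size = batch_size * num_gpus * accum_iter  # 2048

epochs = 150
repeated_sampling = 2
effective_epochs = epochs * repeated_sampling  # 300

# Linear scaling rule: lr = blr * effective_batch_size / 256.
blr = 1.5e-4            # hypothetical base learning rate for illustration
lr = blr * effective_batch_size / 256
```

With these numbers, `lr` comes out to 1.2e-3, eight times the base rate, because the effective batch is eight times the reference batch of 256.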
The following table provides the pre-trained checkpoints used in the paper:

| | ViT/16-Small | ViT/8-Small |
|---|---|---|
| pre-trained checkpoint | download | download |
| DAVIS 2017 J&Fm | 62.5 | 70.4 |
The video segmentation instructions are in DAVIS.md.
The action recognition instructions are in KINETICS400.md.