This is a PyTorch re-implementation of the paper *Concatenated Masked Autoencoders as Spatial-Temporal Learner*. The code requires:
- pytorch (2.0.1)
- timm==0.4.12
- decord
We use two datasets in total, Kinetics-400 and DAVIS-2017, for pre-training and downstream tasks.
- The Kinetics-400 data used in our experiments comes from here.
- The DAVIS-2017 data used in our experiments comes from here.
To pre-train Cat-ViT-Small (the recommended default), run the following command:
bash pretrain.sh configs/pretrain.json
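For orientation, here is a minimal, hypothetical sketch of what `configs/pretrain.json` might contain, using only the argument names explained in the list below; the exact schema of the real config file and the `blr` value are assumptions, not taken from this repository.

```python
import json
import os

# Hypothetical pretrain.json contents; key names mirror the arguments described
# below, and the blr value is a placeholder (not specified in this README).
pretrain_cfg = {
    "model": "catmae_vit_small",  # switch to "catmae_vit_base" for ViT-B
    "batch_size": 512,            # per-GPU batch size
    "accum_iter": 2,              # gradient accumulation steps
    "epochs": 800,
    "repeated_sampling": 2,       # repeated sampling factor per epoch
    "norm_pix_loss": True,        # use normalized pixels as the reconstruction target
    "blr": 1.5e-4,                # base learning rate (placeholder value)
}

os.makedirs("configs", exist_ok=True)
with open("configs/pretrain.json", "w") as f:
    json.dump(pretrain_cfg, f, indent=2)
```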
pretrain.json (only contains some arguments):
- The effective batch size is `batch_size` (512) * `gpus` (2) * `accum_iter` (2) = 2048.
- The effective number of epochs is `epochs` (800) * `repeated_sampling` (2) = 1600.
- The default `model` is `catmae_vit_small` (with the default `patch_size` and `decoder_dim_dep_head`); to train ViT-B, you can also change it to `catmae_vit_base`.
- Here we use `--norm_pix_loss` as the target for better representation learning.
- `blr` is the base learning rate. The actual `lr` is computed by the linear scaling rule: `lr` = `blr` * effective batch size / 256, as sketched below.
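The bookkeeping above can be summarized in a short sketch; the `blr` value is a placeholder and the variable names are illustrative, not the repository's actual code.

```python
# Effective batch size, effective epochs, and the linear lr scaling rule described
# above. blr is a placeholder; in practice it is read from pretrain.json.
batch_size, gpus, accum_iter = 512, 2, 2
epochs, repeated_sampling = 800, 2
blr = 1.5e-4

effective_batch_size = batch_size * gpus * accum_iter   # 512 * 2 * 2 = 2048
effective_epochs = epochs * repeated_sampling           # 800 * 2 = 1600
lr = blr * effective_batch_size / 256                   # linear scaling rule

print(effective_batch_size, effective_epochs, lr)       # 2048 1600 0.0012
```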