This is a PyTorch re-implementation of the paper Concatenated Masked Autoencoders as Spatial-Temporal Learner:
- pytorch (2.0.1)
- timm (0.4.12)
- decord
We use two datasets, Kinetics-400 and DAVIS-2017, for training and downstream tasks in total.
- Kinetics-400 used in our experiment comes from here.
- DAVIS-2017 used in our experiment comes from here.
Arguments set in the `config_file` take precedence over the command-line defaults.
To pre-train CatMAE-ViT-Small, run the following command:
```shell
python main_pretrain.py --config_file configs/pretrain_catmae_vit-s-16.json
```
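Since values from the `config_file` take precedence over the command line, the launcher presumably parses its CLI arguments first and then overwrites them with the JSON contents. A minimal sketch of that merging logic is below; the argument names `batch_size` and `blr` come from this README, while the defaults and the `load_args` helper itself are assumptions for illustration:

```python
import argparse
import json

def load_args(config_path, cli_args=None):
    # Hypothetical helper: parse CLI arguments, then let the JSON
    # config_file override them, matching the precedence stated above.
    parser = argparse.ArgumentParser()
    parser.add_argument("--batch_size", type=int, default=64)   # assumed default
    parser.add_argument("--blr", type=float, default=1.5e-4)    # assumed default
    args = parser.parse_args(cli_args or [])
    with open(config_path) as f:
        for key, value in json.load(f).items():
            setattr(args, key, value)  # config_file values win
    return args
```

The actual `main_pretrain.py` may merge in a different order; this only illustrates the stated precedence.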
Some important arguments:
- The `data_path` is `/path/to/Kinetics-400/videos_train/`.
- The effective batch size is `batch_size` (256) * number of GPUs (4) * `accum_iter` (2) = 2048.
- The effective number of epochs is `epochs` (150) * `repeated_sampling` (2) = 300.
- The default `model` is `catmae_vit_small` (with default `patch_size` and `decoder_dim_dep_head`); to train ViT-B, you can also change it to `catmae_vit_base`.
- Here we use `--norm_pix_loss` as the target for better representation learning.
- `blr` is the base learning rate. The actual `lr` is computed by the linear scaling rule: `lr` = `blr` * effective_batch_size / 256.
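The arithmetic above can be checked directly. The batch-size, GPU-count, `accum_iter`, `epochs`, and `repeated_sampling` values are the ones quoted in this README; the `blr` value is a hypothetical example, not taken from the configs:

```python
# Effective training scale, using the values quoted above.
batch_size = 256        # per-GPU batch size
num_gpus = 4
accum_iter = 2          # gradient accumulation steps
effective_batch_size = batch_size * num_gpus * accum_iter  # 2048

epochs = 150
repeated_sampling = 2
effective_epochs = epochs * repeated_sampling  # 300

# Linear scaling rule: lr = blr * effective_batch_size / 256.
blr = 1.5e-4            # hypothetical base learning rate for illustration
lr = blr * effective_batch_size / 256
```

With these numbers, `lr` comes out to 1.2e-3, eight times the base rate, because the effective batch is eight times the reference batch of 256.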
The following table provides the pre-trained checkpoints used in the paper:

| | ViT/16-Small | ViT/8-Small |
|---|---|---|
| pre-trained checkpoint | download | download |
| DAVIS 2017 J&Fm | 62.5 | 70.4 |
The video segmentation instructions are in DAVIS.md.
The action recognition instructions are in KINETICS400.md.