CAST: Cross-Attention in Space and Time for Video Action Recognition [NeurIPS 2023][Project Page][Arxiv]

CAST Framework


🔧 Installation

We ran all experiments on 16 NVIDIA GeForce RTX 3090 GPUs. First, install PyTorch 1.10.0 or later and torchvision 0.11.0.

conda create -n vmae_1.10 python=3.8 ipykernel -y
conda activate vmae_1.10
conda install pytorch==1.10.0 torchvision==0.11.0 torchaudio==0.10.0 -c pytorch

Then, install timm, triton, DeepSpeed, and others.

pip install triton==1.0.0
git clone
cd DeepSpeed
git checkout 3a3dfe66bb
DS_BUILD_OPS=1 pip install . --global-option="build_ext"
pip install TensorboardX decord einops scipy pandas requests

If DeepSpeed is installed successfully, running the 'ds_report' command prints a report of the installed environment and the compatible ops. For other DeepSpeed-related issues, please refer to the DeepSpeed GitHub page.
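Before launching a job, it can help to confirm that the installed versions match the ones named above. The following is a minimal sketch, not part of the CAST code base: the helper names (`version_tuple`, `meets_minimum`, `report`) are hypothetical, and treating the listed versions as minimums is an assumption.

```python
# Minimal sketch: compare installed package versions with the versions named
# in the installation steps. All helper names here are hypothetical.
import re
from importlib import metadata


def version_tuple(version):
    """Turn '1.10.0+cu113' into (1, 10, 0); local build tags are ignored."""
    public = version.split("+")[0]
    numbers = [int(part) for part in re.findall(r"\d+", public)][:3]
    numbers += [0] * (3 - len(numbers))  # pad '1.10' to (1, 10, 0)
    return tuple(numbers)


def meets_minimum(installed, minimum):
    """True if the installed version is at least the minimum version."""
    return version_tuple(installed) >= version_tuple(minimum)


def report(requirements):
    """Return {package: status} using package metadata only (no imports)."""
    status = {}
    for name, minimum in requirements.items():
        try:
            installed = metadata.version(name)
        except metadata.PackageNotFoundError:
            status[name] = "not installed"
            continue
        verdict = "ok" if meets_minimum(installed, minimum) else "older than " + minimum
        status[name] = installed + " (" + verdict + ")"
    return status


if __name__ == "__main__":
    # Assumption: the versions from the install commands are used as minimums.
    for name, line in report({"torch": "1.10.0", "torchvision": "0.11.0"}).items():
        print(name + ": " + line)
```

The check reads package metadata instead of importing torch, so it also works in an environment where the GPU drivers are not yet set up.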


📁 Data Preparation


  • The pre-processing of EPIC-KITCHENS-100 can be summarized in 4 steps:

    1. Download the dataset from the official website.

    2. Preprocess the dataset by resizing the short edge of each video to 256px. You can refer to MMAction2 Data Benchmark.

    3. Generate the annotations needed for the dataloader. The annotations usually consist of train.csv and val.csv, and each line of a *.csv file has the form "<video_id>,<verb_class>,<noun_class>".

    4. All video files are located inside the DATA_PATH.
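
The annotation step for EPIC-KITCHENS-100 can be sketched as below. This is an illustration only: the function names are hypothetical, the sample clip IDs and class indices in the usage note are made up, and since a header row is not mentioned above, none is written.

```python
# Minimal sketch: write and read "<video_id>,<verb_class>,<noun_class>" lines.
# Function names are hypothetical and not part of the CAST code base.
import csv


def write_ek100_annotations(rows, csv_path):
    """Write one comma-separated line per clip, without a header row."""
    with open(csv_path, "w", newline="") as handle:
        writer = csv.writer(handle, lineterminator="\n")
        for video_id, verb_class, noun_class in rows:
            writer.writerow([video_id, int(verb_class), int(noun_class)])


def read_ek100_annotations(csv_path):
    """Read the file back as (video_id, verb_class, noun_class) tuples."""
    with open(csv_path, newline="") as handle:
        return [(vid, int(verb), int(noun)) for vid, verb, noun in csv.reader(handle)]
```

For example, `write_ek100_annotations([("clip_0001", 3, 12)], "train.csv")` produces a file whose only line is `clip_0001,3,12`.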


  • The pre-processing of Something-Something-V2 can be summarized in 4 steps:

    1. Download the dataset from the official website.

    2. Preprocess the dataset by converting the videos from .webm to .mp4 while keeping the original height of 240px. You can refer to MMAction2 Data Benchmark.

    3. Generate the annotations needed for the dataloader ("<video_id> <video_class>" in annotations). The annotations usually consist of train.csv, val.csv and test.csv. Each *.csv file is formatted like this:

      video_1.mp4  label_1
      video_2.mp4  label_2
      video_3.mp4  label_3
      video_N.mp4  label_N
    4. All video files are located inside the DATA_PATH.
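
Once the annotations exist, a quick consistency check against DATA_PATH can catch missing or misnamed videos early. The sketch below is an assumption-laden illustration: the function names are hypothetical, labels are assumed to be integer class indices, and videos are assumed to sit directly under the data directory.

```python
# Minimal sketch: list annotated videos that are missing from the data folder.
# Function names are hypothetical and not part of the CAST code base.
from pathlib import Path


def parse_annotation_line(line):
    """Split '<video_id> <video_class>' on the last run of whitespace."""
    video_id, label = line.rsplit(None, 1)
    return video_id, int(label)  # assumption: labels are integer class indices


def missing_videos(annotation_path, data_path):
    """Return annotated video names that are not files under data_path."""
    data_root = Path(data_path)
    missing = []
    for line in Path(annotation_path).read_text().splitlines():
        if not line.strip():
            continue  # skip blank lines
        video_id, _ = parse_annotation_line(line)
        if not (data_root / video_id).is_file():
            missing.append(video_id)
    return missing
```

An empty list from `missing_videos("val.csv", DATA_PATH)` means every annotated video was found.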


  • The pre-processing of Kinetics400 can be summarized in 4 steps:

    1. Download the dataset from the official website or OpenDataLab.

    2. Preprocess the dataset by resizing the short edge of each video to 320px. You can refer to MMAction2 Data Benchmark.

    3. Generate the annotations needed for the dataloader ("<video_id> <video_class>" in annotations). The annotations usually consist of train.csv, val.csv and test.csv. Each *.csv file is formatted like this:

      video_1.mp4  label_1
      video_2.mp4  label_2
      video_3.mp4  label_3
      video_N.mp4  label_N

    4. All video files should be split into DATA_PATH/train and DATA_PATH/val.
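
The train/val split for Kinetics400 could be sketched as follows. This is a hedged illustration, not the authors' tooling: the function names are hypothetical, and it assumes the videos initially sit directly under DATA_PATH and that each annotation line starts with the video file name.

```python
# Minimal sketch: move videos into DATA_PATH/train and DATA_PATH/val based on
# the annotation files. Function names are hypothetical.
import shutil
from pathlib import Path


def read_video_names(annotation_path):
    """Return the '<video_id>' part of each '<video_id> <video_class>' line."""
    names = []
    for line in Path(annotation_path).read_text().splitlines():
        if line.strip():
            names.append(line.rsplit(None, 1)[0])
    return names


def split_videos(data_path, annotations):
    """Move annotated videos into data_path/<split>; return counts per split.

    annotations maps a split name ('train', 'val') to its annotation file.
    Videos that are not found directly under data_path are skipped.
    """
    data_root = Path(data_path)
    moved = {}
    for split, annotation_path in annotations.items():
        target = data_root / split
        target.mkdir(parents=True, exist_ok=True)
        count = 0
        for name in read_video_names(annotation_path):
            source = data_root / name
            if source.is_file():
                shutil.move(str(source), str(target / Path(name).name))
                count += 1
        moved[split] = count
    return moved
```

Comparing the returned counts with the number of lines in each annotation file shows whether any videos were skipped.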

Expert model preparation

We initialize CAST from pre-trained spatial and temporal experts. The spatial expert (CLIP) uses the official weights. The temporal expert (VideoMAE) uses weights pre-trained on each of the three datasets: EK100, K400, and SSV2. For K400 and SSV2 we use the official weights; for EK100 we use weights we pre-trained ourselves. Set VMAE_PATH and CLIP_PATH in the fine-tuning script to the downloaded expert weights.
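A small pre-flight check can confirm that both expert weights are in place before a long job starts. The sketch below is only an illustration: `check_expert_weights` and `VMAE_WEIGHT_SOURCE` are hypothetical names, and the check only verifies that the files exist, not that they are valid checkpoints.

```python
# Minimal sketch: verify that the expert weight files exist before fine-tuning.
# Names are hypothetical and not part of the CAST code base.
from pathlib import Path

# Where the temporal expert (VideoMAE) weights come from, per dataset,
# as described in the paragraph above.
VMAE_WEIGHT_SOURCE = {
    "EK100": "pre-trained by the authors",
    "K400": "official VideoMAE weights",
    "SSV2": "official VideoMAE weights",
}


def check_expert_weights(vmae_path, clip_path):
    """Return a list of problems; an empty list means both files were found."""
    problems = []
    for label, path in (("VMAE_PATH", vmae_path), ("CLIP_PATH", clip_path)):
        if not path:
            problems.append(label + " is not set")
        elif not Path(path).is_file():
            problems.append(label + " does not point to a file: " + str(path))
    return problems
```

Calling `check_expert_weights(VMAE_PATH, CLIP_PATH)` with the values used in the fine-tuning script should return an empty list.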

Fine-tuning CAST

We provide ready-to-use scripts in the scripts folder.

  • For example, to fine-tune CAST on Kinetics400 with 16 GPUs (2 nodes x 8 GPUs), run:

OMP_NUM_THREADS=1 python -m torch.distributed.launch \
  --nproc_per_node=8 \
  --master_port ${YOUR_NUMBER} --nnodes=2 \
  --node_rank=${YOUR_NUMBER} --master_addr=${YOUR_NUMBER} \
  --data_set Kinetics-400 \
  --nb_classes 400 \
  --vmae_model compo_bidir_vit_base_patch16_224 \
  --anno_path ${ANNOTATION_PATH} \
  --data_path ${DATA_PATH} \
  --clip_finetune ${CLIP_MODEL_PATH} \
  --vmae_finetune ${VMAE_MODEL_PATH} \
  --log_dir ${YOUR_PATH} \
  --output_dir ${YOUR_PATH} \
  --batch_size 6 \
  --input_size 224 \
  --short_side_size 224 \
  --save_ckpt_freq 25 \
  --num_sample 1 \
  --num_frames 16 \
  --opt adamw \
  --lr 1e-3 \
  --opt_betas 0.9 0.999 \
  --weight_decay 0.05 \
  --epochs 70 \
  --dist_eval \
  --test_num_segment 5 \
  --test_num_crop 3 \
  --num_workers 8 \
  --drop_path 0.2 \
  --layer_decay 0.75 \
  --mixup_switch_prob 0 \
  --mixup_prob 0.5 \
  --reprob 0. \
  --init_scale 1. \
  --update_freq 6 \
  --seed 0 \
  --enable_deepspeed \
  --warmup_epochs 5
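
As a rough sanity check of what these flags imply, the sketch below computes the number of samples per optimizer step and the number of test views per video. It rests on two assumptions that are not stated above: that --batch_size is per GPU and that --update_freq is a gradient-accumulation factor. The function names are hypothetical.

```python
# Minimal sketch: arithmetic implied by the fine-tuning flags.
# Assumptions: --batch_size is per GPU; --update_freq accumulates gradients.


def effective_batch_size(batch_size, update_freq, nodes, gpus_per_node):
    """Samples contributing to one optimizer step across all GPUs."""
    return batch_size * update_freq * nodes * gpus_per_node


def views_per_video(test_num_segment, test_num_crop):
    """Number of clips evaluated per video: temporal segments x spatial crops."""
    return test_num_segment * test_num_crop


if __name__ == "__main__":
    # 2 nodes x 8 GPUs with --batch_size 6 and --update_freq 6
    print(effective_batch_size(6, 6, 2, 8))  # 576
    # --test_num_segment 5 and --test_num_crop 3
    print(views_per_video(5, 3))  # 15
```

When running on fewer GPUs, raising --update_freq by the same factor keeps this product unchanged under the same assumptions.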


Evaluation command for EK100:

python ./ --fine_tune {YOUR_FINETUNED_WEIGHT} --composition --eval

Evaluation command for SSV2 and K400:

python ./ --fine_tune {YOUR_FINETUNED_WEIGHT} --eval

Model Zoo


Method Spatial Expert Temporal expert Epoch