This repo contains the code to reproduce the spatio-temporal action detection results of TubeR: Tubelet Transformer for Video Action Detection.
08/08/2022 Initial commits
| Backbone | Pretrain | #view | mAP | FLOPs | config | model |
|---|---|---|---|---|---|---|
| CSN-50 | Kinetics-400 | 1 view | 27.2 | 78G | config | S3 |
| CSN-50 (with long-term context) | Kinetics-400 | 1 view | 28.8 | 78G | config | Coming soon |
| CSN-152 | Kinetics-400+IG65M | 1 view | 29.7 | 120G | config | S3 |
| CSN-152 (with long-term context) | Kinetics-400+IG65M | 1 view | 31.7 | 120G | config | Coming soon |
| Backbone | Pretrain | #view | mAP | FLOPs | config | model |
|---|---|---|---|---|---|---|
| CSN-152 | Kinetics-400+IG65M | 1 view | 31.1 | 120G | config | S3 |
| CSN-152 (with long-term context) | Kinetics-400+IG65M | 1 view | 33.4 | 120G | config | Coming soon |
| Backbone | #view | frame mAP@0.5 | video mAP@0.5 | config | model |
|---|---|---|---|---|---|
| CSN-152 | 1 view | 87.4 | 82.3 | config | S3 |
The project is developed based on GluonCV-torch. Please refer to the tutorial for details.
The project has been tested with:
- Torch 1.12 + CUDA 11.3
- timm==0.4.5
- tensorboardX
Please download asset.zip and unzip it to ./datasets.
- [AVA] Please refer to DATASET.md for AVA dataset downloading and pre-processing.
- [JHMDB] Please refer to JHMDB for the JHMDB dataset, and to the Dataset section for the UCF dataset. You can also refer to ACT-Detector to prepare the two datasets.
To run inference, first modify the config file:
- Set the correct `WORLD_SIZE`, `GPU_WORLD_SIZE`, `DIST_URL`, `WOLRD_URLS` based on your experiment setup.
- Set `LABEL_PATH`, `ANNO_PATH`, `DATA_PATH` to your local directories accordingly.
- Download the pre-trained model and set `PRETRAINED_PATH` to the model path.
- Make sure `LOAD` and `LOAD_FC` are set to `True`.
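A minimal sketch of what the edited section of the YAML config might look like. The key names are taken from this README, but the layout and example values are assumptions and may differ from the actual configuration file:

```yaml
# Illustrative values only; adjust to your setup.
WORLD_SIZE: 1                      # number of nodes
GPU_WORLD_SIZE: 8                  # total number of GPUs across nodes
DIST_URL: tcp://localhost:23456    # rendezvous address for distributed init
WOLRD_URLS: ['localhost']          # node addresses (spelling as in the config)

LABEL_PATH: /path/to/labels        # hypothetical local paths
ANNO_PATH: /path/to/annotations
DATA_PATH: /path/to/videos

PRETRAINED_PATH: /path/to/downloaded/model.pth
LOAD: True        # load the full pre-trained model for evaluation
LOAD_FC: True     # also load the classifier head
```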
Then run:
```shell
# run testing
python3 eval_tuber_ava.py <CONFIG_FILE>

# for example, to evaluate TubeR on AVA, run:
python3 eval_tuber_ava.py configuration/TubeR_CSN152_AVA21.yaml
```
To train TubeR from scratch, first modify the config file:
- Set the correct `WORLD_SIZE`, `GPU_WORLD_SIZE`, `DIST_URL`, `WOLRD_URLS` based on your experiment setup.
- Set `LABEL_PATH`, `ANNO_PATH`, `DATA_PATH` to your local directories accordingly.
- Download the pre-trained feature backbone and transformer weights, and set `PRETRAIN_BACKBONE_DIR` (CSN50, CSN152) and `PRETRAIN_TRANSFORMER_DIR` (DETR) accordingly.
- Make sure `LOAD` and `LOAD_FC` are set to `False`.
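For training, the same distributed and dataset fields apply as for inference, but the backbone and transformer weights are loaded separately; a hedged sketch of the training-specific keys (names from this README, layout and values assumed):

```yaml
# Illustrative values only; adjust to your setup.
PRETRAIN_BACKBONE_DIR: /path/to/csn_backbone_weights    # CSN50 or CSN152 backbone
PRETRAIN_TRANSFORMER_DIR: /path/to/detr_weights         # DETR transformer weights
LOAD: False       # do not load a full TubeR checkpoint when training from scratch
LOAD_FC: False    # classifier head is trained from scratch
```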
Then run:
```shell
# run training from scratch
python3 train_tuber.py <CONFIG_FILE>

# for example, to train TubeR on AVA from scratch, run:
python3 train_tuber_ava.py configuration/TubeR_CSN152_AVA21.yaml
```
- [ ] Add tutorial and pre-trained weights for TubeR with long-term memory
- [ ] Add weights for UCF24
```
@inproceedings{zhao2022tuber,
  title={TubeR: Tubelet transformer for video action detection},
  author={Zhao, Jiaojiao and Zhang, Yanyi and Li, Xinyu and Chen, Hao and Shuai, Bing and Xu, Mingze and Liu, Chunhui and Kundu, Kaustav and Xiong, Yuanjun and Modolo, Davide and others},
  booktitle={Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition},
  pages={13598--13607},
  year={2022}
}
```