💡 Series projects ✨
[arXiv] STGT: Video-Language Alignment Pre-training via Spatio-Temporal Graph Transformer
Shi-Xue Zhang, Hongfa Wang, Xiaobin Zhu, Weibo Gu, Tianjin Zhang, Chun Yang, Wei Liu, Xu-Cheng Yin
- Release eval code
- Release scripts for testing
- Release pre-trained models
- Release fine-tuned models for each benchmark
- Release training code
- Release training scripts
NOTE: The training code will be open-sourced after the paper is accepted.
- Download the corresponding video data using the download scripts we have collected (a hypothetical usage sketch is shown below).
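A minimal sketch of the download step, assuming per-dataset scripts live under download_scripts/; the script name here is hypothetical, so check the directory for the actual file names:

```sh
# Hypothetical script name: list download_scripts/ for the actual per-dataset files.
sh download_scripts/download_msrvtt.sh
```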
- We provide the environment dependencies. First install requirements_all.txt, then requirements_all-1.txt:
```sh
pip install -r requirements_all.txt
pip install -r requirements_all-1.txt
```
- Alternatively, you can run pip_install.sh directly:
```sh
sh pip_install.sh
```
The model links have been shared here.
| Data | W10M+VIDAL4M-256 | W10M+VIDAL4M-1024 | W10M+VIDAL7M-256 | Extraction Code |
|---|---|---|---|---|
| Pre-training Models | checkpoint_w10m_v4m_256.pth | checkpoint_w10m_v4m_1024.pth | checkpoint_w10m_v7m_256.pth | gxym |
| Dataset | model-1 | model-2 | Extraction Code |
|---|---|---|---|
| didemo_ret | checkpoint_best_w10m_v4m_1024.pth | checkpoint_best_w10m_v7m_256.pth | gxym |
| lsmdc_ret | checkpoint_best_w10m_v4m_1024.pth | checkpoint_best_w10m_v4m_256.pth | gxym |
| msrvtt_ret | checkpoint_best_w10m_v4m_1024.pth | checkpoint_best_w10m_v7m_256.pth | gxym |
| msvd_ret | checkpoint_best_w10m_v4m_1024.pth | checkpoint_best_w10m_v7m_256.pth | gxym |
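As a quick sanity check after downloading, you can confirm that a checkpoint loads; this assumes the .pth files are standard PyTorch checkpoints saved as a dict, which is an assumption about the save format:

```sh
# Assumes a dict-style PyTorch checkpoint; the printed keys depend on how it was saved.
python -c "import torch; ckpt = torch.load('checkpoint_best_w10m_v4m_1024.pth', map_location='cpu'); print(list(ckpt.keys()))"
```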
The CLIP ViT pre-trained models are available here.
You can find the corresponding evaluation scripts here; configure the model path and run them directly (see the example after this list):
- DiDemo: eval_didemo_ret_pretrain_vig.sh
- LSMDC: eval_lsmdc_ret_pretrain_vig.sh
- MSRVTT: eval_msrvtt_ret_pretrain_vig.sh
- MSVD: eval_msvd_ret_pretrain_vig.sh
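For example, to evaluate on MSRVTT (exactly how the checkpoint path is set inside the script is an assumption; inspect the script for the variable it expects):

```sh
# Edit the checkpoint path inside the script first, then run it.
sh eval_msrvtt_ret_pretrain_vig.sh
```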
We also provide ALPRO evaluation scripts; you can download the ALPRO model for comparative testing.
NOTE: Due to the code desensitization process, we cannot guarantee the code is bug-free, but we will fix any reported bugs promptly.
This project is licensed under the MIT License. See the LICENSE.md file for details.
If you find our paper and code useful in your research, please consider giving a star ⭐ and citation 📝.
```bibtex
@article{zhang2024video,
  title={Video-Language Alignment Pre-training via Spatio-Temporal Graph Transformer},
  author={Zhang, Shi-Xue and Wang, Hongfa and Zhu, Xiaobin and Gu, Weibo and Zhang, Tianjin and Yang, Chun and Liu, Wei and Yin, Xu-Cheng},
  journal={arXiv preprint arXiv:2407.11677},
  year={2024}
}
```