
If you like our project, please give us a star ⭐ on GitHub for the latest updates.


💡 Series projects: ✨

[arXiv] STGT: Video-Language Alignment via Spatio-Temporal Graph Transformer
Shi-Xue Zhang, Hongfa Wang, Xiaobin Zhu, Weibo Gu, Tianjin Zhang, Chun Yang, Wei Liu, Xu-Cheng Yin

1. ToDo

  • Release eval code
  • Release scripts for testing
  • Release pre-trained models
  • Release fine-tuned models for each benchmark
  • Release training code
  • Release training scripts

NOTE: We will open-source the training code after the paper is accepted.

2. Prepare Dataset

  1. Download the corresponding video data using the download scripts we have collected; a sketch of one such run follows.
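For illustration, a download run might look like the following. The script name and output directory here are hypothetical placeholders, so check download_scripts for the actual file names:

# Hypothetical example; substitute the real script name from download_scripts.
cd download_scripts
sh download_msrvtt.sh ./data/msrvtt    # fetch MSRVTT videos into ./data/msrvtt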

3. Environment

pip install -r requirements_all.txt
pip install -r requirements_all-1.txt
sh pip_install.sh
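If you prefer not to touch your system Python, installing into a fresh virtual environment first is a safe option. This is a generic sketch; the environment name "stgt-env" is our own choice, not something the repo requires:

# Optional: create an isolated environment before installing.
python3 -m venv stgt-env
source stgt-env/bin/activate
# ...then run the two pip installs and pip_install.sh above.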

4. Download Models

The model links have been shared here.

Pre-training models:

Data                 Model                          Extraction Code
W10M+VIDAL4M-256     checkpoint_w10m_v4m_256.pth    gxym
W10M+VIDAL4M-1024    checkpoint_w10m_v4m_1024.pth   gxym
W10M+VIDAL7M-256     checkpoint_w10m_v7m_256.pth    gxym

Fine-tuned models:

Dataset      Model 1                             Model 2                            Extraction Code
didemo_ret   checkpoint_best_w10m_v4m_1024.pth   checkpoint_best_w10m_v7m_256.pth   gxym
lsmdc_ret    checkpoint_best_w10m_v4m_1024.pth   checkpoint_best_w10m_v4m_256.pth   gxym
msrvtt_ret   checkpoint_best_w10m_v4m_1024.pth   checkpoint_best_w10m_v7m_256.pth   gxym
msvd_ret     checkpoint_best_w10m_v4m_1024.pth   checkpoint_best_w10m_v7m_256.pth   gxym

The pretrained CLIP ViT models are available here.
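As a hedged suggestion, one way to organize the downloads is a single checkpoints directory. The layout and the CLIP file name below are our own assumptions, so just make sure the paths you configure in the eval scripts match wherever you actually put the files:

# Assumed layout, not mandated by the repo.
checkpoints/
  pretrain/checkpoint_w10m_v4m_1024.pth
  finetune/msrvtt_ret/checkpoint_best_w10m_v7m_256.pth
  clip/ViT-B-16.pt    # hypothetical CLIP ViT checkpoint name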

5. Eval and Testing

You can find the corresponding evaluation scripts here; configure the model path in each script, then run it directly.

DiDemo:  eval_didemo_ret_pretrain_vig.sh
LSMDC:   eval_lsmdc_ret_pretrain_vig.sh
MSRVTT:  eval_msrvtt_ret_pretrain_vig.sh
MSVD:    eval_msvd_ret_pretrain_vig.sh
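For example, a minimal MSRVTT run might look like the following. The CKPT line is a hypothetical illustration of "configure the model path"; the exact variable names inside the scripts may differ:

# Open the script and point its model path at your downloaded checkpoint, e.g.:
#   CKPT=checkpoints/finetune/msrvtt_ret/checkpoint_best_w10m_v7m_256.pth
sh eval_msrvtt_ret_pretrain_vig.sh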

We also provide ALPRO evaluation scripts; you can download the ALPRO model for comparative testing.

NOTE: Because the code went through a desensitization process before release, we cannot guarantee it is bug-free, but we will fix any reported bugs promptly.

License

This project is licensed under the MIT License; see the LICENSE.md file for details.

✏️ Citation

If you find our paper and code useful in your research, please consider giving a star ⭐ and citation 📝.

@article{zhang2024video,
  title={Video-Language Alignment Pre-training via Spatio-Temporal Graph Transformer},
  author={Zhang, Shi-Xue and Wang, Hongfa and Zhu, Xiaobin and Gu, Weibo and Zhang, Tianjin and Yang, Chun and Liu, Wei and Yin, Xu-Cheng},
  journal={arXiv preprint arXiv:2407.11677},
  year={2024}
}

✨ Star History

