💡 Series projects ✨
[arXiv] STGT: Video-Language Alignment Pre-training via Spatio-Temporal Graph Transformer
Shi-Xue Zhang, Hongfa Wang, Xiaobin Zhu, Weibo Gu, Tianjin Zhang, Chun Yang, Wei Liu, Xu-Cheng Yin
- Release eval code
- Release scripts for testing
- Release pre-trained models
- Release fine-tuned models for each benchmark
- Release training code
- Release training scripts
NOTE: The training code will be open-sourced after the paper is accepted.
- Download the corresponding video data using the download scripts we have collected (a hypothetical usage sketch is shown below).
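A minimal sketch of the download step, assuming per-dataset scripts live under download_scripts/; the script name here is hypothetical, so check the directory for the actual file names:

```sh
# Hypothetical script name: list download_scripts/ for the actual per-dataset files.
sh download_scripts/download_msrvtt.sh
```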
- We provide the environment dependencies. First install requirements_all.txt, then requirements_all-1.txt:
```sh
pip install -r requirements_all.txt
pip install -r requirements_all-1.txt
```
- Alternatively, you can run pip_install.sh directly:
```sh
sh pip_install.sh
```
The model links have been shared here.
| Data | W10M+VIDAL4M-256 | W10M+VIDAL4M-1024 | W10M+VIDAL7M-256 | Extraction Code |
|---|---|---|---|---|
| Pre-training Models | checkpoint_w10m_v4m_256.pth | checkpoint_w10m_v4m_1024.pth | checkpoint_w10m_v7m_256.pth | gxym |
| Dataset | model-1 | model-2 | Extraction Code |
|---|---|---|---|
| didemo_ret | checkpoint_best_w10m_v4m_1024.pth | checkpoint_best_w10m_v7m_256.pth | gxym |
| lsmdc_ret | checkpoint_best_w10m_v4m_1024.pth | checkpoint_best_w10m_v4m_256.pth | gxym |
| msrvtt_ret | checkpoint_best_w10m_v4m_1024.pth | checkpoint_best_w10m_v7m_256.pth | gxym |
| msvd_ret | checkpoint_best_w10m_v4m_1024.pth | checkpoint_best_w10m_v7m_256.pth | gxym |
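As a quick sanity check after downloading, you can confirm that a checkpoint loads; this assumes the .pth files are standard PyTorch checkpoints saved as a dict, which is an assumption about the save format:

```sh
# Assumes a dict-style PyTorch checkpoint; the printed keys depend on how it was saved.
python -c "import torch; ckpt = torch.load('checkpoint_best_w10m_v4m_1024.pth', map_location='cpu'); print(list(ckpt.keys()))"
```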
The CLIP ViT pre-trained models are available here.
You can find the corresponding evaluation scripts here; configure the model path and run them directly (see the example after this list):
- DiDemo: eval_didemo_ret_pretrain_vig.sh
- LSMDC: eval_lsmdc_ret_pretrain_vig.sh
- MSRVTT: eval_msrvtt_ret_pretrain_vig.sh
- MSVD: eval_msvd_ret_pretrain_vig.sh
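For example, to evaluate on MSRVTT (exactly how the checkpoint path is set inside the script is an assumption; inspect the script for the variable it expects):

```sh
# Edit the checkpoint path inside the script first, then run it.
sh eval_msrvtt_ret_pretrain_vig.sh
```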
We also provide ALPRO evaluation scripts; you can download the ALPRO model for comparative testing.
NOTE: Due to the code desensitization process, we cannot guarantee the code is bug-free, but we will fix any reported bugs promptly.
This project is licensed under the MIT License. See the LICENSE.md file for details.
If you find our paper and code useful in your research, please consider giving a star ⭐ and citation 📝.
```bibtex
@article{zhang2024video,
  title={Video-Language Alignment Pre-training via Spatio-Temporal Graph Transformer},
  author={Zhang, Shi-Xue and Wang, Hongfa and Zhu, Xiaobin and Gu, Weibo and Zhang, Tianjin and Yang, Chun and Liu, Wei and Yin, Xu-Cheng},
  journal={arXiv preprint arXiv:2407.11677},
  year={2024}
}
```