forked from Yui010206/SeViLA

Self-Chaining Image-Language Model for Temporal Localization and Question Answering on Videos

Self-Chained Image-Language Model for Video Localization and Question Answering

teaser image


Code structure

# Data & Data Preprocessing
./sevila_data

# Pretrained Checkpoints
./sevila_checkpoints

# SeViLA code
./lavis/

# Running scripts for SeViLA localizer/answerer training/inference
./run_scripts

Setup

Install Dependencies

  1. (Optional) Create a conda environment
conda create -n sevila python=3.8
conda activate sevila
  2. Build from source
pip install -e .

Download Pretrained Models

We pre-train the SeViLA localizer on QVHighlights and host checkpoints on Hugging Face. Download the checkpoints and put them under ./sevila_checkpoints. The checkpoints (814.55M) contain the pre-trained localizer and the zero-shot answerer.
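As a minimal sketch of the step above, assuming you run it from the repository root (the checkpoint file name here is a placeholder, not the real file name; use the Hugging Face link for the actual download):

```shell
# Create the checkpoint directory the run scripts expect.
mkdir -p sevila_checkpoints

# After downloading the checkpoint from the Hugging Face link in this README,
# move it into place (placeholder file name):
# mv sevila_pretrained.pth sevila_checkpoints/
```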

Dataset Preparation

We test our model on:

Please download the original data and preprocess it via our scripts under ./sevila_data/.

Training and Inference

We provide SeViLA training and inference script examples as follows:

1) Localizer Pre-training

sh run_scripts/sevila/pre-train/pretrain_qvh.sh

2) Localizer Self-refinement

sh run_scripts/sevila/refinement/nextqa_sr.sh

3) Answerer Fine-tuning

sh run_scripts/sevila/finetune/nextqa_ft.sh

4) Inference

sh run_scripts/sevila/inference/nextqa_infer.sh
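The four stages above run in order. As a rough sketch (not part of the official repo), a hypothetical driver script run from the repository root could chain them, skipping any stage script that is missing:

```shell
# Hypothetical helper: run the four SeViLA stages in README order.
# Stage script paths are taken verbatim from the sections above.
run_sevila_pipeline() {
    for stage in \
        run_scripts/sevila/pre-train/pretrain_qvh.sh \
        run_scripts/sevila/refinement/nextqa_sr.sh \
        run_scripts/sevila/finetune/nextqa_ft.sh \
        run_scripts/sevila/inference/nextqa_infer.sh
    do
        if [ -f "$stage" ]; then
            echo "running $stage"
            sh "$stage"
        else
            echo "skipping $stage (not found)"
        fi
    done
}

run_sevila_pipeline
```

Each stage assumes the checkpoints and preprocessed data from the earlier setup sections are already in place.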

Acknowledgments

We thank the developers of LAVIS, BLIP-2, CLIP, and All-in-one for their public code releases.

Reference

Please cite our paper if you use our models in your work:

@misc{yu2023selfchained,
      title={Self-Chained Image-Language Model for Video Localization and Question Answering}, 
      author={Shoubin Yu and Jaemin Cho and Prateek Yadav and Mohit Bansal},
      year={2023},
      eprint={2305.06988},
      archivePrefix={arXiv},
      primaryClass={cs.CV}
}
