- Authors: Shoubin Yu, Jaemin Cho, Prateek Yadav, Mohit Bansal
- arXiv: [2305.06988](https://arxiv.org/abs/2305.06988)
```bash
# data & data preprocessing
./sevila_data

# pretrained checkpoints
./sevila_checkpoints

# SeViLA code
./lavis/

# running scripts for SeViLA localizer/answerer training/inference
./run_scripts
```
- (Optional) Create a conda environment:

```bash
conda create -n sevila python=3.8
conda activate sevila
```

- Build from source:

```bash
pip install -e .
```
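As a quick sanity check (assuming the editable install exposes the `lavis` package from this repository), you can verify the environment before moving on:

```bash
# Confirm that the repo's lavis package is importable in the new environment.
python -c "import lavis; print('LAVIS imported successfully')"
```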
We pre-train the SeViLA localizer on QVHighlights and host the checkpoints on Hugging Face. Download the checkpoints and put them under ./sevila_checkpoints. The checkpoints (814.55M) contain the pre-trained localizer and the zero-shot answerer.
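A minimal download sketch is below; the Hugging Face repo ID and checkpoint file name are placeholders, so substitute the actual links from the checkpoint release:

```bash
mkdir -p sevila_checkpoints
# Placeholder URL: replace <repo-id> and <checkpoint-file> with the released checkpoint location.
wget -P sevila_checkpoints \
  https://huggingface.co/<repo-id>/resolve/main/<checkpoint-file>.pth
```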
We test our model on multiple downstream benchmarks (e.g., NExT-QA and QVHighlights, as used in the scripts below). Please download the original data and preprocess it with our scripts under ./sevila_data/.
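A hedged sketch of the expected workflow follows; the script name and flags are hypothetical, so check ./sevila_data/ for the actual per-dataset preprocessing scripts and their arguments:

```bash
# Hypothetical example: the real script names live under ./sevila_data/.
cd sevila_data
python preprocess_nextqa.py \
  --raw_anno_dir /path/to/downloaded/NExT-QA/annotations \
  --out_dir ./nextqa
```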
We provide SeViLA training and inference script examples as follows:
```bash
# Localizer pre-training (QVHighlights)
sh run_scripts/sevila/pre-train/pretrain_qvh.sh

# Localizer self-refinement (NExT-QA)
sh run_scripts/sevila/refinement/nextqa_sr.sh

# Answerer fine-tuning (NExT-QA)
sh run_scripts/sevila/finetune/nextqa_ft.sh

# Inference (NExT-QA)
sh run_scripts/sevila/inference/nextqa_infer.sh
```
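For example, to run NExT-QA inference from the repository root on a single GPU (assuming the scripts respect the standard CUDA environment variables):

```bash
# Restrict the run to GPU 0; adjust as needed for your hardware.
CUDA_VISIBLE_DEVICES=0 sh run_scripts/sevila/inference/nextqa_infer.sh
```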
We thank the developers of LAVIS, BLIP-2, CLIP, and All-in-one for their public code releases.
Please cite our paper if you use our models in your work:
```bibtex
@misc{yu2023selfchained,
  title={Self-Chained Image-Language Model for Video Localization and Question Answering},
  author={Shoubin Yu and Jaemin Cho and Prateek Yadav and Mohit Bansal},
  year={2023},
  eprint={2305.06988},
  archivePrefix={arXiv},
  primaryClass={cs.CV}
}
```