This project hosts the TensorFlow implementation of our ECCV 2018 paper, A Joint Sequence Fusion Model for Video Question Answering and Retrieval.
If you use this code or dataset as part of any published research, please cite the following paper.
@inproceedings{yu2018joint,
author = {Youngjae Yu and Jongseok Kim and Gunhee Kim},
title = "{A Joint Sequence Fusion Model for Video Question Answering and Retrieval}",
booktitle = {ECCV},
year = 2018
}
pip install -r requirements.txt
git submodule update --init --recursive
add2virtualenv .
- Video Feature
  - Download LSMDC data.
  - Extract RGB features using the pool5 layer of the pretrained ResNet-152 model.
  - Extract audio features using VGGish.
  - Concatenate the RGB and audio features and save them to an HDF5 file at 'dataset/LSMDC/LSMDC16_features/RESNET_pool5wav.hdf5'.
- Dataset
  - We processed the raw dataframe files of the LSMDC17 and MSR-VTT datasets.
  - Download the dataframe files.
  - Save these files in "dataset/LSMDC/DataFrame".
- Vocabulary
  - We build the word embedding matrix from GloVe vectors.
  - Download the vocabulary files.
  - Save these files in "dataset/LSMDC/Vocabulary".
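
The Video Feature steps above end with a concatenation and an HDF5 write. As a rough sketch only, the snippet below assumes the per-clip RGB and audio features have already been extracted as 2-D arrays (time steps by feature dimension). The helper names, the policy of truncating to the shorter sequence, and the one-dataset-per-clip key layout are all assumptions, not necessarily the layout the training code expects.

```python
import numpy as np


def concat_clip_features(rgb, audio):
    """Concatenate RGB and audio features along the feature axis.

    Assumption: both inputs are (num_steps, dim) arrays; if their lengths
    differ, both are truncated to the shorter one.
    """
    rgb = np.asarray(rgb, dtype=np.float32)
    audio = np.asarray(audio, dtype=np.float32)
    steps = min(len(rgb), len(audio))
    return np.concatenate([rgb[:steps], audio[:steps]], axis=1)


def save_features(path, features_by_clip):
    """Write one HDF5 dataset per clip ID (hypothetical key layout)."""
    import h5py  # imported here so the helper above works without h5py

    with h5py.File(path, "w") as f:
        for clip_id, feats in features_by_clip.items():
            f.create_dataset(clip_id, data=feats)
```

With 2048-dimensional ResNet-152 pool5 features and 128-dimensional VGGish embeddings, `concat_clip_features` would return 2176-dimensional rows, which could then be passed to `save_features` together with the path 'dataset/LSMDC/LSMDC16_features/RESNET_pool5wav.hdf5'.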
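
For readers who want to rebuild the Vocabulary files rather than download them, the following is a minimal sketch of deriving an embedding matrix from a GloVe text file. The function name, the vocabulary list, and the random initialisation of words missing from GloVe are assumptions for illustration; the downloadable vocabulary files may have been produced differently.

```python
import io

import numpy as np


def build_embedding_matrix(glove_lines, vocab, dim):
    """Return a (len(vocab), dim) matrix whose row i embeds vocab[i].

    glove_lines: iterable of lines in GloVe text format ("word v1 v2 ...").
    Words missing from GloVe keep small random vectors (an assumption;
    zero vectors or a shared unknown-word vector are other common choices).
    """
    index = {word: i for i, word in enumerate(vocab)}
    rng = np.random.default_rng(0)
    matrix = rng.normal(scale=0.1, size=(len(vocab), dim)).astype(np.float32)
    for line in glove_lines:
        parts = line.rstrip().split(" ")
        word, values = parts[0], parts[1:]
        if word in index and len(values) == dim:
            matrix[index[word]] = [float(v) for v in values]
    return matrix


# Tiny in-memory stand-in for a real GloVe text file.
toy_glove = io.StringIO("man 0.1 0.2 0.3\nwalks 0.4 0.5 0.6\n")
embeddings = build_embedding_matrix(toy_glove, ["<unk>", "man", "walks"], dim=3)
```

In practice `glove_lines` would be an open file handle on the GloVe vectors, and `vocab` the word list built from the dataset captions.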
Modify configuration.py to suit your environment.
- train_tag can be 'MC' or 'FIB'
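
As an illustration of the kind of settings involved, here is a hypothetical stand-in for the settings class. Only the attribute names `train_tag` and `load_from_ckpt` come from this README; the class name, defaults, and validation are assumptions, so check the actual configuration.py before editing.

```python
class ModelConfig(object):
    """Hypothetical stand-in for the settings class in configuration.py."""

    VALID_TAGS = ("MC", "FIB")

    def __init__(self, train_tag="MC", load_from_ckpt=None):
        if train_tag not in self.VALID_TAGS:
            raise ValueError("train_tag must be one of %s" % (self.VALID_TAGS,))
        # Which task to train: 'MC' or 'FIB'.
        self.train_tag = train_tag
        # Checkpoint path to restore from, or None to train from scratch.
        self.load_from_ckpt = load_from_ckpt


config = ModelConfig(train_tag="FIB", load_from_ckpt="path/to/checkpoint/")
```

Rejecting unknown tags early is a design choice of this sketch; it surfaces a typo in the tag before any data is loaded.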
Run train.py.
python train.py --tag="tag"
You can download the models and features from the gDrive link. Modify 'configuration.py' to load the checkpoints (self.load_from_ckpt = 'path/to/checkpoint/').
[RET] R@1: 93, R@5: 247, R@10: 348, medr : 29
[FIB] Accuracy: 45.1
Your results may be slightly lower or higher than these scores.
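
The [RET] line reports recall at K and the median rank. A minimal sketch of how such numbers are typically derived from per-query ranks follows. It reports R@K as raw hit counts, which is an assumption about how the figures above are scaled, and the function name is illustrative rather than taken from this repository.

```python
from statistics import median


def retrieval_metrics(ranks, ks=(1, 5, 10)):
    """Summarise 1-based ranks of the ground-truth item for each query.

    Returns the number of queries whose ground truth is ranked within the
    top k (for each k), and the median rank over all queries.
    """
    hits = {k: sum(1 for r in ranks if r <= k) for k in ks}
    return hits, median(ranks)


# Five toy queries whose correct video was ranked 1st, 3rd, 7th, 12th, 40th.
hits, medr = retrieval_metrics([1, 3, 7, 12, 40])
```

Dividing each hit count by the number of queries would give recall as a fraction instead of a count.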