Official implementation of the IJCAI 2023 paper "Answer Mining from a Pool of Images: Towards Retrieval-Based Visual Question Answering".
paper | arxiv | project page
Use Python >= 3.8.5. Conda is recommended: https://docs.anaconda.com/anaconda/install/linux/
Use PyTorch 1.9.0 with CUDA 11.1.
To set up the environment:
conda env create -n retvqa --file retvqa.yml
conda activate retvqa
Images: Visual Genome (Krishna et al.)
RetVQA: here
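Once downloaded, the annotations can be inspected directly. Below is a minimal, hypothetical sketch of loading them; the file name and field names (`question`, `image_pool`, `relevant`, `answer`) are assumptions for illustration, not the released schema, so check the downloaded files for the actual layout:

```python
import json

# Hypothetical file name and schema -- verify against the actual release.
with open("retvqa_annotations.json") as f:
    data = json.load(f)

# Each RetVQA question is asked over a pool of images, only some of which
# are relevant to answering it.
qid, sample = next(iter(data.items()))
print("question:", sample["question"])          # assumed field
print("image pool:", sample["image_pool"])      # assumed field: Visual Genome image ids
print("relevant images:", sample["relevant"])   # assumed field
print("answer:", sample["answer"])              # assumed field
```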
Image feature extraction: inside feature_extraction/, run
python retqa_proposal.py
Refer to this repo for a more detailed setup of the Faster R-CNN feature extractor.
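Feature extractors of this kind typically write per-image region features and bounding boxes to disk. As a rough sketch of reading them back, assuming an HDF5 file keyed by image id (the output path and dataset names here are assumptions; check what retqa_proposal.py actually writes):

```python
import h5py

# Assumed output path and layout -- retqa_proposal.py may differ.
with h5py.File("features/retvqa_features.h5", "r") as f:
    image_id = next(iter(f.keys()))
    feats = f[image_id]["features"][()]  # e.g. (num_boxes, 2048) pooled Faster R-CNN features
    boxes = f[image_id]["boxes"][()]     # e.g. (num_boxes, 4) region coordinates
    print(image_id, feats.shape, boxes.shape)
```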
Refer to COFAR. Its ITM-only variant serves as the relevance encoder when pre-trained on COCO and fine-tuned on RetVQA.
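To give a sense of how the relevance encoder fits into the pipeline: it scores each (question, image) pair, and the top-scoring images from the pool are passed on to the answering model. A toy sketch of that selection step, assuming a `score_itm(question, image)` function wrapping the ITM head (hypothetical; the actual interface lives in the COFAR code):

```python
def retrieve_relevant(question, image_pool, score_itm, top_k=2):
    """Rank pool images by image-text matching (ITM) relevance score
    and keep the top-k as the retrieved set for answering."""
    scored = [(score_itm(question, img), img) for img in image_pool]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [img for _, img in scored[:top_k]]
```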
Training: bash retvqa_VLBart.sh <no. of GPUs>
Testing (with ground-truth relevant images): bash retvqa_VLBart_test.sh <no. of GPUs>
Testing (with retrieved images): bash retvqa_retrieved_VLBart_test.sh <no. of GPUs>
Download our RetVQA fine-tuned checkpoint from here.
To evaluate generated answers:
python eval_retvqa.py --gt_file <ground truth answers file path> --results_file <path to the generated answers>
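eval_retvqa.py is the authoritative scorer. For intuition only, here is a stripped-down sketch of an exact-match accuracy computation, assuming both files are JSON dicts mapping question ids to answer strings (the file format is an assumption):

```python
import json
import sys

# Assumed format: {"<question_id>": "<answer string>", ...} in both files;
# check eval_retvqa.py for the real expected layout.
with open(sys.argv[1]) as f:
    gt = json.load(f)          # ground-truth answers
with open(sys.argv[2]) as f:
    results = json.load(f)     # generated answers

def normalize(ans):
    return ans.strip().lower()

correct = sum(normalize(results.get(qid, "")) == normalize(ans)
              for qid, ans in gt.items())
print(f"accuracy: {correct / len(gt):.4f}")
```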
This code and data are released under the MIT license.
If you find this data/code/paper useful for your research, please consider citing:
@inproceedings{retvqa,
  author    = {Abhirama Subramanyam Penamakuri and Manish Gupta and Mithun Das Gupta and Anand Mishra},
  title     = {Answer Mining from a Pool of Images: Towards Retrieval-Based Visual Question Answering},
  booktitle = {IJCAI},
  publisher = {ijcai.org},
  year      = {2023},
  url       = {https://doi.org/10.24963/ijcai.2023/146},
  doi       = {10.24963/ijcai.2023/146},
}
- We used the codebase and pre-trained models of VLBart.
- Abhirama S. Penamakuri is supported by the Prime Minister's Research Fellowship (PMRF), Ministry of Education, Government of India.
- We thank Microsoft for supporting this work through the Microsoft Academic Partnership Grant (MAPG) 2021.