
img2llm-vqa

From Images to Textual Prompts: Zero-shot VQA with Frozen Large Language Models

This is the official code for the Img2LLM-VQA paper. The implementation is integrated into LAVIS.

Large language models (LLMs) have demonstrated excellent zero-shot generalization to new tasks. However, effectively applying LLMs to zero-shot visual question answering (VQA) remains challenging, primarily because of the modality disconnection and task disconnection between the LLM and the VQA task. We propose Img2LLM, a plug-and-play module that generates prompts which bridge these disconnections, so that LLMs can perform VQA without end-to-end training.

The accompanying figures illustrate the procedures for answer extraction, question generation, and caption prompts in Img2LLM. See the paper for details.

Demo

We include an interactive Colab notebook demo that walks through the Img2LLM-VQA inference workflow (a minimal usage sketch follows the list):

  1. Image-question matching: compute the relevancy score of the image patches with respect to the question, and remove generated noisy captions with low relevancy scores.
  2. Image captioning: generate question-guided captions based on the relevancy scores.
  3. Question generation: generate questions based on the synthetic answers and captions.
  4. Large language model: a pre-trained large language model (e.g., OPT or GPT-3) answers the question from the constructed prompt.
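
The sketch below shows how these four steps might be wired together on top of LAVIS. It is a minimal, illustrative sketch: `load_model_and_preprocess` is the standard LAVIS loader, but the registered model name `img2llm_vqa`, the model type `base`, and the per-step methods (`forward_itm`, `forward_cap`, `forward_qa_generation`, `prompts_construction`) are assumptions; consult the Colab notebook and the LAVIS model zoo for the exact names and signatures.

```python
import torch
from PIL import Image
from lavis.models import load_model_and_preprocess  # standard LAVIS entry point

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# NOTE: the model name and type below are assumptions; check the LAVIS model zoo
# and the demo Colab notebook for the exact registered name.
model, vis_processors, txt_processors = load_model_and_preprocess(
    name="img2llm_vqa", model_type="base", is_eval=True, device=device
)

raw_image = Image.open("example.jpg").convert("RGB")
question = "What is the person holding?"

samples = {
    "image": vis_processors["eval"](raw_image).unsqueeze(0).to(device),
    "text_input": [txt_processors["eval"](question)],
}

# 1. Image-question matching: patch relevancy scores, used to filter noisy captions.
samples = model.forward_itm(samples=samples)                    # assumed method name
# 2. Image captioning: question-guided captions from the relevant patches.
samples = model.forward_cap(samples=samples, num_captions=50)   # assumed method name
# 3. Question generation: synthetic (question, answer) exemplars from the captions.
samples = model.forward_qa_generation(samples)                  # assumed method name
# 4. Prompt construction: the textual prompt fed to a frozen LLM such as OPT.
prompt = model.prompts_construction(samples)                    # assumed method name
print(prompt)
```

The constructed prompt is then passed to the frozen LLM (e.g., OPT), which generates the answer without any gradient updates.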

Zero-Shot Evaluation

| Model | End-to-End Training? | VQAv2 val | VQAv2 test | OK-VQA test | AOK-VQA val | AOK-VQA test |
| --- | --- | --- | --- | --- | --- | --- |
| Frozen-7B | ✓ | 29.5 | - | 5.9 | - | - |
| Flamingo-9B | ✓ | - | 51.8 | 44.7 | - | - |
| Flamingo-80B | ✓ | - | 56.3 | 50.6 | - | - |
| Img2LLM-VQA-OPT13B | ✗ | 57.1 | 57.3 | 39.9 | 33.3 | 33.0 |
| Img2LLM-VQA-OPT30B | ✗ | 59.5 | 60.4 | 41.8 | 36.9 | 36.0 |
| Img2LLM-VQA-OPT66B | ✗ | 59.9 | 60.3 | 43.2 | 38.7 | 38.2 |
| Img2LLM-VQA-OPT175B | ✗ | 60.6 | 61.9 | 45.6 | 42.9 | 40.7 |

To reproduce the evaluation results of Img2LLM-VQA with different LLMs, follow the instructions in this folder.

Citation

If you find this code useful for your research, please consider citing:

@misc{guo2023from,
  title={From Images to Textual Prompts: Zero-shot {VQA} with Frozen Large Language Models},
  author={Jiaxian Guo and Junnan Li and Dongxu Li and Anthony Tiong and Boyang Li and Dacheng Tao and Steven HOI},
  year={2023},
  url={https://openreview.net/forum?id=Ck1UtnVukP8}
}