CSC 594: Human-like Visual Question Answering with Multimodal Transformers

A class project for DePaul University, Fall 2020. Intended, as usual, for learning purposes: in this case to explore multimodality, multi-head self-attention, human data, et cetera. Please forgive any mistakes and misconceptions.

Abstract

Recent research has focused on multimodal models that fuse image and language data to ground the learning of representations. One popular multimodal task is Visual Question Answering (VQA), which requires choosing the correct answer given an image and a question. In addition, datasets such as VQA-HAT (Human ATtention) enable researchers to study where in an image human subjects attend while completing the VQA task. These data can also be used to supervise attention, inducing human biases in how machines attend to the same image-question pairs. In this work, we investigate attention supervision of a multimodal transformer model, LXMERT, specifically of its cross-modal attentions. We study the performance of the supervised model and compare the human and machine attentions. We find that supervision successfully influences the model to attend in a more human-like manner while task performance is maintained.
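
To make the idea of attention supervision concrete, here is a minimal sketch, not the project's actual implementation. It assumes the human attention map and the model's cross-modal attention have both been reduced to one non-negative weight per image region, and it assumes a KL-divergence penalty added to the VQA task loss. The abstract does not state which penalty the project uses, and the names `normalize`, `attention_kl`, `supervised_loss`, and `weight` are hypothetical.

```python
import numpy as np

def normalize(weights, eps=1e-8):
    """Turn non-negative per-region weights into a probability distribution."""
    weights = np.asarray(weights, dtype=np.float64) + eps  # eps avoids log(0)
    return weights / weights.sum()

def attention_kl(human_map, model_attention):
    """KL(human || model) over image regions (assumed penalty, see lead-in)."""
    p = normalize(human_map)
    q = normalize(model_attention)
    return float(np.sum(p * np.log(p / q)))

def supervised_loss(task_loss, human_map, model_attention, weight=1.0):
    """Task loss plus a weighted attention penalty (hypothetical combination)."""
    return task_loss + weight * attention_kl(human_map, model_attention)

# Hypothetical example with four image regions.
human = [0.7, 0.2, 0.1, 0.0]
uniform = [0.25, 0.25, 0.25, 0.25]
print(attention_kl(human, human))    # no penalty when the maps agree
print(attention_kl(human, uniform))  # positive penalty when they differ
```

Under these assumptions, the penalty is zero when the model attends exactly where humans do and grows as the two distributions diverge, so `weight` trades off answer accuracy against human-likeness of attention.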


erikmcguire/csc594_lxmert
