
CSC 594: Human-like Visual Question Answering with Multimodal Transformers

A class project for DePaul University, Fall 2020. Intended, as usual, for learning purposes: in this case, to explore multimodality, multi-head self-attention, human attention data, et cetera. Please forgive any mistakes and misconceptions.

Abstract

Recently, research has increasingly focused on multimodal models that fuse image and language data to ground the learning of representations. One popular multimodal task is Visual Question Answering (VQA), which requires choosing the correct answer given an image and a question. In addition, datasets such as VQA-HAT (Human ATtention) enable researchers to study where human subjects attend within images when completing the VQA task. These data can also be used to supervise attention, inducing human biases in how machines attend to the same image-question pairs for the VQA task. In this work, we investigate attention supervision of a multimodal transformer model, LXMERT, focusing on its cross-modal attentions. We study the performance of the supervised model and compare the human and machine attentions. We find that performance is maintained despite successfully influencing the model to attend in a more human-like manner.
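For intuition, attention supervision can be sketched as an auxiliary loss that pulls the model's cross-modal attention distribution over image regions toward a human attention map (e.g., from VQA-HAT) pooled to the same regions. The snippet below is only an illustrative sketch, not this project's implementation: the function name, tensor shapes, and the choice of KL divergence as the distance measure are assumptions.

```python
import torch
import torch.nn.functional as F

def attention_supervision_loss(model_attn: torch.Tensor,
                               human_attn: torch.Tensor,
                               eps: float = 1e-8) -> torch.Tensor:
    """KL divergence between the model's cross-modal attention over image
    regions and a human attention map pooled to the same regions.

    model_attn: (batch, num_regions) attention weights per image region.
    human_attn: (batch, num_regions) human attention mass per region.
    """
    # Normalize both to valid probability distributions over regions.
    p_human = human_attn / (human_attn.sum(dim=-1, keepdim=True) + eps)
    p_model = model_attn / (model_attn.sum(dim=-1, keepdim=True) + eps)
    # KL(human || model): penalizes the model for ignoring regions humans use.
    return F.kl_div((p_model + eps).log(), p_human, reduction="batchmean")

# Hypothetical combination with the usual VQA answer-classification loss:
# total_loss = vqa_loss + lambda_attn * attention_supervision_loss(attn, hat_map)
```

In such a setup, the weight on the attention term (lambda_attn above) trades off matching human attention against answer accuracy; the abstract's finding is that accuracy can be maintained while the attentions become more human-like.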
