Bottom-Up and Top-Down Attention for Image Captioning and Visual Question Answering

Anderson, Peter; He, Xiaodong; Buehler, Chris; Teney, Damien; Johnson, Mark; Gould, Stephen; Zhang, Lei

arXiv:1707.07998 (cs)

[Submitted on 25 Jul 2017 (v1), last revised 14 Mar 2018 (this version, v3)]

Abstract: Top-down visual attention mechanisms have been used extensively in image captioning and visual question answering (VQA) to enable deeper image understanding through fine-grained analysis and even multiple steps of reasoning. In this work, we propose a combined bottom-up and top-down attention mechanism that enables attention to be calculated at the level of objects and other salient image regions. This is the natural basis for attention to be considered. Within our approach, the bottom-up mechanism (based on Faster R-CNN) proposes image regions, each with an associated feature vector, while the top-down mechanism determines feature weightings. Applying this approach to image captioning, our results on the MSCOCO test server establish a new state-of-the-art for the task, achieving CIDEr / SPICE / BLEU-4 scores of 117.9, 21.5 and 36.9, respectively. Demonstrating the broad applicability of the method, applying the same approach to VQA we obtain first place in the 2017 VQA Challenge.
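
The division of labour described in the abstract (regions with feature vectors from the bottom-up stage, weightings from the top-down stage) can be illustrated with a minimal soft-attention sketch. This is not the paper's implementation: the names `attend`, `region_features` and `query` are hypothetical, the toy vectors stand in for Faster R-CNN region features, and dot-product scoring is a simplification assumed here for brevity.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attend(region_features, query):
    """Weight region feature vectors by their relevance to a task query.

    region_features: list of k vectors, a stand-in for the features of
        bottom-up region proposals (hypothetical toy data here).
    query: one vector of the same length, a stand-in for the top-down
        task context (e.g. a question or partial caption encoding).
    Dot-product scoring is an assumption, chosen only for simplicity.
    """
    # One unnormalised relevance score per region.
    scores = [sum(f * q for f, q in zip(feat, query)) for feat in region_features]
    # Normalise scores into attention weights that sum to 1.
    weights = softmax(scores)
    # Attended feature: weighted sum of the region vectors.
    dim = len(query)
    attended = [
        sum(w * feat[d] for w, feat in zip(weights, region_features))
        for d in range(dim)
    ]
    return weights, attended


# Three toy "regions" with 2-dimensional features.
regions = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
query = [2.0, 0.0]
weights, attended = attend(regions, query)
```

Regions whose features align with the query receive larger weights, so the attended vector is dominated by them; a learned scoring function would replace the dot product in a trained model.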

Comments:	CVPR 2018 full oral, winner of the 2017 Visual Question Answering challenge
Subjects:	Computer Vision and Pattern Recognition (cs.CV)
Cite as:	arXiv:1707.07998 [cs.CV]
	(or arXiv:1707.07998v3 [cs.CV] for this version)
	https://doi.org/10.48550/arXiv.1707.07998

Submission history

From: Peter Anderson
[v1] Tue, 25 Jul 2017 13:50:17 UTC (7,770 KB)
[v2] Thu, 10 Aug 2017 23:24:23 UTC (8,777 KB)
[v3] Wed, 14 Mar 2018 05:24:23 UTC (8,777 KB)