ActBERT: Learning Global-Local Video-Text Representations

Zhu, Linchao; Yang, Yi

Computer Science > Computer Vision and Pattern Recognition

arXiv:2011.07231 (cs)

[Submitted on 14 Nov 2020]


Abstract: In this paper, we introduce ActBERT for self-supervised learning of joint video-text representations from unlabeled data. First, we leverage global action information to catalyze the mutual interactions between linguistic texts and local regional objects. This uncovers global and local visual clues from paired video sequences and text descriptions for detailed modeling of visual-text relations. Second, we introduce an ENtangled Transformer block (ENT) to encode three sources of information, i.e., global actions, local regional objects, and linguistic descriptions. Global-local correspondences are discovered by judiciously extracting clues from contextual information. This makes the joint video-text representation aware of fine-grained objects as well as global human intention. We validate the generalization capability of ActBERT on downstream video-and-language tasks, i.e., text-video clip retrieval, video captioning, video question answering, action segmentation, and action step localization. ActBERT significantly outperforms state-of-the-art methods, demonstrating its superiority in video-text representation learning.
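
To make the three-source idea concrete, here is a minimal NumPy sketch of how global action tokens could mediate attention between text tokens and region tokens. It is an illustration under stated assumptions, not the ENT block as specified in the paper: the function names (`attend`, `entangled_block_sketch`), the absence of learned projections, the single attention head, and the exact wiring between streams are all hypothetical choices made for brevity.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attend(query, key, value):
    # Single-head scaled dot-product attention without learned projections.
    scores = query @ key.T / np.sqrt(query.shape[-1])
    return softmax(scores) @ value

def entangled_block_sketch(action, region, text):
    """Hypothetical three-stream update (assumed wiring, not the paper's).

    action: (n_a, d) global action tokens
    region: (n_r, d) local regional object tokens
    text:   (n_t, d) linguistic tokens
    """
    # Global action tokens gather context from each local stream.
    action_from_text = attend(action, text, text)        # (n_a, d)
    action_from_region = attend(action, region, region)  # (n_a, d)

    # Each local stream attends to the other modality's context as
    # summarized through the action tokens, with a residual connection.
    text_out = text + attend(text, action_from_region, action_from_region)
    region_out = region + attend(region, action_from_text, action_from_text)

    # The action stream is refined by plain self-attention.
    action_out = action + attend(action, action, action)
    return action_out, region_out, text_out

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    a = rng.normal(size=(2, 8))   # 2 action tokens
    r = rng.normal(size=(5, 8))   # 5 region tokens
    t = rng.normal(size=(7, 8))   # 7 text tokens
    a2, r2, t2 = entangled_block_sketch(a, r, t)
    print(a2.shape, r2.shape, t2.shape)
```

Each stream keeps its own length and feature size, so blocks of this kind can be stacked. A real implementation would add learned query, key, and value projections, multiple heads, layer normalization, and feed-forward sublayers.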

Comments:	A few new results are included
Subjects:	Computer Vision and Pattern Recognition (cs.CV)
Cite as:	arXiv:2011.07231 [cs.CV]
	(or arXiv:2011.07231v1 [cs.CV] for this version)
	https://doi.org/10.48550/arXiv.2011.07231

Submission history

From: Linchao Zhu
[v1] Sat, 14 Nov 2020 07:14:08 UTC (1,240 KB)


