OmniVL: One Foundation Model for Image-Language and Video-Language Tasks

Wang, Junke; Chen, Dongdong; Wu, Zuxuan; Luo, Chong; Zhou, Luowei; Zhao, Yucheng; Xie, Yujia; Liu, Ce; Jiang, Yu-Gang; Yuan, Lu

arXiv:2209.07526 (cs)

[Submitted on 15 Sep 2022 (v1), last revised 19 Oct 2022 (this version, v2)]

Abstract: This paper presents OmniVL, a new foundation model that supports both image-language and video-language tasks with one universal architecture. It adopts a unified transformer-based visual encoder for both image and video inputs, and can therefore perform joint image-language and video-language pretraining. We demonstrate, for the first time, that such a paradigm benefits both image and video tasks, as opposed to the conventional one-directional transfer (e.g., using image-language to help video-language). To this end, we propose a decoupled joint pretraining of image-language and video-language that effectively decomposes vision-language modeling into spatial and temporal dimensions and yields a performance boost on both image and video tasks. Moreover, we introduce a novel unified vision-language contrastive (UniVLC) loss to leverage image-text, video-text, image-label (e.g., image classification), and video-label (e.g., video action recognition) data together, so that both supervised and noisily supervised pretraining data are utilized as much as possible. Without incurring extra task-specific adaptors, OmniVL can simultaneously support visual-only tasks (e.g., image classification, video action recognition), cross-modal alignment tasks (e.g., image/video-text retrieval), and multi-modal understanding and generation tasks (e.g., image/video question answering, captioning). We evaluate OmniVL on a wide range of downstream tasks and achieve state-of-the-art or competitive results with similar model size and data scale.
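
The abstract does not spell out the UniVLC loss, so the following is only a minimal sketch of the general idea of a label-aware contrastive loss that can mix caption-style and label-style data. It assumes that every sample carries an integer label, that image-text or video-text pairs each get a unique label, and that image-label or video-label samples of the same class share one. The function names, the temperature value, and the averaging over positives are all hypothetical choices for illustration and are not taken from the paper.

```python
import math


def _normalize(vec):
    """Scale a vector to unit L2 norm so dot products are cosine similarities."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def _one_direction(queries, keys, labels, temperature):
    """For each query, average the negative log-softmax over all same-label keys."""
    total = 0.0
    for i, q in enumerate(queries):
        logits = [sum(a * b for a, b in zip(q, k)) / temperature for k in keys]
        # Numerically stable log of the softmax denominator.
        m = max(logits)
        log_denom = m + math.log(sum(math.exp(l - m) for l in logits))
        # Positives are every key whose label matches the query's label
        # (this always includes the query's own pair at index i).
        positives = [j for j, y in enumerate(labels) if y == labels[i]]
        total += -sum(logits[j] - log_denom for j in positives) / len(positives)
    return total / len(queries)


def unified_contrastive_loss(visual, text, labels, temperature=0.07):
    """Hypothetical sketch: symmetric visual-to-text and text-to-visual loss.

    visual, text: lists of equal-length embedding vectors, one pair per sample.
    labels: one integer per sample; equal labels mark additional positives.
    """
    v = [_normalize(x) for x in visual]
    t = [_normalize(x) for x in text]
    return 0.5 * (
        _one_direction(v, t, labels, temperature)
        + _one_direction(t, v, labels, temperature)
    )


if __name__ == "__main__":
    # Samples 0 and 2 share a class label (label-style data); sample 1 is on its own.
    visual = [[1.0, 0.0], [0.0, 1.0], [0.9, 0.1]]
    text = [[1.0, 0.0], [0.0, 1.0], [1.0, 0.0]]
    print(unified_contrastive_loss(visual, text, labels=[0, 1, 0]))
```

When every label is unique, this sketch reduces to the usual symmetric image-text contrastive (InfoNCE-style) loss; shared labels simply enlarge the set of positives, which is one plausible way to use supervised and noisily supervised data in a single objective.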

Comments:	To appear at NeurIPS 2022; camera-ready version with typos fixed
Subjects:	Computer Vision and Pattern Recognition (cs.CV)
Cite as:	arXiv:2209.07526 [cs.CV]
	(or arXiv:2209.07526v2 [cs.CV] for this version)
	https://doi.org/10.48550/arXiv.2209.07526

Submission history

From: Dongdong Chen
[v1] Thu, 15 Sep 2022 17:59:59 UTC (2,821 KB)
[v2] Wed, 19 Oct 2022 21:03:30 UTC (2,828 KB)
