Write and Paint: Generative Vision-Language Models are Unified Modal Learners

Diao, Shizhe; Zhou, Wangchunshu; Zhang, Xinsong; Wang, Jiawei

arXiv:2206.07699 (cs)

[Submitted on 15 Jun 2022 (v1), last revised 17 Feb 2023 (this version, v3)]

Abstract: Recent advances in vision-language pre-training have pushed the state of the art on various vision-language tasks, making machines more capable of multi-modal writing (image-to-text generation) and painting (text-to-image generation). However, few studies investigate whether these two essential capabilities can be learned together and boost each other, yielding a versatile and powerful multi-modal foundation model. In this work, we disclose the potential of symmetric generative vision-language pre-training in learning to write and paint concurrently, and propose a new unified modal model, named DaVinci, trained with prefix language modeling and prefix image modeling, a simple generative self-supervised objective on image-text pairs. Thanks to the proposed prefix multi-modal modeling framework, DaVinci is simple to train, scalable to huge data, adaptable to both writing and painting tasks, and also strong on other vision, text, and multi-modal understanding tasks. DaVinci achieves competitive performance on a wide range of 27 generation/understanding tasks and demonstrates the superiority of combining vision/language generative pre-training. Furthermore, we carefully benchmark the performance of different vision-language pre-training objectives across pre-training datasets of different scales with heterogeneous and broad distribution coverage. Our results demonstrate the potential of exploiting self-supervision in both language and vision inputs, and establish new, stronger baselines for future comparisons at different data scales. The code and pre-trained models are available at this https URL.

Comments:	ICLR 2023
Subjects:	Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL); Machine Learning (cs.LG)
Cite as:	arXiv:2206.07699 [cs.CV]
	(or arXiv:2206.07699v3 [cs.CV] for this version)
	https://doi.org/10.48550/arXiv.2206.07699

Submission history

From: Shizhe Diao
[v1] Wed, 15 Jun 2022 17:49:38 UTC (1,147 KB)
[v2] Thu, 16 Feb 2023 17:01:44 UTC (1,292 KB)
[v3] Fri, 17 Feb 2023 02:58:03 UTC (1,292 KB)
