M6-UFC: Unifying Multi-Modal Controls for Conditional Image Synthesis

Zhang, Zhu; Ma, Jianxin; Zhou, Chang; Men, Rui; Li, Zhikang; Ding, Ming; Tang, Jie; Zhou, Jingren; Yang, Hongxia

arXiv:2105.14211v3 (cs)

[Submitted on 29 May 2021 (v1), revised 26 Nov 2021 (this version, v3), latest version 19 Feb 2022 (v4)]

Abstract: Conditional image synthesis aims to create an image according to some multi-modal guidance in the form of textual descriptions, reference images, and image blocks to preserve, as well as their combinations. In this paper, instead of investigating these control signals separately, we propose a new two-stage architecture, M6-UFC, to unify any number of multi-modal controls. In M6-UFC, both the diverse control signals and the synthesized image are uniformly represented as a sequence of discrete tokens to be processed by a Transformer. Unlike existing two-stage autoregressive approaches such as DALL-E and VQGAN, M6-UFC adopts non-autoregressive generation (NAR) at the second stage to enhance the holistic consistency of the synthesized image, to support preserving specified image blocks, and to improve the synthesis speed. Further, we design a progressive algorithm that iteratively improves the non-autoregressively generated image, with the help of two estimators developed for evaluating the compliance with the controls and the fidelity of the synthesized image, respectively. Extensive experiments on a newly collected large-scale clothing dataset, M2C-Fashion, and a facial dataset, Multi-Modal CelebA-HQ, verify that M6-UFC can synthesize high-fidelity images that comply with flexible multi-modal controls.
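
The abstract mentions non-autoregressive generation over discrete tokens, preserved image blocks, and iterative refinement. The toy sketch below illustrates that general idea with a generic mask-predict-style loop; it is not the paper's algorithm. The names `MASK`, `toy_predictor`, and `progressive_decode` are hypothetical, the random predictor stands in for a Transformer, and the two estimators described in the abstract are not modeled (a random confidence score is used instead).

```python
import random

MASK = -1  # hypothetical sentinel for a token that is not yet decided


def toy_predictor(tokens, rng, vocab_size=16):
    """Stand-in for a Transformer: propose (token, confidence) per position."""
    return [(rng.randrange(vocab_size), rng.random()) for _ in tokens]


def progressive_decode(preserved, steps=4, seed=0):
    """Fill MASK positions over several rounds; preserved tokens are never changed."""
    rng = random.Random(seed)
    tokens = list(preserved)
    # Only positions that start out masked may ever be rewritten.
    free = [i for i, t in enumerate(tokens) if t == MASK]
    confidence = {i: 0.0 for i in free}
    for step in range(1, steps + 1):
        # Predict all currently masked positions in parallel.
        proposals = toy_predictor(tokens, rng)
        for i in free:
            if tokens[i] == MASK:
                tokens[i], confidence[i] = proposals[i]
        # Re-mask the least confident free positions; none on the last round.
        n_remask = len(free) * (steps - step) // steps
        for i in sorted(free, key=confidence.get)[:n_remask]:
            tokens[i] = MASK
    return tokens


if __name__ == "__main__":
    # Positions 0, 3, and 7 play the role of image blocks to preserve.
    start = [3, MASK, MASK, 7, MASK, MASK, MASK, 1]
    print(progressive_decode(start))
```

Because the number of re-masked positions shrinks to zero on the final round, every free position ends up filled, while the preserved positions keep their original tokens throughout.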

Comments:	Accepted by NeurIPS21
Subjects:	Computer Vision and Pattern Recognition (cs.CV)
Cite as:	arXiv:2105.14211 [cs.CV]
	(or arXiv:2105.14211v3 [cs.CV] for this version)
	https://doi.org/10.48550/arXiv.2105.14211

Submission history

From: Zhu Zhang
[v1] Sat, 29 May 2021 04:42:07 UTC (8,785 KB)
[v2] Wed, 18 Aug 2021 09:55:00 UTC (23,110 KB)
[v3] Fri, 26 Nov 2021 13:43:04 UTC (23,110 KB)
[v4] Sat, 19 Feb 2022 17:12:14 UTC (23,110 KB)
