Ranni: Taming Text-to-Image Diffusion for Accurate Instruction Following

Feng, Yutong; Gong, Biao; Chen, Di; Shen, Yujun; Liu, Yu; Zhou, Jingren

arXiv:2311.17002 (cs)

[Submitted on 28 Nov 2023 (v1), last revised 9 Apr 2024 (this version, v3)]

Abstract: Existing text-to-image (T2I) diffusion models usually struggle to interpret complex prompts, especially those involving quantity, object-attribute binding, and multi-subject descriptions. In this work, we introduce a semantic panel as middleware in decoding texts to images, helping the generator follow instructions more closely. The panel is obtained by arranging the visual concepts parsed from the input text with the aid of large language models, and is then injected into the denoising network as a detailed control signal to complement the text condition. To facilitate text-to-panel learning, we devise a carefully designed semantic formatting protocol, accompanied by a fully automatic data preparation pipeline. Thanks to this design, our approach, which we call Ranni, enhances the textual controllability of a pre-trained T2I generator. More importantly, the generative middleware brings a more convenient form of interaction (i.e., directly adjusting the elements in the panel or using language instructions) and further allows users to finely customize their generation. Building on this, we develop a practical system and showcase its potential in continuous generation and chatting-based editing. Our project page is at this https URL.
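
The idea of a semantic panel as an editable intermediate between text and image can be illustrated with a small data structure. The sketch below is not the paper's formatting protocol: the `PanelElement` class, its `description` and `box` fields, and the `count_elements` and `move_element` helpers are all hypothetical names chosen for illustration, and the panel is written by hand where the described system would obtain it from a large language model. It only shows how an explicit list of visual concepts makes quantity checkable and lets a user adjust one element directly.

```python
from dataclasses import dataclass, replace
from typing import List, Tuple

Box = Tuple[float, float, float, float]  # (x0, y0, x1, y1), normalized to [0, 1]


@dataclass(frozen=True)
class PanelElement:
    """One visual concept in a hypothetical semantic panel (illustrative fields only)."""
    description: str  # e.g. "red apple"; binds the attribute to the object
    box: Box          # where the concept should appear on the canvas


def count_elements(panel: List[PanelElement], description: str) -> int:
    """Count the panel elements whose description matches exactly."""
    return sum(1 for element in panel if element.description == description)


def move_element(panel: List[PanelElement], index: int,
                 dx: float, dy: float) -> List[PanelElement]:
    """Return a new panel with one element shifted, kept inside the unit canvas."""
    element = panel[index]
    x0, y0, x1, y1 = element.box
    # Clamp the shift so the box never leaves [0, 1] on either axis.
    dx = max(-x0, min(dx, 1.0 - x1))
    dy = max(-y0, min(dy, 1.0 - y1))
    moved = replace(element, box=(x0 + dx, y0 + dy, x1 + dx, y1 + dy))
    return panel[:index] + [moved] + panel[index + 1:]


# Hand-written panel for the prompt "two red apples and a green pear".
# In the described system this arrangement would come from an LLM, not from a person.
panel = [
    PanelElement("red apple", (0.0, 0.5, 0.25, 0.75)),
    PanelElement("red apple", (0.25, 0.5, 0.5, 0.75)),
    PanelElement("green pear", (0.75, 0.5, 1.0, 1.0)),
]

# Direct adjustment: shift the second apple a quarter of the canvas to the right.
edited = move_element(panel, 1, dx=0.25, dy=0.0)
```

Because the quantity ("two") and the attribute binding ("red" with "apple") live in the panel rather than only in the prompt, they can be inspected or edited before any image is generated. How the panel is encoded and injected into the denoising network is outside the scope of this sketch.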

Subjects:	Computer Vision and Pattern Recognition (cs.CV)
Cite as:	arXiv:2311.17002 [cs.CV]
	(or arXiv:2311.17002v3 [cs.CV] for this version)
	https://doi.org/10.48550/arXiv.2311.17002

Submission history

From: Yutong Feng
[v1] Tue, 28 Nov 2023 17:57:44 UTC (33,222 KB)
[v2] Thu, 30 Nov 2023 09:30:19 UTC (33,224 KB)
[v3] Tue, 9 Apr 2024 07:46:43 UTC (29,087 KB)
