Energy-Based Cross Attention for Bayesian Context Update in Text-to-Image Diffusion Models

Park, Geon Yeong; Kim, Jeongsol; Kim, Beomsu; Lee, Sang Wan; Ye, Jong Chul

arXiv:2306.09869v2 (cs)

[Submitted on 16 Jun 2023 (v1), revised 26 Jun 2023 (this version, v2), latest version 4 Nov 2023 (v3)]

Abstract: Despite the remarkable performance of text-to-image diffusion models in image generation tasks, recent studies have shown that generated images sometimes fail to capture the intended semantic content of the text prompts, a phenomenon often called semantic misalignment. To address this, we present a novel energy-based model (EBM) framework. Specifically, we first formulate EBMs of latent image representations and text embeddings in each cross-attention layer of the denoising autoencoder. We then obtain the gradient of the log posterior of the context vectors, which can be updated and transferred to the subsequent cross-attention layer, thereby implicitly minimizing a nested hierarchy of energy functions. Our latent EBMs further allow zero-shot compositional generation as a linear combination of cross-attention outputs from different contexts. Through extensive experiments, we demonstrate that the proposed method is highly effective in handling various image generation tasks, including multi-concept generation, text-guided image inpainting, and real and synthetic image editing.
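
The two mechanisms named in the abstract, a gradient-based update of the context vectors and a linear combination of cross-attention outputs, can be illustrated with a small pure-Python sketch. The abstract does not state the energy function, so the log-sum-exp energy below is an assumption chosen only because its gradient has the form of attention weights; it is not the authors' formulation. The sketch also treats the keys directly as the context vectors (no learned projections), and all function names (`energy`, `update_context`, `compose`) are hypothetical.

```python
import math


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def softmax(row):
    # Numerically stable softmax over one row of scores.
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def attention_weights(queries, keys):
    # Row i holds the softmax over context tokens j of q_i . k_j.
    return [softmax([dot(q, k) for k in keys]) for q in queries]


def cross_attention(queries, keys, values):
    # Standard (unscaled) cross-attention output: weights @ values.
    weights = attention_weights(queries, keys)
    dim = len(values[0])
    return [[sum(w * v[d] for w, v in zip(row, values)) for d in range(dim)]
            for row in weights]


def energy(queries, keys):
    # ASSUMED energy (not from the abstract):
    #   E(K; Q) = -sum_i logsumexp_j(q_i . k_j)
    total = 0.0
    for q in queries:
        scores = [dot(q, k) for k in keys]
        m = max(scores)
        total -= m + math.log(sum(math.exp(s - m) for s in scores))
    return total


def update_context(queries, keys, step=0.1):
    # One gradient-descent step on the assumed energy w.r.t. each key:
    #   -dE/dk_j = sum_i a_ij * q_i, with a_ij the attention weights.
    weights = attention_weights(queries, keys)
    updated = []
    for j, k in enumerate(keys):
        direction = [sum(weights[i][j] * queries[i][d]
                         for i in range(len(queries)))
                     for d in range(len(k))]
        updated.append([kd + step * g for kd, g in zip(k, direction)])
    return updated


def compose(outputs, coeffs):
    # Linear combination of cross-attention outputs from several contexts.
    rows, dim = len(outputs[0]), len(outputs[0][0])
    return [[sum(c * out[i][d] for c, out in zip(coeffs, outputs))
             for d in range(dim)]
            for i in range(rows)]
```

In this toy setting, calling `update_context` once and feeding the returned keys to the next `cross_attention` call mimics passing an updated context to a subsequent layer, and `compose([out_a, out_b], [0.5, 0.5])` blends the outputs obtained from two different contexts. The step size and coefficients are illustrative values, not settings reported by the authors.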

Comments:	Code: this https URL
Subjects:	Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)
Cite as:	arXiv:2306.09869 [cs.CV]
	(or arXiv:2306.09869v2 [cs.CV] for this version)
	https://doi.org/10.48550/arXiv.2306.09869

Submission history

From: Jong Chul Ye
[v1] Fri, 16 Jun 2023 14:30:41 UTC (9,432 KB)
[v2] Mon, 26 Jun 2023 01:03:07 UTC (9,432 KB)
[v3] Sat, 4 Nov 2023 18:18:10 UTC (12,551 KB)
