
Versatile Diffusion

Hugging Face Space | Framework: PyTorch | License: MIT

This repo hosts the official implementation of:

Xingqian Xu, Atlas Wang, Eric Zhang, Kai Wang, and Humphrey Shi, Versatile Diffusion: Text, Images and Variations All in One Diffusion Model, arXiv:2211.08332.

News

  • [2022.11.16]: Our demo is up and running on 🤗HuggingFace!
  • [2022.11.14]: Part of our evaluation code and models are released!
  • [2022.11.12]: Repo initiated

Introduction

We built Versatile Diffusion (VD), the first unified multi-flow multimodal diffusion framework, as a step towards Universal Generative AI. Versatile Diffusion can natively support image-to-text, image-variation, text-to-image, and text-variation, and can be further extended to other applications such as semantic-style disentanglement, image-text dual-guided generation, latent image-to-text-to-image editing, and more. Future versions will support more modalities such as speech, music, video and 3D.

Network and Framework

A single flow of Versatile Diffusion contains a VAE, a diffuser, and a context encoder, and thus handles one task (e.g., text-to-image) under one data type (e.g., image) and one context type (e.g., text). The multi-flow structure of Versatile Diffusion is shown in the following diagram:

Based on Versatile Diffusion, we further propose a generalized multi-flow multimodal framework with VAEs, context encoders, and diffusers containing three layer types (i.e., global, data, and context layers). To bring a new multimodal task into this framework, we set out the following requirements:

  • The core diffuser should contain shared global layers and swappable data and context layers that are activated according to the data and context types.
  • The choice of VAEs should smoothly map data onto highly interpretable latent spaces.
  • The choice of context encoders should jointly minimize the cross-modal statistical distance on all supported content types.
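
To make the layer-routing idea concrete, here is a minimal, hypothetical sketch of shared global layers plus swappable data/context layers selected per flow. The class and attribute names (MultiFlowDiffuser, data_layers, context_layers) are illustrative assumptions, not the actual VD code:

import torch
import torch.nn as nn

class MultiFlowDiffuser(nn.Module):
    """Illustrative only: shared global layers plus swappable data/context
    layers, activated according to the data type and context type of a flow."""

    def __init__(self, dim=320):
        super().__init__()
        # Global layers are shared across every flow (every task).
        self.global_layers = nn.ModuleList(nn.Linear(dim, dim) for _ in range(2))
        # Data layers are selected by the data type (image latents vs. text latents).
        self.data_layers = nn.ModuleDict({"image": nn.Linear(dim, dim),
                                          "text": nn.Linear(dim, dim)})
        # Context layers are selected by the context type (stand-in for cross-attention).
        self.context_layers = nn.ModuleDict({"image": nn.Linear(dim, dim),
                                             "text": nn.Linear(dim, dim)})

    def forward(self, x, context, data_type, context_type):
        # One flow = one (data type, context type) pair, e.g. text-to-image.
        for layer in self.global_layers:
            x = layer(x)
        x = self.data_layers[data_type](x)
        return x + self.context_layers[context_type](context)

# Text-to-image flow: image latents conditioned on a text embedding.
diffuser = MultiFlowDiffuser()
x, ctx = torch.randn(1, 320), torch.randn(1, 320)
out = diffuser(x, ctx, data_type="image", context_type="text")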

Performance

Data

We use Laion2B-en with customized data filters as our main dataset. Since Laion2B is very large and a typical training run covers less than one epoch, the complete dataset usually does not need to be downloaded; the same applies to training VD.

Directory layout of Laion2B expected by our code:

├── data
│   └── laion2b
│       └── data
│           └── 00000.tar
│           └── 00000.parquet
│           └── 00000_stats.json
│           └── 00001.tar
│           └── ...

These compressed data shards are generated with the img2dataset API (see its official GitHub repository).
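
For reference, here is a minimal sketch of producing such shards with the img2dataset Python API. The metadata file name, column names, and settings below are assumptions for illustration; consult the img2dataset documentation for the exact Laion2B-en configuration and filters.

from img2dataset import download

# Hypothetical metadata file and settings; adjust to the actual Laion2B-en
# parquet release and your own filters before running.
download(
    url_list="laion2b-en-metadata.parquet",
    input_format="parquet",
    url_col="URL",
    caption_col="TEXT",
    output_format="webdataset",   # yields the .tar / .parquet / _stats.json shards above
    output_folder="data/laion2b/data",
    image_size=256,
    processes_count=16,
    thread_count=64,
)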

Setup

conda create -n versatile-diffusion python=3.8
conda activate versatile-diffusion
conda install pytorch==1.12.1 torchvision==0.13.1 -c pytorch
pip install -r requirement.txt

Pretrained models

All useful pretrained models can be downloaded from this link. The pretrained folder should include the following files:

├── pretrained
│   └── kl-f8.pth
│   └── optimus-vae.pth
│   └── sd-v1-4.pth
│   └── sd-variation-ema.pth
│   └── vd-dc.pth
│   └── vd-official.pth
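
The checkpoints are ordinary PyTorch files, and the evaluation configs load them for you. As an optional sanity check after downloading, a minimal sketch (the choice of vd-official.pth and the state-dict handling are illustrative assumptions):

import torch

# Illustrative only: load a downloaded checkpoint on CPU and count its entries.
ckpt = torch.load("pretrained/vd-official.pth", map_location="cpu")
# Checkpoints may be a plain state dict or a wrapper dict holding one.
if isinstance(ckpt, dict) and "state_dict" in ckpt:
    ckpt = ckpt["state_dict"]
print(f"loaded {len(ckpt)} entries")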

Evaluation

Here are the one-line shell commands to evaluate SD baselines with multiple GPUs.

python main.py --config sd_eval --gpu 0 1 2 3 4 5 6 7 --eval 99999
python main.py --config sd_variation_eval --gpu 0 1 2 3 4 5 6 7 --eval 99999

Here are the one-line shell commands to evaluate VD models with multiple GPUs.

python main.py --config vd_dc_eval --gpu 0 1 2 3 4 5 6 7 --eval 99999
python main.py --config vd_official_eval --gpu 0 1 2 3 4 5 6 7 --eval 99999

All corresponding evaluation configs can be found in ./configs/experiment. The configs contain useful information; you can easily customize them and run your own batched evaluations.

For the commands above, you also need to (a short directory-setup sketch follows this list):

  • Create ./pretrained and move all downloaded pretrained models into it.
  • Create ./log/sd_nodataset/99999_eval for baseline evaluations on Stable Diffusion.
  • Create ./log/vd_nodataset/99999_eval for evaluations on Versatile Diffusion.
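
As a convenience, the following minimal Python sketch creates these folders (equivalent to a few mkdir -p calls); the paths simply mirror the list above.

from pathlib import Path

# Create the folders expected by the evaluation commands above.
for d in ["pretrained", "log/sd_nodataset/99999_eval", "log/vd_nodataset/99999_eval"]:
    Path(d).mkdir(parents=True, exist_ok=True)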

Training

Coming soon

Citation

@article{xu2022versatile,
	title        = {Versatile Diffusion: Text, Images and Variations All in One Diffusion Model},
	author       = {Xu, Xingqian and Wang, Zhangyang and Zhang, Eric and Wang, Kai and Shi, Humphrey},
	year         = 2022,
	url          = {https://arxiv.org/abs/2211.08332},
	eprint       = {2211.08332},
	archiveprefix = {arXiv},
	primaryclass = {cs.CV}
}

Acknowledgement

Part of the code reorganizes/reimplements code from the LDM official GitHub repository, which in turn originated from the DDPM official GitHub repository.
