
Structured Denoising Diffusion Models in Discrete State-Spaces

Getting Started

See annotated_diffusion.ipynb for a walkthrough of the forward diffusion process.

See annotated_traning.ipynb for a detailed guide to training a model and generating samples from it.

After training on fashion_mnist for 20 epochs, here are a few example GIFs of the image generation process:

Read these papers to learn more:

Structured Denoising Diffusion Models in Discrete State-Spaces

Understanding Diffusion Models: A Unified Perspective

Variational Autoencoder (VAE)

Parameters

  • Joint distribution: $p(x,z)$

  • Posterior: $q_\phi(z|x)$

Evidence Lower Bound

$$ \begin{align*} \log p(x) &\geq \mathbb{E}_{q_\phi(z|x)}[\log \frac{p(x,z)}{q_\phi(z|x)}] \\ &= \mathbb{E}_{q_\phi(z|x)}[\log p_\theta(x|z)] - D_{KL}(q_\phi(z|x) \ || \ p(z)) \end{align*} $$

  • $\mathbb{E}_{q_\phi(z|x)}[\log p_\theta(x|z)]$ measures the reconstruction likelihood of the decoder from the variational distribution. (Monte Carlo estimate)

  • $D_{KL}(q_\phi(z|x) \ || \ p(z))$ measures how similar the learned variational distribution is to a prior belief held over latent variables. (Analytical calculation)

Objective

$$ \begin{align*} \arg\max_{\phi,\theta} \mathbb{E}_{q_\phi(z|x)}[\log p_\theta(x|z)] - D_{KL}(q_\phi(z|x) \ || \ p(z)) \\ \approx \arg\max_{\phi,\theta} \sum^L_{l=1}[\log p_\theta(x|z^{(l)})] - D_{KL}(q_\phi(z|x) \ || \ p(z)) \end{align*} $$

where latents $\{ z^{(l)} \}^L_{l=1}$ are sampled from $q_\phi(z|x)$.
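
As a concrete illustration, here is a minimal PyTorch sketch of this objective for a diagonal-Gaussian encoder $q_\phi(z|x)$ with a standard Gaussian prior. The `encoder`/`decoder` callables and the Bernoulli (binary cross-entropy) reconstruction likelihood are illustrative assumptions, not part of this repository:

```python
import torch
import torch.nn.functional as F

def vae_loss(encoder, decoder, x, num_samples=1):
    """Negative ELBO: -(E_q[log p(x|z)] - KL(q(z|x) || p(z))).

    Assumes `encoder(x)` returns (mu, log_var) of q(z|x) and `decoder(z)`
    returns Bernoulli logits over pixels -- illustrative choices.
    """
    mu, log_var = encoder(x)

    # Monte Carlo estimate of the reconstruction term with L samples,
    # using the reparameterization z = mu + sigma * eps.
    recon = 0.0
    for _ in range(num_samples):
        eps = torch.randn_like(mu)
        z = mu + torch.exp(0.5 * log_var) * eps
        logits = decoder(z)
        recon = recon - F.binary_cross_entropy_with_logits(logits, x, reduction="sum")
    recon = recon / num_samples

    # KL(q(z|x) || N(0, I)) in closed form (analytical calculation).
    kl = 0.5 * torch.sum(torch.exp(log_var) + mu**2 - 1.0 - log_var)

    return -(recon - kl)  # minimize the negative ELBO
```

In practice a single sample ($L=1$) per datapoint is commonly used, since the reparameterized gradient estimate is already usable at that sample size.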

Hierarchical Variational Autoencoder (HVAE)

Stacking VAEs on top of each other.

Parameters

  • Joint distribution: $p(x,z_{1:T}) = p(z_T)p_\theta(x|z_1)\prod^T_{t=2}p_\theta(z_{t-1}|z_t)$

  • Posterior: $q_\phi(z_{1:T}|x) = q_\phi(z_1|x)\prod^T_{t=2}q_\phi(z_t|z_{t-1})$

Evidence Lower Bound

$$ \begin{align*} \log p(x) &\geq \mathbb{E}_{q_\phi(z_{1:T}|x)}[\log \frac{p(x,z_{1:T})}{q_\phi(z_{1:T}|x)}] \\ &= \mathbb{E}_{q_\phi(z_{1:T}|x)}[\log \frac{p(z_T)p_\theta(x|z_1)\prod^T_{t=2}p_\theta(z_{t-1}|z_t)}{q_\phi(z_1|x)\prod^T_{t=2}q_\phi(z_t|z_{t-1})}] \end{align*} $$

Objective

Similar to VAE.
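
To make the "stacked VAEs" picture concrete, here is a minimal sketch of ancestral sampling from the joint factorization above: draw $z_T$ from the standard Gaussian prior and decode downward until $x$. The per-step `decoders` returning means of unit-variance Gaussians, and the `obs_decoder` interface, are illustrative assumptions, not this repository's API:

```python
import torch

@torch.no_grad()
def hvae_sample(prior_dim, decoders, obs_decoder, T):
    """Ancestral sampling from p(x, z_{1:T}) = p(z_T) p_theta(x|z_1) prod_t p_theta(z_{t-1}|z_t).

    `decoders[t]` maps z_t to the mean of a unit-variance Gaussian over z_{t-1},
    and `obs_decoder` maps z_1 to the mean over x -- both illustrative assumptions.
    """
    z = torch.randn(prior_dim)               # z_T ~ p(z_T) = N(0, I)
    for t in range(T, 1, -1):                # t = T, ..., 2
        mean = decoders[t](z)
        z = mean + torch.randn_like(mean)    # z_{t-1} ~ p_theta(z_{t-1} | z_t)
    return obs_decoder(z)                    # mean of p_theta(x | z_1)
```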

Variational Diffusion Models (VDM)

HVAE but with three key restrictions:

  • The latent dimension is exactly equal to the data dimension $\implies q_\phi(z_{1:T}|x)= q(z_{1:T}|x_0) = \prod^T_{t=1}q(x_t|x_{t-1})$

  • The structure of the latent encoder at each timestep is not learned; it is pre-defined as a linear Gaussian model $\implies$ The latent encoder is a Gaussian distribution centered around the output of the previous timestep $\implies q(x_t|x_{t-1}) = \mathcal{N}(x_t;\sqrt{\alpha_t}x_{t-1}, (1-\alpha_t)\pmb{I})$

  • The Gaussian parameters of the latent encoders vary over time in such a way that the distribution of the latent at final timestep T is a standard Gaussian $\implies p(x_T)=\mathcal{N}(x_T;0,\pmb{I})$, which is pure noise
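
A minimal sketch of the second restriction: one forward noising step, plus the closed-form jump $q(x_t|x_0)$ obtained by composing the linear Gaussian transitions (the $\bar\alpha_t = \prod^t_{s=1}\alpha_s$ shortcut is the standard DDPM identity, assumed here rather than derived in this README):

```python
import torch

def q_sample_step(x_prev, alpha_t):
    """One forward step: q(x_t | x_{t-1}) = N(x_t; sqrt(alpha_t) x_{t-1}, (1 - alpha_t) I)."""
    noise = torch.randn_like(x_prev)
    return (alpha_t ** 0.5) * x_prev + ((1.0 - alpha_t) ** 0.5) * noise

def q_sample(x0, t, alphas_cumprod):
    """Jump straight to timestep t: q(x_t | x_0) = N(x_t; sqrt(alpha_bar_t) x_0, (1 - alpha_bar_t) I),
    where alpha_bar_t = prod_{s<=t} alpha_s (standard DDPM identity, assumed here)."""
    alpha_bar_t = alphas_cumprod[t]
    noise = torch.randn_like(x0)
    return (alpha_bar_t ** 0.5) * x0 + ((1.0 - alpha_bar_t) ** 0.5) * noise
```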

Parameters

  • Joint distribution: $p(x_{0:T}) = p(x_T)\prod^T_{t=1}p_\theta(x_{t-1}|x_t)$

  • Posterior: $q(z_{1:T}|x_0) = \prod^T_{t=1}q(x_t|x_{t-1})$

Evidence Lower Bound

$$ \begin{align*} \log p(x) \geq \ &\mathbb{E}_{q(x_1|x_0)}[\log p_\theta(x_0|x_1)] - D_{KL}(q(x_T|x_0) \ || \ p(x_T)) \\ &- \sum^T_{t=2} \mathbb{E}_{q(x_t|x_0)}[D_{KL}(q(x_{t-1}|x_t,x_0) \ || \ p_\theta(x_{t-1}|x_t))] \end{align*} $$

  • The first term measures the reconstruction likelihood of the decoder from the variational distribution. (Monte Carlo estimate)

  • The second term measures how close the distribution of the final noisified input is to the standard Gaussian prior.

Note that it has no trainable parameters, and is also equal to zero under the assumptions.

  • The third term is for denoising matching. We learn the desired denoising transition step $p_\theta(x_{t-1}|x_t)$ as an approximation to the tractable, ground-truth denoising transition step $q(x_{t-1}|x_t, x_0)$.

Note that when $T=1$, VDM's ELBO falls back into VAE's.

Note that the denoising matching term dominates the overall optimization cost because of the summation term.

Objective

To learn a neural network that predicts the original ground-truth image from an arbitrarily noisified version of it, minimize the summation term of the derived ELBO objective across all noise levels, which can be approximated by minimizing the expectation over all timesteps:

$$ \arg\min_\theta \mathbb{E}_{t \sim U\{2,T\}}[\mathbb{E}_{q(x_t|x_0)}[D_{KL}(q(x_{t-1}|x_t,x_0) \ || \ p_\theta(x_{t-1}|x_t))]] $$

which can be optimized using stochastic samples over timesteps.
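
A hedged sketch of one such stochastic optimization step: draw $t \sim U\{2,T\}$, noisify $x_0$ to $x_t$ via $q(x_t|x_0)$, and minimize a per-step surrogate. Under the Gaussian assumptions the KL term reduces (up to a weighting) to a mean-squared error on the predicted noise; that simplification, and the `denoise_model(x_t, t)` interface, are standard DDPM choices assumed here, not necessarily this repository's:

```python
import torch
import torch.nn.functional as F

def diffusion_training_step(denoise_model, x0, alphas_cumprod, T):
    """One stochastic estimate of the denoising matching objective.

    Assumes `alphas_cumprod` is a length-T tensor of alpha_bar values (0-indexed)
    and `denoise_model(x_t, t)` predicts the noise added to x_0 -- illustrative names.
    """
    batch = x0.shape[0]
    t = torch.randint(1, T, (batch,), device=x0.device)    # 0-indexed, i.e. t ~ U{2, T}
    noise = torch.randn_like(x0)

    # Sample x_t ~ q(x_t | x_0) in closed form (see the forward-process sketch above).
    alpha_bar = alphas_cumprod[t].view(-1, *([1] * (x0.dim() - 1)))
    x_t = alpha_bar.sqrt() * x0 + (1.0 - alpha_bar).sqrt() * noise

    # Weighted-MSE surrogate for the KL between q(x_{t-1}|x_t,x_0) and p_theta(x_{t-1}|x_t).
    return F.mse_loss(denoise_model(x_t, t), noise)
```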

To generate a novel $x_0$, sample Gaussian noise from $p(x_T)$ and iteratively run the denoising transitions $p_\theta(x_{t-1} | x_t)$ for $T$ steps.
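
A minimal sketch of that generation loop, assuming the learned transition $p_\theta(x_{t-1}|x_t)$ is parameterized as a Gaussian whose mean and variance come from a model call (`model_mean_fn` is an illustrative name, not this repository's API):

```python
import torch

@torch.no_grad()
def sample(model_mean_fn, shape, T, device="cpu"):
    """Generate x_0 by sampling x_T ~ p(x_T) = N(0, I) and iteratively
    running the learned denoising transitions p_theta(x_{t-1} | x_t).

    Assumes `model_mean_fn(x_t, t)` returns the mean and variance of the
    Gaussian p_theta(x_{t-1} | x_t) -- an illustrative interface.
    """
    x = torch.randn(shape, device=device)            # x_T ~ N(0, I), pure noise
    for t in reversed(range(1, T + 1)):              # t = T, ..., 1
        mean, var = model_mean_fn(x, t)
        noise = torch.randn_like(x) if t > 1 else torch.zeros_like(x)
        x = mean + var.sqrt() * noise                # x_{t-1} ~ p_theta(x_{t-1} | x_t)
    return x
```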
