Overview
Diffusion models generate images through a sequence of incremental updates; assembling the whole chain of states gives us an encoder-decoder structure. The transition from one state to another is realized by a denoiser.
The model with this structure is called the variational diffusion model (VDM). The model has a sequence of states $x_0, x_1, \dots, x_T$:
- $x_0$: the original image, which plays the same role as the input in a VAE
- $x_T$: the final latent variable, analogous to the latent in a VAE, and we want to make it white Gaussian: $x_T \sim \mathcal{N}(0, I)$
- $x_1, \dots, x_{T-1}$: the intermediate states, which are also latent variables, but not white Gaussian
Building Blocks
Transition Block
The $t$-th transition block consists of three states $x_{t-1}$, $x_t$, and $x_{t+1}$. There are two possible paths to get to state $x_t$:
- The forward transition that goes from $x_{t-1}$ to $x_t$. The associated transition distribution is $p(x_t|x_{t-1})$. That is, we can sample $x_t$ from $p(x_t|x_{t-1})$ for a given $x_{t-1}$. Just like in VAE, we do not have access to $p(x_t|x_{t-1})$, so we will approximate it by a Gaussian $q(x_t|x_{t-1})$.
- The reverse transition goes from $x_{t+1}$ to $x_t$. Again, we never know $p(x_t|x_{t+1})$, so we use another Gaussian $p_{\theta}(x_t|x_{t+1})$ to approximate it.
Initial Block
Just a transition block without the forward path: there is no state before $x_0$, so only the reverse transition $p_{\theta}(x_0|x_1)$ is involved.
Final Block
Just a transition block without the reverse path: there is no state after $x_T$, so only the forward transition $q(x_T|x_{T-1})$ is involved, and the final variable should be white Gaussian, $x_T \sim \mathcal{N}(0, I)$.
Transition Distribution
In a denoising diffusion probabilistic model (DDPM), the transition distribution $q(x_t|x_{t-1})$ is defined (yes, this is defined, not derived by training) as
$$ q(x_t|x_{t-1}) = \mathcal{N}\big(x_t;\ \sqrt{\alpha_t}\,x_{t-1},\ (1-\alpha_t)\, I\big). $$
That is, the transition distribution is a Gaussian with mean $\sqrt{\alpha_t}\,x_{t-1}$ and variance $(1-\alpha_t)I$. The choice of the scaling factor $\sqrt{\alpha_t}$ is to make sure that the variance magnitude is preserved, so that it will neither explode nor vanish after many iterations.
Tip
$\sqrt{\alpha_t}$ and $1-\alpha_t$ are chosen so that the distribution of $x_t$ will become $\mathcal{N}(0, I)$ when $t$ is large enough.
From this transition distribution, we can obtain how $x_t$ will be distributed if we are given $x_0$:
$$ q(x_t|x_0) = \mathcal{N}\big(x_t;\ \sqrt{\bar{\alpha}_t}\,x_0,\ (1-\bar{\alpha}_t)\, I\big), $$
where $\bar{\alpha}_t \stackrel{\text{def}}{=} \prod_{i=1}^{t} \alpha_i$.
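As a concrete illustration, here is a minimal PyTorch sketch of this one-shot forward sampling; the linear schedule for $\alpha_t$ is an assumption for the example, not something prescribed by the text:

```python
import torch

T = 1000
beta = torch.linspace(1e-4, 0.02, T + 1)   # assumed linear schedule, indexed 0..T
beta[0] = 0.0                              # dummy entry so that indices 1..T are used
alpha = 1.0 - beta
alpha_bar = torch.cumprod(alpha, dim=0)    # alpha_bar[t] = prod_{i<=t} alpha_i; alpha_bar[0] = 1

def q_sample(x0, t):
    """Draw x_t ~ q(x_t | x_0) = N(sqrt(abar_t) x0, (1 - abar_t) I) in one shot."""
    eps = torch.randn_like(x0)
    return alpha_bar[t].sqrt() * x0 + (1 - alpha_bar[t]).sqrt() * eps
```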
Evidence Lower Bound
Like in VAE, we need to maximize the ELBO to train the model. The ELBO in the variational diffusion model is
$$ \text{ELBO} = \underbrace{\mathbb{E}_{q(x_1|x_0)}\big[\log p_{\theta}(x_0|x_1)\big]}_{\text{reconstruction}} - \underbrace{\mathbb{E}_{q(x_{T-1}|x_0)}\Big[D_{\text{KL}}\big(q(x_T|x_{T-1})\,\|\,p(x_T)\big)\Big]}_{\text{prior matching}} - \underbrace{\sum_{t=1}^{T-1} \mathbb{E}_{q(x_{t-1}, x_{t+1}|x_0)}\Big[D_{\text{KL}}\big(q(x_t|x_{t-1})\,\|\,p_{\theta}(x_t|x_{t+1})\big)\Big]}_{\text{consistency}}. $$
The ELBO here consists of three components:
- Reconstruction. This term is based on the initial block. We use the log-likelihood $\log p_{\theta}(x_0|x_1)$ to measure how well the neural network associated with $p_{\theta}(x_0|x_1)$ can recover $x_0$ from the latent variable $x_1$. The expectation is taken with respect to the samples $x_1$ drawn from $q(x_1|x_0)$ during the encoding process.
- Prior Matching. The prior matching term is based on the final block. We use the KL divergence to measure the difference between $q(x_T|x_{T-1})$ and $p(x_T)$. We want how $x_T$ is generated, $q(x_T|x_{T-1})$, to be as close to our desired white Gaussian, namely $p(x_T) = \mathcal{N}(0, I)$, as possible. The samples here are $x_{T-1}$, drawn from $q(x_{T-1}|x_0)$, because this distribution provides the forward sample generation process.
- Consistency. The consistency term is based on the transition blocks, which uses the KL divergence to measure the deviation between the forward path $q(x_t|x_{t-1})$ and the reverse path $p_{\theta}(x_t|x_{t+1})$.
Proof of ELBO
Recall by the definition of ELBO, we have
$$ \log p(x) \geq \mathbb{E}_{q(z|x)}\left[\log \frac{p(x, z)}{q(z|x)}\right] \stackrel{\text{def}}{=} \text{ELBO}. $$
This means that the ELBO is a lower bound on the evidence $\log p(x)$, and now we try to estimate it. Let's define the notation: $x_{s:t}$ means the collection of all state variables from $x_s$ to $x_t$. We recall that $p(x_0)$ is the distribution of the original image $x_0$, so
$$ \log p(x_0) = \log \int p(x_{0:T})\, dx_{1:T} = \log \mathbb{E}_{q(x_{1:T}|x_0)}\left[\frac{p(x_{0:T})}{q(x_{1:T}|x_0)}\right]. $$
With Jensen's inequality, which states that for any random variable $X$ and any concave function $f$ it holds that $f(\mathbb{E}[X]) \geq \mathbb{E}[f(X)]$, we have
$$ \log p(x_0) \geq \mathbb{E}_{q(x_{1:T}|x_0)}\left[\log \frac{p(x_{0:T})}{q(x_{1:T}|x_0)}\right] = \mathbb{E}_{q(x_{1:T}|x_0)}\left[\log \frac{p(x_T)\, p_{\theta}(x_0|x_1)}{q(x_T|x_{T-1})}\right] + \mathbb{E}_{q(x_{1:T}|x_0)}\left[\log \prod_{t=1}^{T-1} \frac{p_{\theta}(x_t|x_{t+1})}{q(x_t|x_{t-1})}\right], $$
where the first term can be decomposed into two parts:
$$ \mathbb{E}_{q(x_{1:T}|x_0)}\left[\log \frac{p(x_T)\, p_{\theta}(x_0|x_1)}{q(x_T|x_{T-1})}\right] = \underbrace{\mathbb{E}_{q(x_1|x_0)}\big[\log p_{\theta}(x_0|x_1)\big]}_{\text{reconstruction}} - \underbrace{\mathbb{E}_{q(x_{T-1}|x_0)}\Big[D_{\text{KL}}\big(q(x_T|x_{T-1})\,\|\,p(x_T)\big)\Big]}_{\text{prior matching}}, $$
and the second term is
$$ \mathbb{E}_{q(x_{1:T}|x_0)}\left[\log \prod_{t=1}^{T-1} \frac{p_{\theta}(x_t|x_{t+1})}{q(x_t|x_{t-1})}\right] = -\sum_{t=1}^{T-1} \mathbb{E}_{q(x_{t-1}, x_{t+1}|x_0)}\Big[D_{\text{KL}}\big(q(x_t|x_{t-1})\,\|\,p_{\theta}(x_t|x_{t+1})\big)\Big], $$
where the last line marginalized each expectation over the states that do not appear inside the logarithm and then applied the definition of the KL divergence.
Rewrite the Consistency Term
If we want to maximize the ELBO above, we need to handle two opposite directions: the forward $q(x_t|x_{t-1})$ and the reverse $p_{\theta}(x_t|x_{t+1})$ in the consistency and prior matching terms. However, we could use Bayes' Theorem to avoid this:
$$ q(x_t|x_{t-1}) = q(x_t|x_{t-1}, x_0) = \frac{q(x_{t-1}|x_t, x_0)\, q(x_t|x_0)}{q(x_{t-1}|x_0)}. $$
The "$q(x_t|x_{t-1}) = q(x_t|x_{t-1}, x_0)$" step conditions the probability on $x_0$; it holds because the forward process is Markovian, so given $x_{t-1}$ the state $x_t$ does not depend on $x_0$. By introducing this conditioning on $x_0$, we aim to incorporate information about the full data trajectory when estimating $q(x_{t-1}|x_t)$. For diffusion models, the reverse process can now be estimated with a noise predictor that approximates the distribution of $x_{t-1}$ given $x_t$ and $x_0$. The ELBO maximization will then involve minimizing the KL divergence between the learned model $p_{\theta}(x_{t-1}|x_t)$ and the posterior $q(x_{t-1}|x_t, x_0)$; that is the new consistency term.
To be specific, starting from the Jensen inequality in the section Proof of ELBO above:
$$ \log p(x_0) \geq \mathbb{E}_{q(x_{1:T}|x_0)}\left[\log \frac{p(x_T)\prod_{t=1}^{T} p_{\theta}(x_{t-1}|x_t)}{\prod_{t=1}^{T} q(x_t|x_{t-1})}\right] = \mathbb{E}_{q(x_{1:T}|x_0)}\left[\log \frac{p(x_T)\, p_{\theta}(x_0|x_1)}{q(x_1|x_0)} + \log \prod_{t=2}^{T} \frac{p_{\theta}(x_{t-1}|x_t)}{q(x_t|x_{t-1})}\right]. $$
And now we can apply Bayes' Theorem in the second term:
$$ \prod_{t=2}^{T} \frac{p_{\theta}(x_{t-1}|x_t)}{q(x_t|x_{t-1})} = \prod_{t=2}^{T} \frac{p_{\theta}(x_{t-1}|x_t)\, q(x_{t-1}|x_0)}{q(x_{t-1}|x_t, x_0)\, q(x_t|x_0)} = \frac{q(x_1|x_0)}{q(x_T|x_0)} \prod_{t=2}^{T} \frac{p_{\theta}(x_{t-1}|x_t)}{q(x_{t-1}|x_t, x_0)}, $$
since the ratio $\prod_{t=2}^{T} q(x_{t-1}|x_0)/q(x_t|x_0)$ telescopes. Then we could continue the inequality:
$$ \log p(x_0) \geq \mathbb{E}_{q(x_{1:T}|x_0)}\left[\log \frac{p(x_T)\, p_{\theta}(x_0|x_1)}{q(x_T|x_0)} + \sum_{t=2}^{T} \log \frac{p_{\theta}(x_{t-1}|x_t)}{q(x_{t-1}|x_t, x_0)}\right]. $$
Now we have the new ELBO, where
$$ \text{ELBO} = \underbrace{\mathbb{E}_{q(x_1|x_0)}\big[\log p_{\theta}(x_0|x_1)\big]}_{\text{reconstruction}} - \underbrace{D_{\text{KL}}\big(q(x_T|x_0)\,\|\,p(x_T)\big)}_{\text{prior matching}}, $$
and
$$ -\underbrace{\sum_{t=2}^{T} \mathbb{E}_{q(x_t|x_0)}\Big[D_{\text{KL}}\big(q(x_{t-1}|x_t, x_0)\,\|\,p_{\theta}(x_{t-1}|x_t)\big)\Big]}_{\text{consistency}}. $$
To conclude
- Reconstruction. The new reconstruction term is the same as before. We are still maximizing the log-likelihood.
- Prior Matching. The new prior matching is simplified to the KL divergence between $q(x_T|x_0)$ and $p(x_T)$. The change is due to the fact that we now condition upon $x_0$. Thus, there is no need to draw samples $x_{T-1} \sim q(x_{T-1}|x_0)$ and take the expectation.
- Consistency. The new consistency term is different from the previous one in two ways. Firstly, the running index $t$ starts at $2$ and ends at $T$. Previously it was from $1$ to $T-1$. Accompanied with this is the distribution matching, which is now between $q(x_{t-1}|x_t, x_0)$ and $p_{\theta}(x_{t-1}|x_t)$. So, instead of asking a forward transition to match with a reverse transition, we use $x_0$ to construct a reverse transition $q(x_{t-1}|x_t, x_0)$ and use it to match with $p_{\theta}(x_{t-1}|x_t)$.
Derivation of the transition distribution given the initial state
Recall the definition of $q(x_t|x_0)$:
$$ q(x_t|x_0) = \mathcal{N}\big(x_t;\ \sqrt{\bar{\alpha}_t}\,x_0,\ (1-\bar{\alpha}_t)\, I\big), $$
and the transition distribution
$$ q(x_t|x_{t-1}) = \mathcal{N}\big(x_t;\ \sqrt{\alpha_t}\,x_{t-1},\ (1-\alpha_t)\, I\big), $$
with how we got $q(x_{t-1}|x_t, x_0)$ via Bayes' Theorem:
$$ q(x_{t-1}|x_t, x_0) = \frac{q(x_t|x_{t-1}, x_0)\, q(x_{t-1}|x_0)}{q(x_t|x_0)} = \frac{q(x_t|x_{t-1})\, q(x_{t-1}|x_0)}{q(x_t|x_0)}. $$
Then we get
$$ q(x_{t-1}|x_t, x_0) = \frac{\mathcal{N}\big(x_t;\ \sqrt{\alpha_t}\,x_{t-1},\ (1-\alpha_t)I\big)\ \mathcal{N}\big(x_{t-1};\ \sqrt{\bar{\alpha}_{t-1}}\,x_0,\ (1-\bar{\alpha}_{t-1})I\big)}{\mathcal{N}\big(x_t;\ \sqrt{\bar{\alpha}_t}\,x_0,\ (1-\bar{\alpha}_t)I\big)}. $$
Therefore
$$ q(x_{t-1}|x_t, x_0) \propto \exp\left\{-\frac{\|x_t - \sqrt{\alpha_t}\,x_{t-1}\|^2}{2(1-\alpha_t)} - \frac{\|x_{t-1} - \sqrt{\bar{\alpha}_{t-1}}\,x_0\|^2}{2(1-\bar{\alpha}_{t-1})}\right\}, $$
where the denominator is absorbed into the proportionality constant since it does not depend on $x_{t-1}$.
We know the product of Gaussians must also be a Gaussian, thus we just need to find the mean and variance of the product. For simplicity, treat the vectors as scalars. Consider the following mapping:
$$ f(x_{t-1}) \stackrel{\text{def}}{=} \frac{(x_t - \sqrt{\alpha_t}\,x_{t-1})^2}{2(1-\alpha_t)} + \frac{(x_{t-1} - \sqrt{\bar{\alpha}_{t-1}}\,x_0)^2}{2(1-\bar{\alpha}_{t-1})}, $$
so that $q(x_{t-1}|x_t, x_0) \propto \exp\{-f(x_{t-1})\}$. Then we need to analyze the function $f$.
The minimizer of $f$ is the mean of the resulting Gaussian. So, calculate
$$ f'(x_{t-1}) = -\frac{\sqrt{\alpha_t}\,(x_t - \sqrt{\alpha_t}\,x_{t-1})}{1-\alpha_t} + \frac{x_{t-1} - \sqrt{\bar{\alpha}_{t-1}}\,x_0}{1-\bar{\alpha}_{t-1}}. $$
Setting $f'(x_{t-1}) = 0$ yields
$$ x_{t-1} = \left(\frac{\sqrt{\alpha_t}}{1-\alpha_t}\, x_t + \frac{\sqrt{\bar{\alpha}_{t-1}}}{1-\bar{\alpha}_{t-1}}\, x_0\right) \Big/ \left(\frac{\alpha_t}{1-\alpha_t} + \frac{1}{1-\bar{\alpha}_{t-1}}\right). $$
Note that $\alpha_t \bar{\alpha}_{t-1} = \bar{\alpha}_t$, so
$$ \mu_q(x_t, x_0) \stackrel{\text{def}}{=} \frac{\sqrt{\alpha_t}\,(1-\bar{\alpha}_{t-1})\, x_t + \sqrt{\bar{\alpha}_{t-1}}\,(1-\alpha_t)\, x_0}{1-\bar{\alpha}_t}. $$
Similarly, since the curvature of $f$ is the inverse variance of the Gaussian,
$$ f''(x_{t-1}) = \frac{\alpha_t}{1-\alpha_t} + \frac{1}{1-\bar{\alpha}_{t-1}} = \frac{1-\bar{\alpha}_t}{(1-\alpha_t)(1-\bar{\alpha}_{t-1})}. $$
And this gives us
$$ \sigma_q^2(t) \stackrel{\text{def}}{=} \frac{(1-\alpha_t)(1-\bar{\alpha}_{t-1})}{1-\bar{\alpha}_t}. $$
Now we know the distribution of $x_{t-1}$ given $x_t$ and $x_0$:
$$ q(x_{t-1}|x_t, x_0) = \mathcal{N}\big(x_{t-1};\ \mu_q(x_t, x_0),\ \sigma_q^2(t)\, I\big). $$
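As a quick numerical sanity check (not part of the derivation), we can compare this closed form against brute-force Bayes on a 1D grid; the $\alpha$ values below are arbitrary test numbers:

```python
import numpy as np
from scipy.stats import norm

alpha_t, abar_tm1 = 0.95, 0.5
abar_t = alpha_t * abar_tm1
x0, xt = 1.3, 0.4

grid, dx = np.linspace(-5, 5, 20001, retstep=True)
# Bayes numerator: q(x_t | x_{t-1}) * q(x_{t-1} | x_0), then normalize
post = norm.pdf(xt, np.sqrt(alpha_t) * grid, np.sqrt(1 - alpha_t)) \
     * norm.pdf(grid, np.sqrt(abar_tm1) * x0, np.sqrt(1 - abar_tm1))
post /= post.sum() * dx

mu_q = (np.sqrt(alpha_t) * (1 - abar_tm1) * xt
        + np.sqrt(abar_tm1) * (1 - alpha_t) * x0) / (1 - abar_t)
var_q = (1 - alpha_t) * (1 - abar_tm1) / (1 - abar_t)

print((grid * post).sum() * dx, mu_q)                  # the two means agree
print(((grid - mu_q) ** 2 * post).sum() * dx, var_q)   # the two variances agree
```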
Therefore, we can calculate the ELBO of VDM! Through some boring algebraic calculation (assuming $p_{\theta}(x_{t-1}|x_t)$ is Gaussian with the same variance $\sigma_q^2(t)$, so that the KL divergence between the two Gaussians has a closed form):
$$ D_{\text{KL}}\big(q(x_{t-1}|x_t, x_0)\,\|\,p_{\theta}(x_{t-1}|x_t)\big) = \frac{1}{2\sigma_q^2(t)}\, \big\|\mu_{\theta}(x_t) - \mu_q(x_t, x_0)\big\|^2. $$
So we get
$$ \text{ELBO} = \mathbb{E}_{q(x_1|x_0)}\big[\log p_{\theta}(x_0|x_1)\big] - D_{\text{KL}}\big(q(x_T|x_0)\,\|\,p(x_T)\big) - \sum_{t=2}^{T} \mathbb{E}_{q(x_t|x_0)}\left[\frac{1}{2\sigma_q^2(t)}\, \big\|\mu_{\theta}(x_t) - \mu_q(x_t, x_0)\big\|^2\right]. $$
We don't need to train the second term: each step of $q(x_t|x_{t-1})$ is fixed, so there are no parameters to learn (the $\alpha_t$'s are typically human-designed), and $p(x_T)$ is also fixed as a white Gaussian $\mathcal{N}(0, I)$.
Training and Inference
In the ELBO we know
$$ q(x_{t-1}|x_t, x_0) = \mathcal{N}\big(x_{t-1};\ \mu_q(x_t, x_0),\ \sigma_q^2(t)\, I\big). $$
But what about $p_{\theta}(x_{t-1}|x_t)$? Like in VAE's decoder, we could define it as
$$ p_{\theta}(x_{t-1}|x_t) = \mathcal{N}\big(x_{t-1};\ \mu_{\theta}(x_t),\ \sigma_q^2(t)\, I\big), \qquad \mu_{\theta}(x_t) \stackrel{\text{def}}{=} \frac{\sqrt{\alpha_t}\,(1-\bar{\alpha}_{t-1})\, x_t + \sqrt{\bar{\alpha}_{t-1}}\,(1-\alpha_t)\, \hat{x}_{\theta}(x_t)}{1-\bar{\alpha}_t}, $$
where $\hat{x}_{\theta}$ is a trainable network with parameter $\theta$. Now we get
$$ \big\|\mu_{\theta}(x_t) - \mu_q(x_t, x_0)\big\|^2 = \frac{\bar{\alpha}_{t-1}\,(1-\alpha_t)^2}{(1-\bar{\alpha}_t)^2}\, \big\|\hat{x}_{\theta}(x_t) - x_0\big\|^2. $$
Therefore the ELBO can be simplified into
$$ \text{ELBO}_{\theta} = \mathbb{E}_{q(x_1|x_0)}\big[\log p_{\theta}(x_0|x_1)\big] - \sum_{t=2}^{T} \frac{1}{2\sigma_q^2(t)}\, \frac{\bar{\alpha}_{t-1}\,(1-\alpha_t)^2}{(1-\bar{\alpha}_t)^2}\, \mathbb{E}_{q(x_t|x_0)}\Big[\big\|\hat{x}_{\theta}(x_t) - x_0\big\|^2\Big], $$
where the constant prior matching term has been dropped.
The first term is (using $p_{\theta}(x_0|x_1) = \mathcal{N}\big(x_0;\ \mu_{\theta}(x_1),\ \sigma_q^2(1)\, I\big)$ and the convention $\bar{\alpha}_0 = 1$, which gives $\mu_{\theta}(x_1) = \hat{x}_{\theta}(x_1)$)
$$ \mathbb{E}_{q(x_1|x_0)}\big[\log p_{\theta}(x_0|x_1)\big] = -\frac{1}{2\sigma_q^2(1)}\, \mathbb{E}_{q(x_1|x_0)}\Big[\big\|\hat{x}_{\theta}(x_1) - x_0\big\|^2\Big] + \text{const}. $$
Now we get a sum of the same form for every $t$ from $1$ to $T$, and therefore the loss function
$$ \theta^* = \arg\min_{\theta}\ \sum_{t=1}^{T} \frac{1}{2\sigma_q^2(t)}\, \frac{\bar{\alpha}_{t-1}\,(1-\alpha_t)^2}{(1-\bar{\alpha}_t)^2}\, \mathbb{E}_{q(x_t|x_0)}\Big[\big\|\hat{x}_{\theta}(x_t) - x_0\big\|^2\Big]. $$
Ignoring the constants and expectations, the main subject of interest, for a particular $t$, is
$$ \big\|\hat{x}_{\theta}(x_t) - x_0\big\|^2. $$
So this is a denoising problem: we need to find a network $\hat{x}_{\theta}$ such that the denoised image $\hat{x}_{\theta}(x_t)$ will be close to the ground truth $x_0$.
Training Process
For every image $x_0$ in the training dataset:
- Repeat the following steps until convergence
- Pick a random time stamp $t \sim \text{Uniform}\{1, \dots, T\}$
- Draw a sample $x_t \sim q(x_t|x_0)$, i.e., $x_t = \sqrt{\bar{\alpha}_t}\,x_0 + \sqrt{1-\bar{\alpha}_t}\,\epsilon$ with $\epsilon \sim \mathcal{N}(0, I)$
- Take a gradient descent step on $\nabla_{\theta}\, \big\|\hat{x}_{\theta}(x_t) - x_0\big\|^2$
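A minimal PyTorch sketch of one such training step, with the time-dependent weighting dropped as is common in practice; `x0_hat_net` is an assumed denoising network, and `alpha_bar` is the cumulative schedule from the earlier sketch (dummy entry at index 0):

```python
import torch

def vdm_loss(x0_hat_net, x0, alpha_bar):
    """One stochastic training step: sample t, corrupt x0 to x_t,
    and regress the denoised estimate back to x0."""
    b = x0.shape[0]
    t = torch.randint(1, len(alpha_bar), (b,), device=x0.device)
    abar = alpha_bar[t].view(b, *([1] * (x0.dim() - 1)))   # broadcast over pixels
    eps = torch.randn_like(x0)
    xt = abar.sqrt() * x0 + (1 - abar).sqrt() * eps        # x_t ~ q(x_t | x_0)
    return ((x0_hat_net(xt, t) - x0) ** 2).mean()          # ||x0_hat - x0||^2
```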
Inference Process
Once the denoiser $\hat{x}_{\theta}$ is trained, we can apply it to do the inference. The inference is about sampling images from the distributions $p_{\theta}(x_{t-1}|x_t)$ over the sequence of states $x_T, x_{T-1}, \dots, x_0$. Since it is the reverse diffusion process, we need to do it recursively via
$$ x_{t-1} \sim p_{\theta}(x_{t-1}|x_t) = \mathcal{N}\big(x_{t-1};\ \mu_{\theta}(x_t),\ \sigma_q^2(t)\, I\big). $$
This leads to the inferencing algorithm
- Given a white noise vector $x_T \sim \mathcal{N}(0, I)$
- Repeat the following for $t = T, T-1, \dots, 1$
- Calculate $\hat{x}_{\theta}(x_t)$ using the trained denoiser
- Update according to
$$ x_{t-1} = \frac{\sqrt{\alpha_t}\,(1-\bar{\alpha}_{t-1})}{1-\bar{\alpha}_t}\, x_t + \frac{\sqrt{\bar{\alpha}_{t-1}}\,(1-\alpha_t)}{1-\bar{\alpha}_t}\, \hat{x}_{\theta}(x_t) + \sigma_q(t)\, z, \quad z \sim \mathcal{N}(0, I) $$
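A matching PyTorch sketch of this sampling loop, under the same indexing assumptions as the training sketch ($\bar{\alpha}_0 = 1$, schedules with a dummy entry at index 0):

```python
import torch

@torch.no_grad()
def sample(x0_hat_net, alpha, alpha_bar, shape):
    """Reverse diffusion with the update rule above."""
    T = len(alpha) - 1
    x = torch.randn(shape)                                  # x_T ~ N(0, I)
    for t in range(T, 0, -1):
        a, ab, ab_prev = alpha[t], alpha_bar[t], alpha_bar[t - 1]
        x0_hat = x0_hat_net(x, torch.full((shape[0],), t))
        mean = (a.sqrt() * (1 - ab_prev) * x
                + ab_prev.sqrt() * (1 - a) * x0_hat) / (1 - ab)
        if t > 1:
            var = (1 - a) * (1 - ab_prev) / (1 - ab)        # sigma_q^2(t)
            x = mean + var.sqrt() * torch.randn_like(x)
        else:
            x = mean                                        # no noise at the final step
    return x
```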
Inversion by Direct Denoising (InDI)
If we look at the updating equation above, we could see the following form:
$$ x_{t-1} = \underbrace{\frac{\sqrt{\alpha_t}\,(1-\bar{\alpha}_{t-1})}{1-\bar{\alpha}_t}}_{\text{how much to keep}}\, x_t + \underbrace{\frac{\sqrt{\bar{\alpha}_{t-1}}\,(1-\alpha_t)}{1-\bar{\alpha}_t}}_{\text{how much to denoise}}\, \text{denoise}(x_t) + \text{noise}. $$
The first term is easy to understand, but what is "denoise"?
More about Denoise
Denoising is a generic procedure that removes noise from a noisy image. Given the observation model
$$ y = x + \text{noise}, $$
a classical solution is to find an estimator $\hat{x}(y)$ such that the mean squared error $\mathbb{E}\big[\|\hat{x}(y) - x\|^2\big]$ is minimized, which is
$$ \hat{x}(y) = \mathbb{E}[x \mid y]. $$
So, if we assume that during the forward path
$$ x_t = \sqrt{\bar{\alpha}_t}\,x_0 + \sqrt{1-\bar{\alpha}_t}\,\epsilon, \quad \epsilon \sim \mathcal{N}(0, I), $$
we can find the solution of the denoiser:
$$ \text{denoise}(x_t) = \mathbb{E}[x_0 \mid x_t]. $$
Such a denoiser is called the minimum mean squared error (MMSE) denoiser. The MMSE denoiser is not the "best" denoiser; it is only the optimal denoiser with respect to the mean squared error. Since the mean squared error is not a good metric for perceptual image quality, minimizing the MSE will not necessarily give us a visually better image.
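A toy 1D illustration, assuming a standard Gaussian prior (so the posterior mean has the closed form $\mathbb{E}[x \mid y] = y/(1+s^2)$), showing that the MMSE estimate indeed achieves a lower MSE than the noisy observation itself:

```python
import numpy as np

rng = np.random.default_rng(0)
s = 0.5                               # noise standard deviation
x = rng.standard_normal(1_000_000)    # x ~ N(0, 1)
y = x + s * rng.standard_normal(x.shape)

mmse = y / (1 + s**2)                 # posterior mean under the Gaussian prior
print(np.mean((mmse - x) ** 2))       # ~ s^2 / (1 + s^2) = 0.2
print(np.mean((y - x) ** 2))          # ~ s^2 = 0.25, worse than the MMSE estimate
```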
Incremental Denoising Steps
The previous section tells us that an MMSE denoiser is equivalent to the conditional expectation of the posterior distribution. Now we introduce incremental denoising. Suppose that we have a clean image $x$ and a noisy image $y$. Our goal is to form a linear combination of $x$ and $y$ via a simple equation
$$ x(t) = (1-t)\,x + t\,y, \quad t \in [0, 1]. $$
Now, consider a small step $\Delta t$; then it holds that
$$ x(t - \Delta t) = \left(1 - \frac{\Delta t}{t}\right) x(t) + \frac{\Delta t}{t}\, x. $$
If we define $\hat{x}(t-\Delta t)$ as the left-hand side, replace $x$ by $\mathbb{E}[x \mid x(t)]$, and write $\mathbb{E}[x \mid x(t)]$ as $\text{denoise}(\hat{x}(t), t)$, then the above equation will become
$$ \hat{x}(t - \Delta t) = \left(1 - \frac{\Delta t}{t}\right) \hat{x}(t) + \frac{\Delta t}{t}\, \text{denoise}(\hat{x}(t), t), $$
where $\Delta t$ is a small step in time. This equation gives us an inference step. If you tell us the denoiser and suppose that you start with a noisy image $y = \hat{x}(1)$, then you can iteratively apply this equation to retrieve the images $\hat{x}(1-\Delta t), \hat{x}(1-2\Delta t), \dots, \hat{x}(0)$.
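A hedged sketch of this incremental inference loop; `denoise(x, t)` stands for an assumed trained MMSE denoiser:

```python
import torch

@torch.no_grad()
def indi_restore(denoise, y, steps=100):
    """Incremental denoising: start from the degraded image y at t = 1
    and walk back to t = 0 with the update rule above."""
    x = y.clone()
    dt = 1.0 / steps
    for i in range(steps, 0, -1):
        t = i * dt
        x = (1 - dt / t) * x + (dt / t) * denoise(x, t)
    return x
```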
Equivalent Interpretations
Predicting noises
As discussed above, a variational diffusion model can be trained by simply learning a neural network $\hat{x}_{\theta}(x_t, t)$ to predict the original natural image $x_0$ from an arbitrarily noised version $x_t$ and its time index $t$.
This interpretation comes from the reparameterization of $q(x_t|x_0)$, that is,
$$ x_t = \sqrt{\bar{\alpha}_t}\,x_0 + \sqrt{1-\bar{\alpha}_t}\,\epsilon, \quad \epsilon \sim \mathcal{N}(0, I). $$
And we can rewrite this to
$$ x_0 = \frac{x_t - \sqrt{1-\bar{\alpha}_t}\,\epsilon}{\sqrt{\bar{\alpha}_t}}. $$
Now we can change $\mu_q(x_t, x_0)$ to a function of $\epsilon$ rather than $x_0$, that is,
$$ \mu_q(x_t, x_0) = \frac{1}{\sqrt{\alpha_t}}\, x_t - \frac{1-\alpha_t}{\sqrt{1-\bar{\alpha}_t}\,\sqrt{\alpha_t}}\, \epsilon. $$
Therefore, now we could set our predicted transition mean as
$$ \mu_{\theta}(x_t) = \frac{1}{\sqrt{\alpha_t}}\, x_t - \frac{1-\alpha_t}{\sqrt{1-\bar{\alpha}_t}\,\sqrt{\alpha_t}}\, \hat{\epsilon}_{\theta}(x_t, t). $$
And after some calculations, we just need to optimize
$$ \theta^* = \arg\min_{\theta}\ \sum_{t=1}^{T} \frac{1}{2\sigma_q^2(t)}\, \frac{(1-\alpha_t)^2}{(1-\bar{\alpha}_t)\,\alpha_t}\, \mathbb{E}_{q(x_t|x_0)}\Big[\big\|\hat{\epsilon}_{\theta}(x_t, t) - \epsilon\big\|^2\Big]. $$
We have therefore shown that learning a VDM by predicting the original image $x_0$ is equivalent to learning to predict the noise $\epsilon$; empirically, however, some works have found that predicting the noise results in better performance.
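The corresponding training sketch differs from the $x_0$-prediction one only in the regression target (again with the weighting dropped, as in the simplified objective); `eps_net` is an assumed noise-prediction network:

```python
import torch

def eps_loss(eps_net, x0, alpha_bar):
    """Noise-prediction variant: identical corruption, but the network
    regresses the injected eps instead of x0."""
    b = x0.shape[0]
    t = torch.randint(1, len(alpha_bar), (b,), device=x0.device)
    abar = alpha_bar[t].view(b, *([1] * (x0.dim() - 1)))
    eps = torch.randn_like(x0)
    xt = abar.sqrt() * x0 + (1 - abar).sqrt() * eps
    return ((eps_net(xt, t) - eps) ** 2).mean()
```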
Predicting scores
As we have concluded,
$$ q(x_t|x_0) = \mathcal{N}\big(x_t;\ \sqrt{\bar{\alpha}_t}\,x_0,\ (1-\bar{\alpha}_t)\, I\big). $$
By Tweedie's formula (for a Gaussian $z \sim \mathcal{N}(\mu_z, \Sigma_z)$, it holds that $\mathbb{E}[\mu_z \mid z] = z + \Sigma_z \nabla_z \log p(z)$), we have
$$ \mathbb{E}\big[\sqrt{\bar{\alpha}_t}\,x_0 \mid x_t\big] = x_t + (1-\bar{\alpha}_t)\, \nabla_{x_t} \log p(x_t). $$
And we know the best estimate for the true mean of $x_t$ given $x_0$ is $\sqrt{\bar{\alpha}_t}\,x_0$; now we get (writing $\nabla_{x_t} \log p(x_t)$ as $\nabla \log p(x_t)$ for convenience)
$$ x_0 = \frac{x_t + (1-\bar{\alpha}_t)\, \nabla \log p(x_t)}{\sqrt{\bar{\alpha}_t}}. $$
Now we get another way to express $\mu_q(x_t, x_0)$:
$$ \mu_q(x_t, x_0) = \frac{1}{\sqrt{\alpha_t}}\, x_t + \frac{1-\alpha_t}{\sqrt{\alpha_t}}\, \nabla \log p(x_t). $$
Therefore, we can also set our approximate denoising transition mean as
$$ \mu_{\theta}(x_t) = \frac{1}{\sqrt{\alpha_t}}\, x_t + \frac{1-\alpha_t}{\sqrt{\alpha_t}}\, s_{\theta}(x_t, t). $$
And the corresponding optimization problem becomes
$$ \theta^* = \arg\min_{\theta}\ \sum_{t=1}^{T} \frac{1}{2\sigma_q^2(t)}\, \frac{(1-\alpha_t)^2}{\alpha_t}\, \mathbb{E}_{q(x_t|x_0)}\Big[\big\|s_{\theta}(x_t, t) - \nabla \log p(x_t)\big\|^2\Big]. $$
One more problem is how to calculate $\nabla \log p(x_t)$. This is easy if we apply the relationship used in predicting noises:
$$ x_0 = \frac{x_t + (1-\bar{\alpha}_t)\, \nabla \log p(x_t)}{\sqrt{\bar{\alpha}_t}} = \frac{x_t - \sqrt{1-\bar{\alpha}_t}\,\epsilon}{\sqrt{\bar{\alpha}_t}}. $$
This is equal to
$$ \nabla \log p(x_t) = -\frac{1}{\sqrt{1-\bar{\alpha}_t}}\, \epsilon. $$
Now we see that the score function is simply a scaled negative of the source noise $\epsilon$.
Tip
The score function measures how to move in data space to maximize the log probability; intuitively, since the source noise is added to a natural image to corrupt it, moving in its opposite direction "denoises" the image and would be the best update to increase the subsequent log probability.
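Since the three parameterizations ($x_0$, $\epsilon$, score) differ only by affine reparameterizations of the same quantity, converting between them is a one-liner each; a small sketch (the function names are ours, not from the text):

```python
def eps_to_x0(eps_hat, xt, abar_t):
    """x0 = (x_t - sqrt(1 - abar_t) * eps) / sqrt(abar_t)"""
    return (xt - (1 - abar_t) ** 0.5 * eps_hat) / abar_t ** 0.5

def eps_to_score(eps_hat, abar_t):
    """score = -eps / sqrt(1 - abar_t)"""
    return -eps_hat / (1 - abar_t) ** 0.5
```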
Score-Matching Langevin Dynamics (SMLD)
In this section we dive into the intuition of the score-matching interpretation.
Energy-based Models
To begin to understand why optimizing a score function makes sense, we shall first introduce energy-based models. Recall the form of the Boltzmann distribution:
$$ p_{\theta}(x) = \frac{\exp(-f_{\theta}(x))}{Z_{\theta}}, $$
where $f_{\theta}(x)$ is an arbitrarily flexible, parameterizable function called the energy function, and $Z_{\theta} = \int \exp(-f_{\theta}(x))\, dx$ is a normalizing constant to ensure that $\int p_{\theta}(x)\, dx = 1$. One way to learn this distribution is MLE, but this requires tractably computing the normalizing constant $Z_{\theta}$, which may not be possible for complex $f_{\theta}$.
One way to avoid calculating or modeling the normalization constant is by using a neural network $s_{\theta}(x)$ to learn the score function $\nabla_x \log p(x)$ of the distribution instead. This can be motivated by taking the derivative of the log of both sides of the equation above:
$$ \nabla_x \log p_{\theta}(x) = -\nabla_x f_{\theta}(x), $$
which can be freely represented as a neural network without involving any normalization constants. The score model can be optimized by minimizing the Fisher divergence with the ground truth score function:
$$ \mathbb{E}_{p(x)}\Big[\big\|s_{\theta}(x) - \nabla_x \log p(x)\big\|^2\Big]. $$
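A toy sketch of this objective in a setting where the ground-truth score is available analytically (here $p(x) = \mathcal{N}(0, 1)$, so $\nabla_x \log p(x) = -x$); in practice the true score is unknown, so this is for intuition only:

```python
import torch

score_net = torch.nn.Sequential(
    torch.nn.Linear(1, 64), torch.nn.SiLU(), torch.nn.Linear(64, 1))
opt = torch.optim.Adam(score_net.parameters(), lr=1e-3)

for step in range(2000):
    x = torch.randn(256, 1)          # samples from p(x) = N(0, 1)
    true_score = -x                  # grad_x log N(x; 0, 1)
    loss = ((score_net(x) - true_score) ** 2).mean()  # Fisher divergence
    opt.zero_grad(); loss.backward(); opt.step()
```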
Langevin Dynamics
Imagine that we are given a distribution $p(x)$ and suppose that we want to draw samples from it. Langevin dynamics is an iterative procedure that allows us to draw samples according to the following equation:
$$ x_{t+1} = x_t + \tau\, \nabla_x \log p(x_t) + \sqrt{2\tau}\, z, \quad z \sim \mathcal{N}(0, I), $$
where $\tau$ is the step size which users can control, and $z$ is white Gaussian noise.
Note that if we ignore the noise term at the end, the Langevin dynamics equation is literally gradient descent on the energy $-\log p(x)$ (equivalently, gradient ascent on $\log p(x)$). The step is carefully chosen so that the iterates $x_t$ will converge to the distribution $p(x)$.
The goal of this noise-free descent process is to find an $x$ where the probability is the highest, which is equivalent to solving the optimization
$$ x^* = \arg\max_x\ \log p(x). $$
Warning
Note that this is not MLE. In maximum likelihood, the data point is fixed but the model parameters are changing. Here, the model parameters are fixed but the data point is changing.
And now we can understand the noise term $\sqrt{2\tau}\,z$: it literally changes the gradient descent into stochastic gradient descent. Instead of shooting for the deterministic optimum, the stochastic iterates climb the hill randomly, so that there is a lower probability of getting stuck at a local maximum.
However, we actually do not care about the maximum of $\log p(x)$, which is the final goal of the descent process. Instead, we care about the sampling process itself. As long as our descent algorithm is good at finding peaks, then by repeatedly initializing it at uniformly distributed locations, we will eventually collect samples that follow the distribution we want to match.
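A minimal sketch of this sampler; with the analytic score of $\mathcal{N}(0, 1)$ plugged in, the histogram of the returned samples approaches a standard Gaussian:

```python
import torch

def langevin_sample(score, x_init, tau=1e-2, steps=1000):
    """Langevin dynamics: x <- x + tau * score(x) + sqrt(2 * tau) * z."""
    x = x_init.clone()
    for _ in range(steps):
        z = torch.randn_like(x)
        x = x + tau * score(x) + (2 * tau) ** 0.5 * z
    return x

# e.g. with the analytic score of N(0, 1), starting uniformly in [-3, 3]:
samples = langevin_sample(lambda x: -x, torch.rand(10000, 1) * 6 - 3)
```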
Info
Suppose we initialize uniformly distributed samples and run Langevin updates for a number of steps. The histograms of the generated samples are shown in the figures below.
In summary, Langevin dynamics gives us a more intuitive view of the score-matching problem in energy-based models. In energy-based models, we learn the distribution $p(x)$ by matching a score network $s_{\theta}(x)$ with $\nabla_x \log p(x)$. On the other hand, if we have a good estimate of the drift term in Langevin dynamics, the sampling process will do better at finding a peak in the neighborhood of its initial state, so that by initializing the samples uniformly over the interval we are interested in, the collected samples will follow the $p(x)$ we want to approximate.