Overview
Diffusion models generate images through a sequence of incremental updates; assembling the whole chain of states gives us an encoder-decoder structure. The transition from one state to another is realized by a denoiser.
The model with this structure is called the variational diffusion model (VDM). The model has a sequence of states $x_0, x_1, \dots, x_T$:
- $x_0$: the original image, which plays the same role as the input in a VAE
- $x_T$: the final latent variable, analogous to the latent in a VAE, and we want to make it white Gaussian: $x_T \sim \mathcal{N}(0, I)$
- $x_1, \dots, x_{T-1}$: the intermediate states, which are also latent variables, but not white Gaussian
Building Blocks
Transition Block
The $t$-th transition block consists of three states $x_{t-1}$, $x_t$, and $x_{t+1}$. There are two possible paths to get to state $x_t$:
- The forward transition that goes from $x_{t-1}$ to $x_t$. The associated transition distribution is $p(x_t|x_{t-1})$. That is, we can sample $x_t$ from $p(x_t|x_{t-1})$ for a given $x_{t-1}$. Just like in VAE, we do not have access to $p(x_t|x_{t-1})$, so we will approximate it by a Gaussian $q(x_t|x_{t-1})$.
- The reverse transition goes from $x_{t+1}$ to $x_t$. Again, we never know $p(x_t|x_{t+1})$, so we use another Gaussian $p_{\theta}(x_t|x_{t+1})$ to approximate it.
Initial Block
Just a transition block without the forward path: there is no state before $x_0$, so only the reverse transition $p_{\theta}(x_0|x_1)$ is involved.
Final Block
Just a transition block without the reverse path: there is no state after $x_T$, so only the forward transition $q(x_T|x_{T-1})$ is involved, and the final variable should be white Gaussian, $x_T \sim \mathcal{N}(0, I)$.
Transition Distribution
In a denoising diffusion probabilistic model (DDPM), the transition distribution $q(x_t|x_{t-1})$ is defined (yes, this is defined, not derived by training) as
$$ q(x_t|x_{t-1}) = \mathcal{N}\big(x_t;\ \sqrt{\alpha_t}\,x_{t-1},\ (1-\alpha_t)\, I\big). $$
That is, the transition distribution is a Gaussian with mean $\sqrt{\alpha_t}\,x_{t-1}$ and variance $(1-\alpha_t)I$. The choice of the scaling factor $\sqrt{\alpha_t}$ is to make sure that the variance magnitude is preserved, so that it will neither explode nor vanish after many iterations.
Tip
$\sqrt{\alpha_t}$ and $1-\alpha_t$ are chosen so that the distribution of $x_t$ will become $\mathcal{N}(0, I)$ when $t$ is large enough.
From this transition distribution, we can obtain how $x_t$ will be distributed if we are given $x_0$:
$$ q(x_t|x_0) = \mathcal{N}\big(x_t;\ \sqrt{\bar{\alpha}_t}\,x_0,\ (1-\bar{\alpha}_t)\, I\big), $$
where $\bar{\alpha}_t \stackrel{\text{def}}{=} \prod_{i=1}^{t} \alpha_i$.
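As a concrete illustration, here is a minimal PyTorch sketch of this one-shot forward sampling; the linear schedule for $\alpha_t$ is an assumption for the example, not something prescribed by the text:

```python
import torch

T = 1000
beta = torch.linspace(1e-4, 0.02, T + 1)   # assumed linear schedule, indexed 0..T
beta[0] = 0.0                              # dummy entry so that indices 1..T are used
alpha = 1.0 - beta
alpha_bar = torch.cumprod(alpha, dim=0)    # alpha_bar[t] = prod_{i<=t} alpha_i; alpha_bar[0] = 1

def q_sample(x0, t):
    """Draw x_t ~ q(x_t | x_0) = N(sqrt(abar_t) x0, (1 - abar_t) I) in one shot."""
    eps = torch.randn_like(x0)
    return alpha_bar[t].sqrt() * x0 + (1 - alpha_bar[t]).sqrt() * eps
```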
Evidence Lower Bound
Like in VAE, we need to maximize the ELBO to train the model. The ELBO in the variational diffusion model is
$$ \text{ELBO} = \underbrace{\mathbb{E}_{q(x_1|x_0)}\big[\log p_{\theta}(x_0|x_1)\big]}_{\text{reconstruction}} - \underbrace{\mathbb{E}_{q(x_{T-1}|x_0)}\Big[D_{\text{KL}}\big(q(x_T|x_{T-1})\,\|\,p(x_T)\big)\Big]}_{\text{prior matching}} - \underbrace{\sum_{t=1}^{T-1} \mathbb{E}_{q(x_{t-1}, x_{t+1}|x_0)}\Big[D_{\text{KL}}\big(q(x_t|x_{t-1})\,\|\,p_{\theta}(x_t|x_{t+1})\big)\Big]}_{\text{consistency}}. $$
The ELBO here consists of three components:
- Reconstruction. This term is based on the initial block. We use the log-likelihood $\log p_{\theta}(x_0|x_1)$ to measure how well the neural network associated with $p_{\theta}(x_0|x_1)$ can recover $x_0$ from the latent variable $x_1$. The expectation is taken with respect to the samples $x_1$ drawn from $q(x_1|x_0)$ during the encoding process.
- Prior Matching. The prior matching term is based on the final block. We use the KL divergence to measure the difference between $q(x_T|x_{T-1})$ and $p(x_T)$. We want how $x_T$ is generated, $q(x_T|x_{T-1})$, to be as close to our desired white Gaussian, namely $p(x_T) = \mathcal{N}(0, I)$, as possible. The samples here are $x_{T-1}$, drawn from $q(x_{T-1}|x_0)$, because this distribution provides the forward sample generation process.
- Consistency. The consistency term is based on the transition blocks, which uses the KL divergence to measure the deviation between the forward path $q(x_t|x_{t-1})$ and the reverse path $p_{\theta}(x_t|x_{t+1})$.
Proof of ELBO
Recall by the definition of ELBO, we have
$$ \log p(x) \geq \mathbb{E}_{q(z|x)}\left[\log \frac{p(x, z)}{q(z|x)}\right] \stackrel{\text{def}}{=} \text{ELBO}. $$
This means that the ELBO is a lower bound on the evidence $\log p(x)$, and now we try to estimate it. Let's define the notation: $x_{s:t}$ means the collection of all state variables from $x_s$ to $x_t$. We recall that $p(x_0)$ is the distribution of the original image $x_0$, so
$$ \log p(x_0) = \log \int p(x_{0:T})\, dx_{1:T} = \log \mathbb{E}_{q(x_{1:T}|x_0)}\left[\frac{p(x_{0:T})}{q(x_{1:T}|x_0)}\right]. $$
With Jensen's inequality, which states that for any random variable $X$ and any concave function $f$ it holds that $f(\mathbb{E}[X]) \geq \mathbb{E}[f(X)]$, we have
$$ \log p(x_0) \geq \mathbb{E}_{q(x_{1:T}|x_0)}\left[\log \frac{p(x_{0:T})}{q(x_{1:T}|x_0)}\right] = \mathbb{E}_{q(x_{1:T}|x_0)}\left[\log \frac{p(x_T)\, p_{\theta}(x_0|x_1)}{q(x_T|x_{T-1})}\right] + \mathbb{E}_{q(x_{1:T}|x_0)}\left[\log \prod_{t=1}^{T-1} \frac{p_{\theta}(x_t|x_{t+1})}{q(x_t|x_{t-1})}\right], $$
where the first term can be decomposed into two parts:
$$ \mathbb{E}_{q(x_{1:T}|x_0)}\left[\log \frac{p(x_T)\, p_{\theta}(x_0|x_1)}{q(x_T|x_{T-1})}\right] = \underbrace{\mathbb{E}_{q(x_1|x_0)}\big[\log p_{\theta}(x_0|x_1)\big]}_{\text{reconstruction}} - \underbrace{\mathbb{E}_{q(x_{T-1}|x_0)}\Big[D_{\text{KL}}\big(q(x_T|x_{T-1})\,\|\,p(x_T)\big)\Big]}_{\text{prior matching}}, $$
and the second term is
$$ \mathbb{E}_{q(x_{1:T}|x_0)}\left[\log \prod_{t=1}^{T-1} \frac{p_{\theta}(x_t|x_{t+1})}{q(x_t|x_{t-1})}\right] = -\sum_{t=1}^{T-1} \mathbb{E}_{q(x_{t-1}, x_{t+1}|x_0)}\Big[D_{\text{KL}}\big(q(x_t|x_{t-1})\,\|\,p_{\theta}(x_t|x_{t+1})\big)\Big], $$
where the last line marginalized each expectation over the states that do not appear inside the logarithm and then applied the definition of the KL divergence.
Rewrite the Consistency Term
If we want to maximize the ELBO above, we need to handle two opposite directions: the forward $q(x_t|x_{t-1})$ and the reverse $p_{\theta}(x_t|x_{t+1})$ in the consistency and prior matching terms. However, we could use Bayes' Theorem to avoid this:
$$ q(x_t|x_{t-1}) = q(x_t|x_{t-1}, x_0) = \frac{q(x_{t-1}|x_t, x_0)\, q(x_t|x_0)}{q(x_{t-1}|x_0)}. $$
The "$q(x_t|x_{t-1}) = q(x_t|x_{t-1}, x_0)$" step conditions the probability on $x_0$; it holds because the forward process is Markovian, so given $x_{t-1}$ the state $x_t$ does not depend on $x_0$. By introducing this conditioning on $x_0$, we aim to incorporate information about the full data trajectory when estimating $q(x_{t-1}|x_t)$. For diffusion models, the reverse process can now be estimated with a noise predictor that approximates the distribution of $x_{t-1}$ given $x_t$ and $x_0$. The ELBO maximization will then involve minimizing the KL divergence between the learned model $p_{\theta}(x_{t-1}|x_t)$ and the posterior $q(x_{t-1}|x_t, x_0)$; that is the new consistency term.
To be specific, starting from the Jensen inequality in the section Proof of ELBO above:
$$ \log p(x_0) \geq \mathbb{E}_{q(x_{1:T}|x_0)}\left[\log \frac{p(x_T)\prod_{t=1}^{T} p_{\theta}(x_{t-1}|x_t)}{\prod_{t=1}^{T} q(x_t|x_{t-1})}\right] = \mathbb{E}_{q(x_{1:T}|x_0)}\left[\log \frac{p(x_T)\, p_{\theta}(x_0|x_1)}{q(x_1|x_0)} + \log \prod_{t=2}^{T} \frac{p_{\theta}(x_{t-1}|x_t)}{q(x_t|x_{t-1})}\right]. $$
And now we can apply Bayes' Theorem in the second term:
$$ \prod_{t=2}^{T} \frac{p_{\theta}(x_{t-1}|x_t)}{q(x_t|x_{t-1})} = \prod_{t=2}^{T} \frac{p_{\theta}(x_{t-1}|x_t)\, q(x_{t-1}|x_0)}{q(x_{t-1}|x_t, x_0)\, q(x_t|x_0)} = \frac{q(x_1|x_0)}{q(x_T|x_0)} \prod_{t=2}^{T} \frac{p_{\theta}(x_{t-1}|x_t)}{q(x_{t-1}|x_t, x_0)}, $$
since the ratio $\prod_{t=2}^{T} q(x_{t-1}|x_0)/q(x_t|x_0)$ telescopes. Then we could continue the inequality:
$$ \log p(x_0) \geq \mathbb{E}_{q(x_{1:T}|x_0)}\left[\log \frac{p(x_T)\, p_{\theta}(x_0|x_1)}{q(x_T|x_0)} + \sum_{t=2}^{T} \log \frac{p_{\theta}(x_{t-1}|x_t)}{q(x_{t-1}|x_t, x_0)}\right]. $$
Now we have the new ELBO, where
$$ \text{ELBO} = \underbrace{\mathbb{E}_{q(x_1|x_0)}\big[\log p_{\theta}(x_0|x_1)\big]}_{\text{reconstruction}} - \underbrace{D_{\text{KL}}\big(q(x_T|x_0)\,\|\,p(x_T)\big)}_{\text{prior matching}}, $$
and
$$ -\underbrace{\sum_{t=2}^{T} \mathbb{E}_{q(x_t|x_0)}\Big[D_{\text{KL}}\big(q(x_{t-1}|x_t, x_0)\,\|\,p_{\theta}(x_{t-1}|x_t)\big)\Big]}_{\text{consistency}}. $$
To conclude
- Reconstruction. The new reconstruction term is the same as before. We are still maximizing the log-likelihood.
- Prior Matching. The new prior matching is simplified to the KL divergence between $q(x_T|x_0)$ and $p(x_T)$. The change is due to the fact that we now condition upon $x_0$. Thus, there is no need to draw samples $x_{T-1} \sim q(x_{T-1}|x_0)$ and take the expectation.
- Consistency. The new consistency term is different from the previous one in two ways. Firstly, the running index $t$ starts at $2$ and ends at $T$. Previously it was from $1$ to $T-1$. Accompanied with this is the distribution matching, which is now between $q(x_{t-1}|x_t, x_0)$ and $p_{\theta}(x_{t-1}|x_t)$. So, instead of asking a forward transition to match with a reverse transition, we use $x_0$ to construct a reverse transition $q(x_{t-1}|x_t, x_0)$ and use it to match with $p_{\theta}(x_{t-1}|x_t)$.
Derivation of the transition distribution given the initial state
Recall the definition of $q(x_t|x_0)$:
$$ q(x_t|x_0) = \mathcal{N}\big(x_t;\ \sqrt{\bar{\alpha}_t}\,x_0,\ (1-\bar{\alpha}_t)\, I\big), $$
and the transition distribution
$$ q(x_t|x_{t-1}) = \mathcal{N}\big(x_t;\ \sqrt{\alpha_t}\,x_{t-1},\ (1-\alpha_t)\, I\big), $$
with how we got $q(x_{t-1}|x_t, x_0)$ via Bayes' Theorem:
$$ q(x_{t-1}|x_t, x_0) = \frac{q(x_t|x_{t-1}, x_0)\, q(x_{t-1}|x_0)}{q(x_t|x_0)} = \frac{q(x_t|x_{t-1})\, q(x_{t-1}|x_0)}{q(x_t|x_0)}. $$
Then we get
$$ q(x_{t-1}|x_t, x_0) = \frac{\mathcal{N}\big(x_t;\ \sqrt{\alpha_t}\,x_{t-1},\ (1-\alpha_t)I\big)\ \mathcal{N}\big(x_{t-1};\ \sqrt{\bar{\alpha}_{t-1}}\,x_0,\ (1-\bar{\alpha}_{t-1})I\big)}{\mathcal{N}\big(x_t;\ \sqrt{\bar{\alpha}_t}\,x_0,\ (1-\bar{\alpha}_t)I\big)}. $$
Therefore
$$ q(x_{t-1}|x_t, x_0) \propto \exp\left\{-\frac{\|x_t - \sqrt{\alpha_t}\,x_{t-1}\|^2}{2(1-\alpha_t)} - \frac{\|x_{t-1} - \sqrt{\bar{\alpha}_{t-1}}\,x_0\|^2}{2(1-\bar{\alpha}_{t-1})}\right\}, $$
where the denominator is absorbed into the proportionality constant since it does not depend on $x_{t-1}$.
We know the product of Gaussians must also be a Gaussian, thus we just need to find the mean and variance of the product. For simplicity, treat the vectors as scalars. Consider the following mapping:
$$ f(x_{t-1}) \stackrel{\text{def}}{=} \frac{(x_t - \sqrt{\alpha_t}\,x_{t-1})^2}{2(1-\alpha_t)} + \frac{(x_{t-1} - \sqrt{\bar{\alpha}_{t-1}}\,x_0)^2}{2(1-\bar{\alpha}_{t-1})}, $$
so that $q(x_{t-1}|x_t, x_0) \propto \exp\{-f(x_{t-1})\}$. Then we need to analyze the function $f$.
The minimizer of $f$ is the mean of the resulting Gaussian. So, calculate
$$ f'(x_{t-1}) = -\frac{\sqrt{\alpha_t}\,(x_t - \sqrt{\alpha_t}\,x_{t-1})}{1-\alpha_t} + \frac{x_{t-1} - \sqrt{\bar{\alpha}_{t-1}}\,x_0}{1-\bar{\alpha}_{t-1}}. $$
Setting $f'(x_{t-1}) = 0$ yields
$$ x_{t-1} = \left(\frac{\sqrt{\alpha_t}}{1-\alpha_t}\, x_t + \frac{\sqrt{\bar{\alpha}_{t-1}}}{1-\bar{\alpha}_{t-1}}\, x_0\right) \Big/ \left(\frac{\alpha_t}{1-\alpha_t} + \frac{1}{1-\bar{\alpha}_{t-1}}\right). $$
Note that $\alpha_t \bar{\alpha}_{t-1} = \bar{\alpha}_t$, so
$$ \mu_q(x_t, x_0) \stackrel{\text{def}}{=} \frac{\sqrt{\alpha_t}\,(1-\bar{\alpha}_{t-1})\, x_t + \sqrt{\bar{\alpha}_{t-1}}\,(1-\alpha_t)\, x_0}{1-\bar{\alpha}_t}. $$
Similarly, since the curvature of $f$ is the inverse variance of the Gaussian,
$$ f''(x_{t-1}) = \frac{\alpha_t}{1-\alpha_t} + \frac{1}{1-\bar{\alpha}_{t-1}} = \frac{1-\bar{\alpha}_t}{(1-\alpha_t)(1-\bar{\alpha}_{t-1})}. $$
And this gives us
$$ \sigma_q^2(t) \stackrel{\text{def}}{=} \frac{(1-\alpha_t)(1-\bar{\alpha}_{t-1})}{1-\bar{\alpha}_t}. $$
Now we know the distribution of $x_{t-1}$ given $x_t$ and $x_0$:
$$ q(x_{t-1}|x_t, x_0) = \mathcal{N}\big(x_{t-1};\ \mu_q(x_t, x_0),\ \sigma_q^2(t)\, I\big). $$
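As a quick numerical sanity check (not part of the derivation), we can compare this closed form against brute-force Bayes on a 1D grid; the $\alpha$ values below are arbitrary test numbers:

```python
import numpy as np
from scipy.stats import norm

alpha_t, abar_tm1 = 0.95, 0.5
abar_t = alpha_t * abar_tm1
x0, xt = 1.3, 0.4

grid, dx = np.linspace(-5, 5, 20001, retstep=True)
# Bayes numerator: q(x_t | x_{t-1}) * q(x_{t-1} | x_0), then normalize
post = norm.pdf(xt, np.sqrt(alpha_t) * grid, np.sqrt(1 - alpha_t)) \
     * norm.pdf(grid, np.sqrt(abar_tm1) * x0, np.sqrt(1 - abar_tm1))
post /= post.sum() * dx

mu_q = (np.sqrt(alpha_t) * (1 - abar_tm1) * xt
        + np.sqrt(abar_tm1) * (1 - alpha_t) * x0) / (1 - abar_t)
var_q = (1 - alpha_t) * (1 - abar_tm1) / (1 - abar_t)

print((grid * post).sum() * dx, mu_q)                  # the two means agree
print(((grid - mu_q) ** 2 * post).sum() * dx, var_q)   # the two variances agree
```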
Therefore, we can calculate the ELBO of VDM! Through some boring algebraic calculation (assuming $p_{\theta}(x_{t-1}|x_t)$ is Gaussian with the same variance $\sigma_q^2(t)$, so that the KL divergence between the two Gaussians has a closed form):
$$ D_{\text{KL}}\big(q(x_{t-1}|x_t, x_0)\,\|\,p_{\theta}(x_{t-1}|x_t)\big) = \frac{1}{2\sigma_q^2(t)}\, \big\|\mu_{\theta}(x_t) - \mu_q(x_t, x_0)\big\|^2. $$
So we get
$$ \text{ELBO} = \mathbb{E}_{q(x_1|x_0)}\big[\log p_{\theta}(x_0|x_1)\big] - D_{\text{KL}}\big(q(x_T|x_0)\,\|\,p(x_T)\big) - \sum_{t=2}^{T} \mathbb{E}_{q(x_t|x_0)}\left[\frac{1}{2\sigma_q^2(t)}\, \big\|\mu_{\theta}(x_t) - \mu_q(x_t, x_0)\big\|^2\right]. $$
We don't need to train the second term: each step of $q(x_t|x_{t-1})$ is fixed, so there are no parameters to learn (the $\alpha_t$'s are typically human-designed), and $p(x_T)$ is also fixed as a white Gaussian $\mathcal{N}(0, I)$.
Training and Inference
In the ELBO we know
$$ q(x_{t-1}|x_t, x_0) = \mathcal{N}\big(x_{t-1};\ \mu_q(x_t, x_0),\ \sigma_q^2(t)\, I\big). $$
But what about $p_{\theta}(x_{t-1}|x_t)$? Like in VAE's decoder, we could define it as
$$ p_{\theta}(x_{t-1}|x_t) = \mathcal{N}\big(x_{t-1};\ \mu_{\theta}(x_t),\ \sigma_q^2(t)\, I\big), \qquad \mu_{\theta}(x_t) \stackrel{\text{def}}{=} \frac{\sqrt{\alpha_t}\,(1-\bar{\alpha}_{t-1})\, x_t + \sqrt{\bar{\alpha}_{t-1}}\,(1-\alpha_t)\, \hat{x}_{\theta}(x_t)}{1-\bar{\alpha}_t}, $$
where $\hat{x}_{\theta}$ is a trainable network with parameter $\theta$. Now we get
$$ \big\|\mu_{\theta}(x_t) - \mu_q(x_t, x_0)\big\|^2 = \frac{\bar{\alpha}_{t-1}\,(1-\alpha_t)^2}{(1-\bar{\alpha}_t)^2}\, \big\|\hat{x}_{\theta}(x_t) - x_0\big\|^2. $$
Therefore the ELBO can be simplified into
$$ \text{ELBO}_{\theta} = \mathbb{E}_{q(x_1|x_0)}\big[\log p_{\theta}(x_0|x_1)\big] - \sum_{t=2}^{T} \frac{1}{2\sigma_q^2(t)}\, \frac{\bar{\alpha}_{t-1}\,(1-\alpha_t)^2}{(1-\bar{\alpha}_t)^2}\, \mathbb{E}_{q(x_t|x_0)}\Big[\big\|\hat{x}_{\theta}(x_t) - x_0\big\|^2\Big], $$
where the constant prior matching term has been dropped.
The first term is (using $p_{\theta}(x_0|x_1) = \mathcal{N}\big(x_0;\ \mu_{\theta}(x_1),\ \sigma_q^2(1)\, I\big)$ and the convention $\bar{\alpha}_0 = 1$, which gives $\mu_{\theta}(x_1) = \hat{x}_{\theta}(x_1)$)
$$ \mathbb{E}_{q(x_1|x_0)}\big[\log p_{\theta}(x_0|x_1)\big] = -\frac{1}{2\sigma_q^2(1)}\, \mathbb{E}_{q(x_1|x_0)}\Big[\big\|\hat{x}_{\theta}(x_1) - x_0\big\|^2\Big] + \text{const}. $$
Now we get a sum of the same form for every $t$ from $1$ to $T$, and therefore the loss function
$$ \theta^* = \arg\min_{\theta}\ \sum_{t=1}^{T} \frac{1}{2\sigma_q^2(t)}\, \frac{\bar{\alpha}_{t-1}\,(1-\alpha_t)^2}{(1-\bar{\alpha}_t)^2}\, \mathbb{E}_{q(x_t|x_0)}\Big[\big\|\hat{x}_{\theta}(x_t) - x_0\big\|^2\Big]. $$
Ignoring the constants and expectations, the main subject of interest, for a particular $t$, is
$$ \big\|\hat{x}_{\theta}(x_t) - x_0\big\|^2. $$
So this is a denoising problem: we need to find a network $\hat{x}_{\theta}$ such that the denoised image $\hat{x}_{\theta}(x_t)$ will be close to the ground truth $x_0$.
Training Process
For every image $x_0$ in the training dataset:
- Repeat the following steps until convergence
- Pick a random time stamp $t \sim \text{Uniform}\{1, \dots, T\}$
- Draw a sample $x_t \sim q(x_t|x_0)$, i.e., $x_t = \sqrt{\bar{\alpha}_t}\,x_0 + \sqrt{1-\bar{\alpha}_t}\,\epsilon$ with $\epsilon \sim \mathcal{N}(0, I)$
- Take a gradient descent step on $\nabla_{\theta}\, \big\|\hat{x}_{\theta}(x_t) - x_0\big\|^2$
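A minimal PyTorch sketch of one such training step, with the time-dependent weighting dropped as is common in practice; `x0_hat_net` is an assumed denoising network, and `alpha_bar` is the cumulative schedule from the earlier sketch (dummy entry at index 0):

```python
import torch

def vdm_loss(x0_hat_net, x0, alpha_bar):
    """One stochastic training step: sample t, corrupt x0 to x_t,
    and regress the denoised estimate back to x0."""
    b = x0.shape[0]
    t = torch.randint(1, len(alpha_bar), (b,), device=x0.device)
    abar = alpha_bar[t].view(b, *([1] * (x0.dim() - 1)))   # broadcast over pixels
    eps = torch.randn_like(x0)
    xt = abar.sqrt() * x0 + (1 - abar).sqrt() * eps        # x_t ~ q(x_t | x_0)
    return ((x0_hat_net(xt, t) - x0) ** 2).mean()          # ||x0_hat - x0||^2
```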
Inference Process
Once the denoiser $\hat{x}_{\theta}$ is trained, we can apply it to do the inference. The inference is about sampling images from the distributions $p_{\theta}(x_{t-1}|x_t)$ over the sequence of states $x_T, x_{T-1}, \dots, x_0$. Since it is the reverse diffusion process, we need to do it recursively via
$$ x_{t-1} \sim p_{\theta}(x_{t-1}|x_t) = \mathcal{N}\big(x_{t-1};\ \mu_{\theta}(x_t),\ \sigma_q^2(t)\, I\big). $$
This leads to the inferencing algorithm
- Given a white noise vector $x_T \sim \mathcal{N}(0, I)$
- Repeat the following for $t = T, T-1, \dots, 1$
- Calculate $\hat{x}_{\theta}(x_t)$ using the trained denoiser
- Update according to
$$ x_{t-1} = \frac{\sqrt{\alpha_t}\,(1-\bar{\alpha}_{t-1})}{1-\bar{\alpha}_t}\, x_t + \frac{\sqrt{\bar{\alpha}_{t-1}}\,(1-\alpha_t)}{1-\bar{\alpha}_t}\, \hat{x}_{\theta}(x_t) + \sigma_q(t)\, z, \quad z \sim \mathcal{N}(0, I) $$
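A matching PyTorch sketch of this sampling loop, under the same indexing assumptions as the training sketch ($\bar{\alpha}_0 = 1$, schedules with a dummy entry at index 0):

```python
import torch

@torch.no_grad()
def sample(x0_hat_net, alpha, alpha_bar, shape):
    """Reverse diffusion with the update rule above."""
    T = len(alpha) - 1
    x = torch.randn(shape)                                  # x_T ~ N(0, I)
    for t in range(T, 0, -1):
        a, ab, ab_prev = alpha[t], alpha_bar[t], alpha_bar[t - 1]
        x0_hat = x0_hat_net(x, torch.full((shape[0],), t))
        mean = (a.sqrt() * (1 - ab_prev) * x
                + ab_prev.sqrt() * (1 - a) * x0_hat) / (1 - ab)
        if t > 1:
            var = (1 - a) * (1 - ab_prev) / (1 - ab)        # sigma_q^2(t)
            x = mean + var.sqrt() * torch.randn_like(x)
        else:
            x = mean                                        # no noise at the final step
    return x
```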
Inversion by Direct Denoising (InDI)
If we look at the updating equation above, we could see the following form:
$$ x_{t-1} = \underbrace{\frac{\sqrt{\alpha_t}\,(1-\bar{\alpha}_{t-1})}{1-\bar{\alpha}_t}}_{\text{how much to keep}}\, x_t + \underbrace{\frac{\sqrt{\bar{\alpha}_{t-1}}\,(1-\alpha_t)}{1-\bar{\alpha}_t}}_{\text{how much to denoise}}\, \text{denoise}(x_t) + \text{noise}. $$
The first term is easy to understand, but what is "denoise"?
More about Denoise
Denoising is a generic procedure that removes noise from a noisy image. Given the observation model
$$ y = x + \text{noise}, $$
a classical solution is to find an estimator $\hat{x}(y)$ such that the mean squared error $\mathbb{E}\big[\|\hat{x}(y) - x\|^2\big]$ is minimized, which is
$$ \hat{x}(y) = \mathbb{E}[x \mid y]. $$
So, if we assume that during the forward path
$$ x_t = \sqrt{\bar{\alpha}_t}\,x_0 + \sqrt{1-\bar{\alpha}_t}\,\epsilon, \quad \epsilon \sim \mathcal{N}(0, I), $$
we can find the solution of the denoiser:
$$ \text{denoise}(x_t) = \mathbb{E}[x_0 \mid x_t]. $$
Such a denoiser is called the minimum mean squared error (MMSE) denoiser. The MMSE denoiser is not the "best" denoiser; it is only the optimal denoiser with respect to the mean squared error. Since the mean squared error is not a good metric for perceptual image quality, minimizing the MSE will not necessarily give us a visually better image.
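A toy 1D illustration, assuming a standard Gaussian prior (so the posterior mean has the closed form $\mathbb{E}[x \mid y] = y/(1+s^2)$), showing that the MMSE estimate indeed achieves a lower MSE than the noisy observation itself:

```python
import numpy as np

rng = np.random.default_rng(0)
s = 0.5                               # noise standard deviation
x = rng.standard_normal(1_000_000)    # x ~ N(0, 1)
y = x + s * rng.standard_normal(x.shape)

mmse = y / (1 + s**2)                 # posterior mean under the Gaussian prior
print(np.mean((mmse - x) ** 2))       # ~ s^2 / (1 + s^2) = 0.2
print(np.mean((y - x) ** 2))          # ~ s^2 = 0.25, worse than the MMSE estimate
```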
Incremental Denoising Steps
The previous section tells us that an MMSE denoiser is equivalent to the conditional expectation of the posterior distribution. Now we introduce incremental denoising. Suppose that we have a clean image $x$ and a noisy image $y$. Our goal is to form a linear combination of $x$ and $y$ via a simple equation
$$ x(t) = (1-t)\,x + t\,y, \quad t \in [0, 1]. $$
Now, consider a small step $\Delta t$; then it holds that
$$ x(t - \Delta t) = \left(1 - \frac{\Delta t}{t}\right) x(t) + \frac{\Delta t}{t}\, x. $$
If we define $\hat{x}(t-\Delta t)$ as the left-hand side, replace $x$ by $\mathbb{E}[x \mid x(t)]$, and write $\mathbb{E}[x \mid x(t)]$ as $\text{denoise}(\hat{x}(t), t)$, then the above equation will become
$$ \hat{x}(t - \Delta t) = \left(1 - \frac{\Delta t}{t}\right) \hat{x}(t) + \frac{\Delta t}{t}\, \text{denoise}(\hat{x}(t), t), $$
where $\Delta t$ is a small step in time. This equation gives us an inference step. If you tell us the denoiser and suppose that you start with a noisy image $y = \hat{x}(1)$, then you can iteratively apply this equation to retrieve the images $\hat{x}(1-\Delta t), \hat{x}(1-2\Delta t), \dots, \hat{x}(0)$.
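A hedged sketch of this incremental inference loop; `denoise(x, t)` stands for an assumed trained MMSE denoiser:

```python
import torch

@torch.no_grad()
def indi_restore(denoise, y, steps=100):
    """Incremental denoising: start from the degraded image y at t = 1
    and walk back to t = 0 with the update rule above."""
    x = y.clone()
    dt = 1.0 / steps
    for i in range(steps, 0, -1):
        t = i * dt
        x = (1 - dt / t) * x + (dt / t) * denoise(x, t)
    return x
```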
Equivalent Interpretations
Predicting noises
As discussed above, a variational diffusion model can be trained by simply learning a neural network $\hat{x}_{\theta}(x_t, t)$ to predict the original natural image $x_0$ from an arbitrarily noised version $x_t$ and its time index $t$.
This interpretation comes from the reparameterization of $q(x_t|x_0)$, that is,
$$ x_t = \sqrt{\bar{\alpha}_t}\,x_0 + \sqrt{1-\bar{\alpha}_t}\,\epsilon, \quad \epsilon \sim \mathcal{N}(0, I). $$
And we can rewrite this to
$$ x_0 = \frac{x_t - \sqrt{1-\bar{\alpha}_t}\,\epsilon}{\sqrt{\bar{\alpha}_t}}. $$
Now we can change $\mu_q(x_t, x_0)$ to a function of $\epsilon$ rather than $x_0$, that is,
$$ \mu_q(x_t, x_0) = \frac{1}{\sqrt{\alpha_t}}\, x_t - \frac{1-\alpha_t}{\sqrt{1-\bar{\alpha}_t}\,\sqrt{\alpha_t}}\, \epsilon. $$
Therefore, now we could set our predicted transition mean as
$$ \mu_{\theta}(x_t) = \frac{1}{\sqrt{\alpha_t}}\, x_t - \frac{1-\alpha_t}{\sqrt{1-\bar{\alpha}_t}\,\sqrt{\alpha_t}}\, \hat{\epsilon}_{\theta}(x_t, t). $$
And after some calculations, we just need to optimize
$$ \theta^* = \arg\min_{\theta}\ \sum_{t=1}^{T} \frac{1}{2\sigma_q^2(t)}\, \frac{(1-\alpha_t)^2}{(1-\bar{\alpha}_t)\,\alpha_t}\, \mathbb{E}_{q(x_t|x_0)}\Big[\big\|\hat{\epsilon}_{\theta}(x_t, t) - \epsilon\big\|^2\Big]. $$
We have therefore shown that learning a VDM by predicting the original image $x_0$ is equivalent to learning to predict the noise $\epsilon$; empirically, however, some works have found that predicting the noise results in better performance.
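The corresponding training sketch differs from the $x_0$-prediction one only in the regression target (again with the weighting dropped, as in the simplified objective); `eps_net` is an assumed noise-prediction network:

```python
import torch

def eps_loss(eps_net, x0, alpha_bar):
    """Noise-prediction variant: identical corruption, but the network
    regresses the injected eps instead of x0."""
    b = x0.shape[0]
    t = torch.randint(1, len(alpha_bar), (b,), device=x0.device)
    abar = alpha_bar[t].view(b, *([1] * (x0.dim() - 1)))
    eps = torch.randn_like(x0)
    xt = abar.sqrt() * x0 + (1 - abar).sqrt() * eps
    return ((eps_net(xt, t) - eps) ** 2).mean()
```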
Predicting scores
As we have concluded,
$$ q(x_t|x_0) = \mathcal{N}\big(x_t;\ \sqrt{\bar{\alpha}_t}\,x_0,\ (1-\bar{\alpha}_t)\, I\big). $$
By Tweedie's formula (for a Gaussian $z \sim \mathcal{N}(\mu_z, \Sigma_z)$, it holds that $\mathbb{E}[\mu_z \mid z] = z + \Sigma_z \nabla_z \log p(z)$), we have
$$ \mathbb{E}\big[\sqrt{\bar{\alpha}_t}\,x_0 \mid x_t\big] = x_t + (1-\bar{\alpha}_t)\, \nabla_{x_t} \log p(x_t). $$
And we know the best estimate for the true mean of $x_t$ given $x_0$ is $\sqrt{\bar{\alpha}_t}\,x_0$; now we get (writing $\nabla_{x_t} \log p(x_t)$ as $\nabla \log p(x_t)$ for convenience)
$$ x_0 = \frac{x_t + (1-\bar{\alpha}_t)\, \nabla \log p(x_t)}{\sqrt{\bar{\alpha}_t}}. $$
Now we get another way to express $\mu_q(x_t, x_0)$:
$$ \mu_q(x_t, x_0) = \frac{1}{\sqrt{\alpha_t}}\, x_t + \frac{1-\alpha_t}{\sqrt{\alpha_t}}\, \nabla \log p(x_t). $$
Therefore, we can also set our approximate denoising transition mean as
$$ \mu_{\theta}(x_t) = \frac{1}{\sqrt{\alpha_t}}\, x_t + \frac{1-\alpha_t}{\sqrt{\alpha_t}}\, s_{\theta}(x_t, t). $$
And the corresponding optimization problem becomes
$$ \theta^* = \arg\min_{\theta}\ \sum_{t=1}^{T} \frac{1}{2\sigma_q^2(t)}\, \frac{(1-\alpha_t)^2}{\alpha_t}\, \mathbb{E}_{q(x_t|x_0)}\Big[\big\|s_{\theta}(x_t, t) - \nabla \log p(x_t)\big\|^2\Big]. $$
One more problem is how to calculate $\nabla \log p(x_t)$. This is easy if we apply the relationship used in predicting noises:
$$ x_0 = \frac{x_t + (1-\bar{\alpha}_t)\, \nabla \log p(x_t)}{\sqrt{\bar{\alpha}_t}} = \frac{x_t - \sqrt{1-\bar{\alpha}_t}\,\epsilon}{\sqrt{\bar{\alpha}_t}}. $$
This is equal to
$$ \nabla \log p(x_t) = -\frac{1}{\sqrt{1-\bar{\alpha}_t}}\, \epsilon. $$
Now we see that the score function is simply a scaled negative of the source noise $\epsilon$.
Tip
The score function measures how to move in data space to maximize the log probability; intuitively, since the source noise is added to a natural image to corrupt it, moving in its opposite direction "denoises" the image and would be the best update to increase the subsequent log probability.
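Since the three parameterizations ($x_0$, $\epsilon$, score) differ only by affine reparameterizations of the same quantity, converting between them is a one-liner each; a small sketch (the function names are ours, not from the text):

```python
def eps_to_x0(eps_hat, xt, abar_t):
    """x0 = (x_t - sqrt(1 - abar_t) * eps) / sqrt(abar_t)"""
    return (xt - (1 - abar_t) ** 0.5 * eps_hat) / abar_t ** 0.5

def eps_to_score(eps_hat, abar_t):
    """score = -eps / sqrt(1 - abar_t)"""
    return -eps_hat / (1 - abar_t) ** 0.5
```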
Score-Matching Langevin Dynamics (SMLD)
In this section we dive into the intuition of the score-matching interpretation.
Energy-based Models
To begin to understand why optimizing a score function makes sense, we shall first introduce energy-based models. Recall the form of the Boltzmann distribution:
$$ p_{\theta}(x) = \frac{\exp(-f_{\theta}(x))}{Z_{\theta}}, $$
where $f_{\theta}(x)$ is an arbitrarily flexible, parameterizable function called the energy function, and $Z_{\theta} = \int \exp(-f_{\theta}(x))\, dx$ is a normalizing constant to ensure that $\int p_{\theta}(x)\, dx = 1$. One way to learn this distribution is MLE, but this requires tractably computing the normalizing constant $Z_{\theta}$, which may not be possible for complex $f_{\theta}$.
One way to avoid calculating or modeling the normalization constant is by using a neural network $s_{\theta}(x)$ to learn the score function $\nabla_x \log p(x)$ of the distribution instead. This can be motivated by taking the derivative of the log of both sides of the equation above:
$$ \nabla_x \log p_{\theta}(x) = -\nabla_x f_{\theta}(x), $$
which can be freely represented as a neural network without involving any normalization constants. The score model can be optimized by minimizing the Fisher divergence with the ground truth score function:
$$ \mathbb{E}_{p(x)}\Big[\big\|s_{\theta}(x) - \nabla_x \log p(x)\big\|^2\Big]. $$
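A toy sketch of this objective in a setting where the ground-truth score is available analytically (here $p(x) = \mathcal{N}(0, 1)$, so $\nabla_x \log p(x) = -x$); in practice the true score is unknown, so this is for intuition only:

```python
import torch

score_net = torch.nn.Sequential(
    torch.nn.Linear(1, 64), torch.nn.SiLU(), torch.nn.Linear(64, 1))
opt = torch.optim.Adam(score_net.parameters(), lr=1e-3)

for step in range(2000):
    x = torch.randn(256, 1)          # samples from p(x) = N(0, 1)
    true_score = -x                  # grad_x log N(x; 0, 1)
    loss = ((score_net(x) - true_score) ** 2).mean()  # Fisher divergence
    opt.zero_grad(); loss.backward(); opt.step()
```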
Langevin Dynamics
Imagine that we are given a distribution $p(x)$ and suppose that we want to draw samples from it. Langevin dynamics is an iterative procedure that allows us to draw samples according to the following equation:
$$ x_{t+1} = x_t + \tau\, \nabla_x \log p(x_t) + \sqrt{2\tau}\, z, \quad z \sim \mathcal{N}(0, I), $$
where $\tau$ is the step size which users can control, and $z$ is white Gaussian noise.
Note that if we ignore the noise term at the end, the Langevin dynamics equation is literally gradient descent on the energy $-\log p(x)$ (equivalently, gradient ascent on $\log p(x)$). The step is carefully chosen so that the iterates $x_t$ will converge to the distribution $p(x)$.
The goal of this noise-free descent process is to find an $x$ where the probability is the highest, which is equivalent to solving the optimization
$$ x^* = \arg\max_x\ \log p(x). $$
Warning
Note that this is not MLE. In maximum likelihood, the data point is fixed but the model parameters are changing. Here, the model parameters are fixed but the data point is changing.
And now we can understand the noise term $\sqrt{2\tau}\,z$: it literally changes the gradient descent into stochastic gradient descent. Instead of shooting for the deterministic optimum, the stochastic iterates climb the hill randomly, so that there is a lower probability of getting stuck at a local maximum.
However, we actually do not care about the maximum of $\log p(x)$, which is the final goal of the descent process. Instead, we care about the sampling process itself. As long as our descent algorithm is good at finding peaks, then by repeatedly initializing it at uniformly distributed locations, we will eventually collect samples that follow the distribution we want to match.
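A minimal sketch of this sampler; with the analytic score of $\mathcal{N}(0, 1)$ plugged in, the histogram of the returned samples approaches a standard Gaussian:

```python
import torch

def langevin_sample(score, x_init, tau=1e-2, steps=1000):
    """Langevin dynamics: x <- x + tau * score(x) + sqrt(2 * tau) * z."""
    x = x_init.clone()
    for _ in range(steps):
        z = torch.randn_like(x)
        x = x + tau * score(x) + (2 * tau) ** 0.5 * z
    return x

# e.g. with the analytic score of N(0, 1), starting uniformly in [-3, 3]:
samples = langevin_sample(lambda x: -x, torch.rand(10000, 1) * 6 - 3)
```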
Info
Suppose we initialize uniformly distributed samples and run Langevin updates for a number of steps. The histograms of the generated samples are shown in the figures below.
In summary, Langevin dynamics gives us a more intuitive view of the score-matching problem in energy-based models. In energy-based models, we learn the distribution $p(x)$ by matching a score network $s_{\theta}(x)$ with $\nabla_x \log p(x)$. On the other hand, if we have a good estimate of the drift term in Langevin dynamics, the sampling process will do better at finding a peak in the neighborhood of its initial state, so that by initializing the samples uniformly over the interval we are interested in, the collected samples will follow the $p(x)$ we want to approximate.