Model

A variational auto-encoder (VAE) is an approach to generating images from a latent code. The name “variational” comes from the fact that we use probability distributions to describe $x$ and $z$. Instead of resorting to a deterministic procedure of converting $x$ to $z$, we are more interested in ensuring that the distribution $p(x)$ can be mapped to a desired distribution $p(z)$, and in going backwards from $p(z)$ to $p(x)$. Since it is hard to directly access $p(x|z)$ (the decoder) and $p(z|x)$ (the encoder), we consider the following two proxy distributions to approximate them:

  • $q_\phi(z|x)$: the proxy for $p(z|x)$, with learnable parameter $\phi$. We will make it Gaussian to simplify the computation.
  • $p_\theta(x|z)$: the proxy for $p(x|z)$, with learnable parameter $\theta$. We will also make it Gaussian to simplify the computation.

So the whole procedure of the VAE can be summarized as

$$x \xrightarrow{\; q_\phi(z|x) \;} z \xrightarrow{\; p_\theta(x|z) \;} \hat{x}.$$
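This encode–sample–decode pipeline can be sketched in a few lines of numpy. This is a toy illustration only: the random linear maps and the names `encode`/`decode` are placeholders standing in for the real encoder and decoder networks.

```python
import numpy as np

rng = np.random.default_rng(0)
d_x, d_z = 8, 2  # toy dimensions for the image x and the latent z

# Placeholder "networks": random linear maps standing in for the real
# encoder and decoder DNNs (illustrative only).
W_mu = rng.normal(size=(d_z, d_x))
W_logvar = rng.normal(size=(d_z, d_x))
W_dec = rng.normal(size=(d_x, d_z))

def encode(x):
    """q_phi(z|x): predict the mean and log-variance of a Gaussian over z."""
    return W_mu @ x, W_logvar @ x

def decode(z):
    """p_theta(x|z): map a latent z back to image space."""
    return W_dec @ z

x = rng.normal(size=d_x)                  # an "image"
mu, logvar = encode(x)                    # x -> parameters of q_phi(z|x)
z = rng.normal(mu, np.exp(0.5 * logvar))  # sample z from that Gaussian
x_hat = decode(z)                         # z -> reconstructed image
print(x_hat.shape)  # (8,)
```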

ELBO in VAE Setting

In variational inference, minimizing the difference (here, the KL divergence) between two probability distributions is equivalent to maximizing the ELBO. Here, we need to minimize $\mathrm{KL}\big(q_\phi(z|x) \,\|\, p(z|x)\big)$, that is, to maximize

$$\mathrm{ELBO}(x) = \mathbb{E}_{q_\phi(z|x)}\!\left[\log \frac{p(x, z)}{q_\phi(z|x)}\right].$$

However, the ELBO above may not be too useful because it involves $p(x, z)$, something we have no access to. So we need to do something more:

$$\begin{aligned}
\mathrm{ELBO}(x) &= \mathbb{E}_{q_\phi(z|x)}\!\left[\log \frac{p(x, z)}{q_\phi(z|x)}\right] \\
&= \mathbb{E}_{q_\phi(z|x)}\!\left[\log \frac{p(x|z)\, p(z)}{q_\phi(z|x)}\right] \\
&= \mathbb{E}_{q_\phi(z|x)}\big[\log p(x|z)\big] - \mathrm{KL}\big(q_\phi(z|x) \,\|\, p(z)\big) \\
&= \underbrace{\mathbb{E}_{q_\phi(z|x)}\big[\log p_\theta(x|z)\big]}_{\text{reconstruction}} - \underbrace{\mathrm{KL}\big(q_\phi(z|x) \,\|\, p(z)\big)}_{\text{prior matching}},
\end{aligned}$$

where in the last line we replace $p(x|z)$ by its proxy $p_\theta(x|z)$, and there are two terms:

  • Reconstruction. The first term is about the decoder. We want the decoder to produce a good image $x$ if we feed a latent $z$ into it, so we want to maximize $\log p_\theta(x|z)$. We sample $z$ from the distribution $q_\phi(z|x)$, and the goal of the decoder $p_\theta(x|z)$ is to approximate the true $p(x|z)$. The expectation here is taken with respect to the samples $z$ conditioned on $x$.
  • Prior Matching. The second term is the KL divergence for the encoder. We want the encoder to turn $x$ into a latent vector such that the latent vector follows our choice of a (well-behaved) distribution $p(z)$, such as a Gaussian distribution.

To conclude, the training goal is:

  • Decoder. For given $\phi$, find $\theta$ to maximize $\mathbb{E}_{q_\phi(z|x)}\big[\log p_\theta(x|z)\big]$.
  • Encoder. For given $\theta$, find $\phi$ to minimize $\mathrm{KL}\big(q_\phi(z|x) \,\|\, p(z)\big)$.

Training VAE

Encoder

We know that $z$ is generated from the distribution $q_\phi(z|x)$. We also know that $q_\phi(z|x)$ should be as simple as a Gaussian. Assume that for any given input $x$ this Gaussian has a mean $\mu_\phi(x)$ and a covariance matrix $\sigma^2_\phi(x) I$. We use a deep neural network (DNN) to predict them:

$$\big(\mu_\phi(x),\, \sigma^2_\phi(x)\big) = \mathrm{DNN}_\phi(x).$$

Therefore, the samples $z$ can be drawn from the Gaussian distribution

$$q_\phi(z|x) = \mathcal{N}\big(z \mid \mu_\phi(x),\, \sigma^2_\phi(x) I\big).$$

Decoder

The decoder is implemented through a neural network, denoted as $f_\theta$. The job of the decoder network is to take a latent variable $z$ and generate an image $\hat{x}$:

$$\hat{x} = f_\theta(z).$$

Let's make one more assumption: that the error between the decoded image $f_\theta(z)$ and the ground-truth image $x$ is Gaussian, that is,

$$\big(x - f_\theta(z)\big) \sim \mathcal{N}\big(0,\, \sigma_{\mathrm{dec}}^2 I\big).$$

Then it follows that the distribution $p_\theta(x|z)$ is Gaussian, with log-likelihood

$$\log p_\theta(x|z) = -\frac{\|x - f_\theta(z)\|^2}{2\sigma_{\mathrm{dec}}^2} - \frac{D}{2}\log\big(2\pi\sigma_{\mathrm{dec}}^2\big),$$

where $D$ is the dimension of $x$. This equation says that maximizing the likelihood term in the ELBO is literally just minimizing the $\ell_2$ loss between the decoded image and the ground truth.
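A small numpy check makes this concrete. Assuming $\sigma_{\mathrm{dec}} = 1$, the helper `gaussian_loglik` (an illustrative name, not from the text) differs from $-\tfrac{1}{2}\|x - \hat{x}\|^2$ only by a constant:

```python
import numpy as np

def gaussian_loglik(x, x_hat, sigma=1.0):
    """log N(x | x_hat, sigma^2 I) for a D-dimensional x."""
    d = x.size
    return (-np.sum((x - x_hat)**2) / (2 * sigma**2)
            - 0.5 * d * np.log(2 * np.pi * sigma**2))

x = np.array([1.0, 2.0, 3.0])
x_hat = np.array([1.1, 1.9, 3.2])

# The normalization constant cancels in the difference, leaving exactly
# -0.5 * ||x - x_hat||^2, so maximizing the log-likelihood is the same
# as minimizing the l2 reconstruction loss.
print(gaussian_loglik(x, x_hat) - gaussian_loglik(x, x))
```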

Loss Function

To approximate the expectation, we use Monte-Carlo simulation:

$$\mathbb{E}_{q_\phi(z|x)}\big[\log p_\theta(x|z)\big] \approx \frac{1}{L}\sum_{\ell=1}^{L} \log p_\theta\big(x^{(\ell)} \mid z^{(\ell)}\big),$$

where $x^{(\ell)}$ is the $\ell$-th sample in the training set, and $z^{(\ell)}$ is sampled from the distribution

$$z^{(\ell)} \sim q_\phi\big(z \mid x^{(\ell)}\big).$$
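The Monte-Carlo idea itself is easy to sanity-check on a toy quantity (not the VAE loss): an expectation under a distribution is approximated by an empirical average over samples from it.

```python
import numpy as np

rng = np.random.default_rng(1)

# Approximate E_{z ~ N(0,1)}[z^2] (true value 1) by an empirical mean,
# exactly as the training loss averages log p_theta over sampled latents.
z = rng.normal(size=100_000)
est = np.mean(z**2)
print(est)
```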

Now we have the training loss of the VAE:

$$\mathcal{L}(\phi, \theta) = -\frac{1}{L}\sum_{\ell=1}^{L}\Big[\log p_\theta\big(x^{(\ell)} \mid z^{(\ell)}\big) - \mathrm{KL}\big(q_\phi(z \mid x^{(\ell)}) \,\|\, p(z)\big)\Big],$$

where the first term can be simplified to the $\ell_2$ loss between $x^{(\ell)}$ and $f_\theta(z^{(\ell)})$ as mentioned above, and the second term can be simplified by the closed-form solution of the KL divergence between two Gaussian distributions:

$$\mathrm{KL}\big(\mathcal{N}(\mu_0, \Sigma_0) \,\|\, \mathcal{N}(\mu_1, \Sigma_1)\big) = \frac{1}{2}\left[\mathrm{tr}\big(\Sigma_1^{-1}\Sigma_0\big) + (\mu_1 - \mu_0)^\top \Sigma_1^{-1} (\mu_1 - \mu_0) - d + \log\frac{\det \Sigma_1}{\det \Sigma_0}\right],$$

and in this case $p(z) = \mathcal{N}(0, I)$, thus

$$\mathrm{KL}\big(q_\phi(z|x) \,\|\, p(z)\big) = \frac{1}{2}\sum_{j=1}^{d}\Big(\sigma_j^2 + \mu_j^2 - 1 - \log \sigma_j^2\Big),$$

where $d$ is the dimension of the vector $z$.
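The closed-form KL against the standard normal can be verified numerically. The sketch below (toy values, assuming a 2-dimensional diagonal-Gaussian posterior) compares the closed form with a Monte-Carlo estimate of $\mathbb{E}_q[\log q(z) - \log p(z)]$:

```python
import numpy as np

rng = np.random.default_rng(0)
mu = np.array([0.5, -1.0])
logvar = np.array([0.2, -0.3])
sigma2 = np.exp(logvar)

# Closed form: KL( N(mu, diag(sigma2)) || N(0, I) )
kl_closed = 0.5 * np.sum(sigma2 + mu**2 - 1.0 - logvar)

# Monte-Carlo estimate of the same KL: average of log q(z) - log p(z)
# over samples z ~ q.
z = mu + np.sqrt(sigma2) * rng.normal(size=(200_000, 2))
log_q = -0.5 * np.sum((z - mu)**2 / sigma2 + np.log(2 * np.pi * sigma2), axis=1)
log_p = -0.5 * np.sum(z**2 + np.log(2 * np.pi), axis=1)
kl_mc = np.mean(log_q - log_p)

print(round(kl_closed, 3))  # → 0.656
```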

Reparameterization

Note that the latent variable $z$ in the loss function is sampled from $q_\phi(z|x)$, which cannot be differentiated during the back-propagation process. So we need to express $z$ as some differentiable transformation of another random variable $\epsilon$, given $x$ and $\phi$:

$$z = g(\epsilon, \phi, x),$$

where the distribution of the random variable $\epsilon$ is independent of $x$ and $\phi$.

Specifically, the sample $z \sim \mathcal{N}\big(\mu_\phi(x), \sigma^2_\phi(x) I\big)$ can be written as

$$z = \mu_\phi(x) + \sigma_\phi(x) \odot \epsilon,$$

where $\epsilon \sim \mathcal{N}(0, I)$. In this way, the gradient of the loss function can be back-propagated to the parameter $\phi$.
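A quick numpy sketch (toy values for the mean and standard deviation) shows that the reparameterized sample has the intended distribution while the sampling itself touches only $\epsilon$:

```python
import numpy as np

rng = np.random.default_rng(0)
mu = np.array([2.0, -1.0])
sigma = np.array([0.5, 1.5])

# Reparameterization: the randomness lives entirely in eps ~ N(0, I), so
# z = mu + sigma * eps is a deterministic, differentiable function of
# (mu, sigma) through which gradients can flow.
eps = rng.normal(size=(100_000, 2))
z = mu + sigma * eps

# Empirically, z follows N(mu, diag(sigma^2)):
print(z.mean(axis=0), z.std(axis=0))
```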

Visualization of Latent Space

The main benefit of a variational autoencoder is that we're capable of learning smooth latent state representations of the input data.

For standard autoencoders, we simply need to learn an encoding which allows us to reproduce the input. As you can see in the left-most figure, focusing only on reconstruction loss does allow us to separate out the classes (in this case, MNIST digits).

However, there's an uneven distribution of data within the latent space. In other words, there are areas in the latent space which don't represent any of our observed data, so we cannot simply sample from the latent space to generate realistic images. On the flip side, if we focus only on ensuring that the latent distribution is similar to the prior distribution (through our KL divergence loss term), we end up describing every observation using the same unit Gaussian, so we fail to describe the original data from the latent space.