Model

A variational auto-encoder (VAE) is an approach to generating images from a latent code. The name “variational” comes from the fact that we use probability distributions to describe $x$ and $z$. Instead of resorting to a deterministic procedure of converting $x$ to $z$, we are more interested in ensuring that the distribution $p(x)$ can be mapped to a desired distribution $p(z)$, and in going backwards from $p(z)$ to $p(x)$. Since it is hard to directly access $p(x|z)$ (the decoder) and $p(z|x)$ (the encoder), we consider the following two proxy distributions to approximate them:

  • $q_\phi(z|x)$: the proxy for $p(z|x)$, with learnable parameter $\phi$. We will make it Gaussian to simplify the computation.
  • $p_\theta(x|z)$: the proxy for $p(x|z)$, with learnable parameter $\theta$. We will also make it Gaussian to simplify the computation.

So the whole procedure of the VAE can be summarized as

$$x \xrightarrow{\; q_\phi(z|x) \;} z \xrightarrow{\; p_\theta(x|z) \;} \hat{x}.$$
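This encode–sample–decode pipeline can be sketched in a few lines of numpy. This is a toy illustration only: the random linear maps and the names `encode`/`decode` are placeholders standing in for the real encoder and decoder networks.

```python
import numpy as np

rng = np.random.default_rng(0)
d_x, d_z = 8, 2  # toy dimensions for the image x and the latent z

# Placeholder "networks": random linear maps standing in for the real
# encoder and decoder DNNs (illustrative only).
W_mu = rng.normal(size=(d_z, d_x))
W_logvar = rng.normal(size=(d_z, d_x))
W_dec = rng.normal(size=(d_x, d_z))

def encode(x):
    """q_phi(z|x): predict the mean and log-variance of a Gaussian over z."""
    return W_mu @ x, W_logvar @ x

def decode(z):
    """p_theta(x|z): map a latent z back to image space."""
    return W_dec @ z

x = rng.normal(size=d_x)                  # an "image"
mu, logvar = encode(x)                    # x -> parameters of q_phi(z|x)
z = rng.normal(mu, np.exp(0.5 * logvar))  # sample z from that Gaussian
x_hat = decode(z)                         # z -> reconstructed image
print(x_hat.shape)  # (8,)
```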

ELBO in VAE Setting

In variational inference, minimizing the difference (here, the KL divergence) between two probability distributions is equivalent to maximizing the ELBO. Here, we need to minimize $\mathrm{KL}\big(q_\phi(z|x) \,\|\, p(z|x)\big)$, that is, to maximize

$$\mathrm{ELBO}(x) = \mathbb{E}_{q_\phi(z|x)}\!\left[\log \frac{p(x, z)}{q_\phi(z|x)}\right].$$

However, the ELBO above may not be too useful because it involves $p(x, z)$, something we have no access to. So we need to do something more:

$$\begin{aligned}
\mathrm{ELBO}(x) &= \mathbb{E}_{q_\phi(z|x)}\!\left[\log \frac{p(x, z)}{q_\phi(z|x)}\right] \\
&= \mathbb{E}_{q_\phi(z|x)}\!\left[\log \frac{p(x|z)\, p(z)}{q_\phi(z|x)}\right] \\
&= \mathbb{E}_{q_\phi(z|x)}\big[\log p(x|z)\big] - \mathrm{KL}\big(q_\phi(z|x) \,\|\, p(z)\big) \\
&= \underbrace{\mathbb{E}_{q_\phi(z|x)}\big[\log p_\theta(x|z)\big]}_{\text{reconstruction}} - \underbrace{\mathrm{KL}\big(q_\phi(z|x) \,\|\, p(z)\big)}_{\text{prior matching}},
\end{aligned}$$

where in the last line we replace $p(x|z)$ by its proxy $p_\theta(x|z)$, and there are two terms:

  • Reconstruction. The first term is about the decoder. We want the decoder to produce a good image $x$ if we feed a latent $z$ into it, so we want to maximize $\log p_\theta(x|z)$. We sample $z$ from the distribution $q_\phi(z|x)$, and the goal of the decoder $p_\theta(x|z)$ is to approximate the true $p(x|z)$. The expectation here is taken with respect to the samples $z$ conditioned on $x$.
  • Prior Matching. The second term is the KL divergence for the encoder. We want the encoder to turn $x$ into a latent vector such that the latent vector follows our choice of a (well-behaved) distribution $p(z)$, such as a Gaussian distribution.

To conclude, the training goal is:

  • Decoder. For given $\phi$, find $\theta$ to maximize $\mathbb{E}_{q_\phi(z|x)}\big[\log p_\theta(x|z)\big]$.
  • Encoder. For given $\theta$, find $\phi$ to minimize $\mathrm{KL}\big(q_\phi(z|x) \,\|\, p(z)\big)$.

Training VAE

Encoder

We know that $z$ is generated from the distribution $q_\phi(z|x)$. We also know that $q_\phi(z|x)$ should be as simple as a Gaussian. Assume that for any given input $x$ this Gaussian has a mean $\mu_\phi(x)$ and a covariance matrix $\sigma^2_\phi(x) I$. We use a deep neural network (DNN) to predict them:

$$\big(\mu_\phi(x),\, \sigma^2_\phi(x)\big) = \mathrm{DNN}_\phi(x).$$

Therefore, the samples $z$ can be drawn from the Gaussian distribution

$$q_\phi(z|x) = \mathcal{N}\big(z \mid \mu_\phi(x),\, \sigma^2_\phi(x) I\big).$$

Decoder

The decoder is implemented through a neural network, denoted as $f_\theta$. The job of the decoder network is to take a latent variable $z$ and generate an image $\hat{x}$:

$$\hat{x} = f_\theta(z).$$

Let's make one more assumption: that the error between the decoded image $f_\theta(z)$ and the ground-truth image $x$ is Gaussian, that is,

$$\big(x - f_\theta(z)\big) \sim \mathcal{N}\big(0,\, \sigma_{\mathrm{dec}}^2 I\big).$$

Then it follows that the distribution $p_\theta(x|z)$ is Gaussian, with log-likelihood

$$\log p_\theta(x|z) = -\frac{\|x - f_\theta(z)\|^2}{2\sigma_{\mathrm{dec}}^2} - \frac{D}{2}\log\big(2\pi\sigma_{\mathrm{dec}}^2\big),$$

where $D$ is the dimension of $x$. This equation says that maximizing the likelihood term in the ELBO is literally just minimizing the $\ell_2$ loss between the decoded image and the ground truth.
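A small numpy check makes this concrete. Assuming $\sigma_{\mathrm{dec}} = 1$, the helper `gaussian_loglik` (an illustrative name, not from the text) differs from $-\tfrac{1}{2}\|x - \hat{x}\|^2$ only by a constant:

```python
import numpy as np

def gaussian_loglik(x, x_hat, sigma=1.0):
    """log N(x | x_hat, sigma^2 I) for a D-dimensional x."""
    d = x.size
    return (-np.sum((x - x_hat)**2) / (2 * sigma**2)
            - 0.5 * d * np.log(2 * np.pi * sigma**2))

x = np.array([1.0, 2.0, 3.0])
x_hat = np.array([1.1, 1.9, 3.2])

# The normalization constant cancels in the difference, leaving exactly
# -0.5 * ||x - x_hat||^2, so maximizing the log-likelihood is the same
# as minimizing the l2 reconstruction loss.
print(gaussian_loglik(x, x_hat) - gaussian_loglik(x, x))
```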

Loss Function

To approximate the expectation, we use Monte-Carlo simulation:

$$\mathbb{E}_{q_\phi(z|x)}\big[\log p_\theta(x|z)\big] \approx \frac{1}{L}\sum_{\ell=1}^{L} \log p_\theta\big(x^{(\ell)} \mid z^{(\ell)}\big),$$

where $x^{(\ell)}$ is the $\ell$-th sample in the training set, and $z^{(\ell)}$ is sampled from the distribution

$$z^{(\ell)} \sim q_\phi\big(z \mid x^{(\ell)}\big).$$
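The Monte-Carlo idea itself is easy to sanity-check on a toy quantity (not the VAE loss): an expectation under a distribution is approximated by an empirical average over samples from it.

```python
import numpy as np

rng = np.random.default_rng(1)

# Approximate E_{z ~ N(0,1)}[z^2] (true value 1) by an empirical mean,
# exactly as the training loss averages log p_theta over sampled latents.
z = rng.normal(size=100_000)
est = np.mean(z**2)
print(est)
```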

Now we have the training loss of the VAE:

$$\mathcal{L}(\phi, \theta) = -\frac{1}{L}\sum_{\ell=1}^{L}\Big[\log p_\theta\big(x^{(\ell)} \mid z^{(\ell)}\big) - \mathrm{KL}\big(q_\phi(z \mid x^{(\ell)}) \,\|\, p(z)\big)\Big],$$

where the first term can be simplified to the $\ell_2$ loss between $x^{(\ell)}$ and $f_\theta(z^{(\ell)})$ as mentioned above, and the second term can be simplified by the closed-form solution of the KL divergence between two Gaussian distributions:

$$\mathrm{KL}\big(\mathcal{N}(\mu_0, \Sigma_0) \,\|\, \mathcal{N}(\mu_1, \Sigma_1)\big) = \frac{1}{2}\left[\mathrm{tr}\big(\Sigma_1^{-1}\Sigma_0\big) + (\mu_1 - \mu_0)^\top \Sigma_1^{-1} (\mu_1 - \mu_0) - d + \log\frac{\det \Sigma_1}{\det \Sigma_0}\right],$$

and in this case $p(z) = \mathcal{N}(0, I)$, thus

$$\mathrm{KL}\big(q_\phi(z|x) \,\|\, p(z)\big) = \frac{1}{2}\sum_{j=1}^{d}\Big(\sigma_j^2 + \mu_j^2 - 1 - \log \sigma_j^2\Big),$$

where $d$ is the dimension of the vector $z$.
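The closed-form KL against the standard normal can be verified numerically. The sketch below (toy values, assuming a 2-dimensional diagonal-Gaussian posterior) compares the closed form with a Monte-Carlo estimate of $\mathbb{E}_q[\log q(z) - \log p(z)]$:

```python
import numpy as np

rng = np.random.default_rng(0)
mu = np.array([0.5, -1.0])
logvar = np.array([0.2, -0.3])
sigma2 = np.exp(logvar)

# Closed form: KL( N(mu, diag(sigma2)) || N(0, I) )
kl_closed = 0.5 * np.sum(sigma2 + mu**2 - 1.0 - logvar)

# Monte-Carlo estimate of the same KL: average of log q(z) - log p(z)
# over samples z ~ q.
z = mu + np.sqrt(sigma2) * rng.normal(size=(200_000, 2))
log_q = -0.5 * np.sum((z - mu)**2 / sigma2 + np.log(2 * np.pi * sigma2), axis=1)
log_p = -0.5 * np.sum(z**2 + np.log(2 * np.pi), axis=1)
kl_mc = np.mean(log_q - log_p)

print(round(kl_closed, 3))  # → 0.656
```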

Reparameterization

Note that the latent variable $z$ in the loss function is sampled from $q_\phi(z|x)$, which cannot be differentiated during the back-propagation process. So we need to express $z$ as some differentiable transformation of another random variable $\epsilon$, given $x$ and $\phi$:

$$z = g(\epsilon, \phi, x),$$

where the distribution of the random variable $\epsilon$ is independent of $x$ and $\phi$.

Specifically, the sample $z \sim \mathcal{N}\big(\mu_\phi(x), \sigma^2_\phi(x) I\big)$ can be written as

$$z = \mu_\phi(x) + \sigma_\phi(x) \odot \epsilon,$$

where $\epsilon \sim \mathcal{N}(0, I)$. In this way, the gradient of the loss function can be back-propagated to the parameter $\phi$.
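A quick numpy sketch (toy values for the mean and standard deviation) shows that the reparameterized sample has the intended distribution while the sampling itself touches only $\epsilon$:

```python
import numpy as np

rng = np.random.default_rng(0)
mu = np.array([2.0, -1.0])
sigma = np.array([0.5, 1.5])

# Reparameterization: the randomness lives entirely in eps ~ N(0, I), so
# z = mu + sigma * eps is a deterministic, differentiable function of
# (mu, sigma) through which gradients can flow.
eps = rng.normal(size=(100_000, 2))
z = mu + sigma * eps

# Empirically, z follows N(mu, diag(sigma^2)):
print(z.mean(axis=0), z.std(axis=0))
```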

Visualization of Latent Space

The main benefit of a variational autoencoder is that we're capable of learning smooth latent state representations of the input data.

For standard autoencoders, we simply need to learn an encoding which allows us to reproduce the input. As you can see in the left-most figure, focusing only on reconstruction loss does allow us to separate out the classes (in this case, MNIST digits).

However, there's an uneven distribution of data within the latent space. In other words, there are areas in the latent space which don't represent any of our observed data, so we cannot simply sample from the latent space to generate realistic images. On the flip side, if we focus only on ensuring that the latent distribution is similar to the prior distribution (through our KL divergence loss term), we end up describing every observation using the same unit Gaussian, so we fail to describe the original data from the latent space.