Intuition
Suppose we have an observable random variable $X$, and we want to find its true distribution $p^*$. This would allow us to generate new data by sampling from it (as in a variational autoencoder, VAE) and to estimate probabilities of future events. In general, it is impossible to find $p^*$ exactly, forcing us to search for a good approximation.
To do this, we define a sufficiently large parametric family of distributions $\{p_\theta\}_{\theta\in\Theta}$ (often built from "nice" distributions such as Gaussians), then solve $\min_\theta L(p_\theta, p^*)$ for some loss function $L$. One possible way to solve this is to consider a small variation from $p_\theta$ to $p_{\theta+\delta\theta}$ and solve $L(p_\theta, p^*) - L(p_{\theta+\delta\theta}, p^*) = 0$. This is a problem in the calculus of variations, hence the name variational method.
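For concreteness, one standard choice of loss, assumed here only for illustration, is the KL divergence, which turns the problem into

$$\min_{\theta\in\Theta} L(p_\theta, p^*) \;=\; \min_{\theta\in\Theta} D_{\mathrm{KL}}(p^* \parallel p_\theta) \;=\; \min_{\theta\in\Theta} \mathbb{E}_{x\sim p^*}\!\left[\log\frac{p^*(x)}{p_\theta(x)}\right],$$

and a stationary point is characterized by the change $L(p_{\theta+\delta\theta}, p^*) - L(p_\theta, p^*)$ vanishing to first order in $\delta\theta$.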
Variational Bayesian Inference
Since explicitly parametrized families (such as the classical Gaussian family) are usually too simple to model the true distribution, we consider implicitly parametrized probability distributions, defined by the following ingredients:
- A simple distribution $p(z)$ over a latent random variable $Z$; usually a normal distribution or a uniform distribution suffices.
- A family of complicated functions $f_\theta$ (such as deep neural networks) parametrized by $\theta$.
- A way to convert any $f_\theta$ into a simple distribution over the observable random variable $X$. For example, if $f_\theta(z) = (f_1(z), f_2(z))$ has two outputs, we can define the corresponding distribution over $X$ to be the normal distribution $\mathcal{N}(f_1(z), e^{f_2(z)})$.

Together these define a family of joint distributions $p_\theta$ over $(X, Z)$. Sampling $(x, z) \sim p_\theta$ is straightforward: first sample $z \sim p$, compute $f_\theta(z)$, and then sample $x \sim p_\theta(\cdot \mid z)$ using $f_\theta(z)$.
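As a minimal sketch of this ancestral sampling procedure (the two-layer network, hidden width, and dimensions below are illustrative assumptions, not part of the definition):

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative sizes (assumptions for this sketch only).
LATENT_DIM, HIDDEN_DIM, DATA_DIM = 2, 16, 4

# A toy "complicated function" f_theta: a small random MLP whose two output
# heads play the roles of f_1(z) (mean) and f_2(z) (log-variance).
theta = {
    "W1": rng.normal(size=(LATENT_DIM, HIDDEN_DIM)),
    "b1": np.zeros(HIDDEN_DIM),
    "W_mu": rng.normal(size=(HIDDEN_DIM, DATA_DIM)),
    "W_logvar": rng.normal(size=(HIDDEN_DIM, DATA_DIM)),
}

def f_theta(z):
    """Map a latent z to the parameters (f_1(z), f_2(z)) of a Gaussian over X."""
    h = np.tanh(z @ theta["W1"] + theta["b1"])
    return h @ theta["W_mu"], h @ theta["W_logvar"]  # mean, log-variance

def sample_joint():
    """Ancestral sampling (x, z) ~ p_theta: z ~ p(z), then x ~ N(f_1(z), e^{f_2(z)})."""
    z = rng.normal(size=LATENT_DIM)                 # simple prior p(z) = N(0, I)
    mean, logvar = f_theta(z)
    x = rng.normal(mean, np.exp(0.5 * logvar))      # std = exp(f_2(z) / 2)
    return x, z

x, z = sample_joint()
```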
In other words, we have a generative model for both the observable and the latent. We consider a distribution $p_\theta$ good if it is a close approximation of $p^*$, namely $p_\theta(X) \approx p^*(X)$.
Since the right-hand side is a distribution over $X$ only, $p_\theta(X)$ must be obtained by marginalizing the latent variable away. However, in general it is impossible to perform the integral $p_\theta(x) = \int p_\theta(x \mid z)\, p(z)\, dz$, forcing us to perform another approximation.
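To see the difficulty concretely, here is a naive Monte Carlo estimate of $p_\theta(x)$ that continues the toy sketch above (it reuses `rng`, `f_theta`, and `LATENT_DIM` from that block, an assumption of this illustration). It is mathematically valid but becomes useless when $Z$ is high-dimensional, which is one practical way of seeing why another approximation is needed:

```python
def gaussian_logpdf(x, mean, logvar):
    """Log-density of a diagonal Gaussian N(mean, diag(exp(logvar))) at x."""
    return -0.5 * np.sum(logvar + np.log(2 * np.pi) + (x - mean) ** 2 / np.exp(logvar))

def naive_marginal_likelihood(x, num_samples=10_000):
    """Monte Carlo estimate of p_theta(x) = ∫ p_theta(x|z) p(z) dz using prior samples.

    When Z is high-dimensional, almost no prior samples land where p_theta(x|z)
    is large, so this estimator's variance blows up -- the integral is intractable
    in practice.
    """
    total = 0.0
    for _ in range(num_samples):
        z = rng.normal(size=LATENT_DIM)                      # z ~ p(z)
        mean, logvar = f_theta(z)
        total += np.exp(gaussian_logpdf(x, mean, logvar))    # p_theta(x|z)
    return total / num_samples
```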
From Bayes' theorem we know that $p_\theta(x) = \frac{p_\theta(x \mid z)\, p(z)}{p_\theta(z \mid x)}$. Here we already know $p_\theta(x \mid z)$ and $p(z)$, so if we can find a good approximation of the posterior $p_\theta(z \mid x)$, then we can recover $p_\theta(x)$. Therefore, we define another distribution family $q_\phi(z \mid x)$ and use it to approximate $p_\theta(z \mid x)$.
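One common concrete choice, assumed here to mirror the decoder sketch above (it reuses `rng`, `DATA_DIM`, `HIDDEN_DIM`, and `LATENT_DIM` from that block), is an amortized Gaussian approximation: a second network maps $x$ to the mean and log-variance of $q_\phi(z \mid x)$:

```python
# An illustrative amortized approximate posterior q_phi(z|x): a second small
# network (parameters phi, with assumed shapes) outputs the mean and
# log-variance of a Gaussian over the latent Z, conditioned on an observation x.
phi = {
    "W1": rng.normal(size=(DATA_DIM, HIDDEN_DIM)),
    "b1": np.zeros(HIDDEN_DIM),
    "W_mu": rng.normal(size=(HIDDEN_DIM, LATENT_DIM)),
    "W_logvar": rng.normal(size=(HIDDEN_DIM, LATENT_DIM)),
}

def q_phi(x):
    """Parameters (mean, log-variance) of the Gaussian q_phi(z|x)."""
    h = np.tanh(x @ phi["W1"] + phi["b1"])
    return h @ phi["W_mu"], h @ phi["W_logvar"]

def sample_posterior_approx(x):
    """Sample z ~ q_phi(z|x), the tractable stand-in for the true posterior p_theta(z|x)."""
    mean, logvar = q_phi(x)
    return mean + np.exp(0.5 * logvar) * rng.normal(size=LATENT_DIM)
```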
In Bayesian language, $X$ is the observed evidence and $Z$ is the latent/unobserved variable. The distribution $p$ over $Z$ is the prior distribution over $Z$, $p_\theta(x \mid z)$ is the likelihood function, and $p_\theta(z \mid x)$ is the posterior distribution over $Z$.