Motivation

Consider a classification problem in which we want to learn to distinguish between elephants ($y = 1$) and dogs ($y = 0$), based on some features of an animal. Given a training set, an algorithm like logistic regression or the perceptron algorithm (basically) tries to find a straight line, that is, a decision boundary, that separates the elephants and dogs. Then, to classify a new animal as either an elephant or a dog, it checks on which side of the decision boundary it falls, and makes its prediction accordingly.

Here’s a different approach. First, looking at elephants, we can build a model of what elephants look like. Then, looking at dogs, we can build a separate model of what dogs look like. Finally, to classify a new animal, we can match the new animal against the elephant model, and match it against the dog model, to see whether the new animal looks more like the elephants or more like the dogs we had seen in the training set.

Algorithms that try to learn $p(y|x)$ directly (such as logistic regression), or algorithms that try to learn mappings directly from the space of inputs $\mathcal{X}$ to the labels $\{0, 1\}$ (such as the perceptron algorithm), are called discriminative learning algorithms. Here, we'll talk about algorithms that instead try to model $p(x|y)$ (and $p(y)$). These algorithms are called generative learning algorithms. For instance, if $y$ indicates whether an example is a dog (0) or an elephant (1), then $p(x \mid y = 0)$ models the distribution of dogs' features, and $p(x \mid y = 1)$ models the distribution of elephants' features.

After modeling $p(y)$ (the class priors) and $p(x|y)$, our algorithm can then use Bayes' theorem to derive the posterior distribution on $y$ given $x$:

$$p(y|x) = \frac{p(x|y)\,p(y)}{p(x)}.$$

Here, the denominator is given by $p(x) = p(x \mid y = 1)\,p(y = 1) + p(x \mid y = 0)\,p(y = 0)$. Actually, if we are calculating $p(y|x)$ in order to make a prediction, then we don't actually need to calculate the denominator, since

$$\arg\max_y p(y|x) = \arg\max_y \frac{p(x|y)\,p(y)}{p(x)} = \arg\max_y p(x|y)\,p(y).$$
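To make the arg max prediction concrete, here is a minimal sketch in Python. The class-conditional densities and prior below are hypothetical placeholders (in practice they would come from a fitted generative model such as GDA):

```python
import numpy as np
from scipy.stats import multivariate_normal

# Hypothetical class-conditional densities p(x|y) and class prior p(y = 1);
# in a real generative classifier these would be learned from the training set.
p_x_given_y0 = multivariate_normal(mean=[0.0, 0.0], cov=np.eye(2))
p_x_given_y1 = multivariate_normal(mean=[2.0, 2.0], cov=np.eye(2))
phi = 0.5  # p(y = 1)

def predict(x):
    # arg max_y p(x|y) p(y): the denominator p(x) is the same for both
    # classes, so it never needs to be computed.
    score0 = p_x_given_y0.pdf(x) * (1 - phi)
    score1 = p_x_given_y1.pdf(x) * phi
    return int(score1 > score0)

print(predict([1.8, 2.1]))  # prints 1: this point looks more like class 1
```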

Gaussian Discriminant Analysis (GDA)

Multivariate normal distribution

The multivariate Gaussian distribution in $d$ dimensions is parameterized by a mean vector $\mu \in \mathbb{R}^d$ and a covariance matrix $\Sigma \in \mathbb{R}^{d \times d}$, where $\Sigma$ is symmetric and positive semi-definite. Also written $\mathcal{N}(\mu, \Sigma)$, its density is given by:

$$p(x; \mu, \Sigma) = \frac{1}{(2\pi)^{d/2}\,|\Sigma|^{1/2}} \exp\left( -\frac{1}{2} (x - \mu)^T \Sigma^{-1} (x - \mu) \right).$$
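As a quick sanity check on this formula, the sketch below evaluates the density directly with numpy and compares the result against scipy's built-in implementation (the particular $\mu$ and $\Sigma$ are arbitrary illustrative values):

```python
import numpy as np
from scipy.stats import multivariate_normal

def gaussian_density(x, mu, Sigma):
    """Evaluate the multivariate normal density p(x; mu, Sigma)."""
    d = mu.shape[0]
    diff = x - mu
    norm_const = 1.0 / ((2 * np.pi) ** (d / 2) * np.sqrt(np.linalg.det(Sigma)))
    return norm_const * np.exp(-0.5 * diff @ np.linalg.solve(Sigma, diff))

mu = np.array([1.0, -1.0])
Sigma = np.array([[2.0, 0.5], [0.5, 1.0]])  # symmetric, positive definite
x = np.array([0.5, 0.0])

print(gaussian_density(x, mu, Sigma))
print(multivariate_normal.pdf(x, mean=mu, cov=Sigma))  # should agree
```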

The GDA model

When we have a classification problem in which the input features $x$ are continuous-valued random variables, we can then use the Gaussian Discriminant Analysis (GDA) model, which models $p(x|y)$ using a multivariate normal distribution. The model is:

$$\begin{aligned}
y &\sim \mathrm{Bernoulli}(\phi) \\
x \mid y = 0 &\sim \mathcal{N}(\mu_0, \Sigma) \\
x \mid y = 1 &\sim \mathcal{N}(\mu_1, \Sigma)
\end{aligned}$$

where "Bernoulli" denotes the Bernoulli Distribution. Here, the parameters of our model are and . (Note that while there’re two different mean vectors and , this model is usually applied using only one covariance matrix .) The log-likelihood of the data is given by

By maximizing $\ell$ with respect to the parameters, we find the maximum likelihood estimates of the parameters to be:

$$\begin{aligned}
\phi &= \frac{1}{n} \sum_{i=1}^{n} 1\{y^{(i)} = 1\} \\
\mu_0 &= \frac{\sum_{i=1}^{n} 1\{y^{(i)} = 0\}\, x^{(i)}}{\sum_{i=1}^{n} 1\{y^{(i)} = 0\}} \\
\mu_1 &= \frac{\sum_{i=1}^{n} 1\{y^{(i)} = 1\}\, x^{(i)}}{\sum_{i=1}^{n} 1\{y^{(i)} = 1\}} \\
\Sigma &= \frac{1}{n} \sum_{i=1}^{n} \left(x^{(i)} - \mu_{y^{(i)}}\right)\left(x^{(i)} - \mu_{y^{(i)}}\right)^T
\end{aligned}$$

where $n$ is the number of training examples and $1\{\cdot\}$ is the indicator function.
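These estimates translate almost line for line into numpy. The sketch below assumes a design matrix `X` of shape `(n, d)` and a 0/1 label vector `y`; the function name `fit_gda` and the variable names are ours, introduced just for illustration:

```python
import numpy as np

def fit_gda(X, y):
    """Maximum likelihood estimates of the GDA parameters (shared covariance)."""
    n, _ = X.shape
    phi = np.mean(y == 1)             # fraction of positive examples
    mu0 = X[y == 0].mean(axis=0)      # mean of class-0 examples
    mu1 = X[y == 1].mean(axis=0)      # mean of class-1 examples
    # Shared covariance: average outer product of each example's deviation
    # from its own class mean.
    class_means = np.where((y == 1)[:, None], mu1, mu0)
    diffs = X - class_means
    Sigma = diffs.T @ diffs / n
    return phi, mu0, mu1, Sigma
```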

Pictorially, what the algorithm is doing can be seen as follows: shown in the figure are the training set, as well as the contours of the two Gaussian distributions that have been fit to the data in each of the two classes. Note that the two Gaussians have contours of the same shape and orientation, since they share a covariance matrix $\Sigma$, but they have different means $\mu_0$ and $\mu_1$.

Discussion

  • Form of $p(y|x)$: If $p(x|y)$ is a multivariate Gaussian (with shared $\Sigma$), then $p(y = 1 \mid x)$ necessarily follows a logistic function. Specifically, $p(y = 1 \mid x; \phi, \mu_0, \mu_1, \Sigma) = \frac{1}{1 + \exp(-\theta^T x)}$, where $\theta$ is an appropriate function of $\phi, \Sigma, \mu_0, \mu_1$ (viewing $x$ as including an intercept term $x_0 = 1$); this is the form logistic regression uses. (A numerical check of this fact is sketched after this list.)
  • Modeling Assumptions: GDA makes stronger assumptions about the data (specifically, that $p(x|y)$ is Gaussian) than logistic regression. Logistic regression only assumes that $p(y = 1 \mid x)$ takes the form of a logistic function, which can arise from various distributions of $p(x|y)$ (e.g., Gaussian, Poisson).
  • Data Efficiency and Accuracy: When GDA's modeling assumptions are correct (or approximately correct), GDA is more data-efficient (requires less training data) and can find better fits, especially with small training datasets. In the limit of very large training sets, GDA is asymptotically efficient.
  • Non-Gaussian Data: If the data is non-Gaussian, logistic regression will often perform better than GDA in the limit of large datasets. If we were to use GDA on such data, fitting Gaussian distributions to non-Gaussian data, the results would be less predictable, and GDA may (or may not) do well.
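The first bullet can be checked numerically. The sketch below picks arbitrary illustrative GDA parameters (with a shared $\Sigma$), computes the posterior once via Bayes' rule and once via a logistic function of a linear score, and confirms the two agree; the closed-form $\theta$ used here is the one implied by shared-covariance Gaussians, written with an explicit intercept $\theta_0$:

```python
import numpy as np
from scipy.stats import multivariate_normal

# Arbitrary illustrative GDA parameters with a shared covariance matrix.
phi = 0.4
mu0 = np.array([0.0, 0.0])
mu1 = np.array([2.0, 1.0])
Sigma = np.array([[1.0, 0.3], [0.3, 2.0]])
Sigma_inv = np.linalg.inv(Sigma)

# Logistic-regression-style parameters implied by the GDA parameters.
theta = Sigma_inv @ (mu1 - mu0)
theta0 = 0.5 * (mu0 @ Sigma_inv @ mu0 - mu1 @ Sigma_inv @ mu1) + np.log(phi / (1 - phi))

x = np.array([1.5, 0.5])

# Posterior p(y = 1 | x) via Bayes' rule with the Gaussian class-conditionals.
p1 = phi * multivariate_normal.pdf(x, mean=mu1, cov=Sigma)
p0 = (1 - phi) * multivariate_normal.pdf(x, mean=mu0, cov=Sigma)
posterior_bayes = p1 / (p0 + p1)

# The same posterior as a logistic function of a linear score in x.
posterior_logistic = 1.0 / (1.0 + np.exp(-(theta @ x + theta0)))

print(posterior_bayes, posterior_logistic)  # the two values match
```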