Discriminant Function

Binary Classification

g (x) = w^{T} x + w_{0}

Criteria:

x \in ω_{1} ⟺ g (x) > 0; x \in ω_{2} ⟺ g (x) < 0

Here $w^{T} x + w_{0} = 0$ is the separating surface.
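The decision rule above can be sketched in a few lines of plain Python. The weights `w`, the bias `w0`, the sample points, and the function names `g` and `classify` are all made up for illustration.

```python
def g(x, w, w0):
    """Linear discriminant g(x) = w^T x + w_0."""
    return sum(wi * xi for wi, xi in zip(w, x)) + w0


def classify(x, w, w0):
    """Return 1 for class omega_1, 2 for class omega_2, None on the boundary."""
    value = g(x, w, w0)
    if value > 0:
        return 1
    if value < 0:
        return 2
    return None  # x lies exactly on the separating surface


# Assumed parameters: the separating surface is the line x1 + x2 = 1.
w, w0 = [1.0, 1.0], -1.0
labels = [classify(x, w, w0) for x in ([2.0, 2.0], [0.0, 0.0], [0.5, 0.5])]
```

The third point sits on the line itself, where the criteria assign no class, so the sketch returns `None` for it.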

Loss Function

Classification

L (f, D) = \frac{1}{N} \sum_{i = 1}^{N} \mathbb{I} (f (x_{i}) \neq y_{i})

Regression

L (f, D) = \frac{1}{N} \sum_{i = 1}^{N} (f (x_{i}) - y_{i})^{2}
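As a rough illustration, both losses are averages over the dataset and differ only in the per-sample penalty. The sketch below assumes the dataset is a list of `(x, y)` pairs; the toy data and the two predictors are invented for the example.

```python
def zero_one_loss(f, data):
    """Classification loss: fraction of samples (x, y) that f gets wrong."""
    return sum(1 for x, y in data if f(x) != y) / len(data)


def squared_loss(f, data):
    """Regression loss: mean squared error of f over the samples (x, y)."""
    return sum((f(x) - y) ** 2 for x, y in data) / len(data)


def sign_classifier(x):
    return 1 if x > 0 else 0


def line(x):
    return 2.0 * x + 1.0


# Toy data, chosen only for illustration.
clf_data = [(-2.0, 0), (-1.0, 0), (1.0, 1), (2.0, 0)]
reg_data = [(0.0, 1.0), (1.0, 3.0), (2.0, 4.0)]

clf_loss = zero_one_loss(sign_classifier, clf_data)  # one error out of four
reg_loss = squared_loss(line, reg_data)  # only the last point is off, by 1
```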

Least Squares Method

For a linear function $f (x) = w^{T} x = x^{T} w$, where

w = [w_{0}, w_{1}, \dots, w_{d}]^{T}, x = [1, x_{1}, \dots, x_{d}]^{T}

$x$ is the augmented sample vector.

For $n$ samples $x^{(1)}, \dots, x^{(n)}$ , we define the design matrix

X = \begin{bmatrix} - (x^{(1)})^{T} - \\ - (x^{(2)})^{T} - \\ \vdots \\ - (x^{(n)})^{T} - \end{bmatrix}

and

y = \begin{bmatrix} y^{(1)} \\ y^{(2)} \\ \vdots \\ y^{(n)} \end{bmatrix}

we have

X w - y = \begin{bmatrix} f (x^{(1)}) - y^{(1)} \\ f (x^{(2)}) - y^{(2)} \\ \vdots \\ f (x^{(n)}) - y^{(n)} \end{bmatrix}

we want to minimize

E (w) = \frac{1}{n} \sum_{j = 1}^{n} (f (x^{(j)}) - y^{(j)})^{2} = \frac{1}{n} (X w - y)^{T} (X w - y)

the solution is

w^{*} = (X^{T} X)^{- 1} X^{T} y
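A minimal sketch of the closed-form solution, using only the Python standard library. Instead of forming $(X^{T} X)^{-1}$ explicitly, it solves the equivalent linear system $(X^{T} X) w = X^{T} y$ by Gaussian elimination, which gives the same $w^{*}$ whenever $X^{T} X$ is invertible. The helper names and the toy data are assumptions made for the example.

```python
def transpose(A):
    return [list(col) for col in zip(*A)]


def matmul(A, B):
    Bt = transpose(B)
    return [[sum(a * b for a, b in zip(row, col)) for col in Bt] for row in A]


def solve(A, b):
    """Solve A w = b by Gaussian elimination with partial pivoting."""
    n = len(A)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]  # augmented matrix [A | b]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        if abs(M[pivot][col]) < 1e-12:
            raise ValueError("X^T X is singular, so w* is not unique")
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            factor = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= factor * M[col][c]
    w = [0.0] * n
    for i in reversed(range(n)):  # back substitution
        tail = sum(M[i][j] * w[j] for j in range(i + 1, n))
        w[i] = (M[i][n] - tail) / M[i][i]
    return w


def least_squares(samples, targets):
    """Return w* solving (X^T X) w = X^T y."""
    X = [[1.0] + list(x) for x in samples]  # rows are augmented sample vectors
    Xt = transpose(X)
    XtX = matmul(Xt, X)
    Xty = [sum(a * t for a, t in zip(row, targets)) for row in Xt]
    return solve(XtX, Xty)


# Noise-free toy data generated from y = 1 + 2 x (an assumed example).
w_star = least_squares([[0.0], [1.0], [2.0], [3.0]], [1.0, 3.0, 5.0, 7.0])
```

Because the data lie exactly on a line, `w_star` recovers the intercept and slope up to floating-point rounding.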

Linear Discriminant Analysis

When working with high-dimensional datasets, dimensionality reduction makes data exploration and modeling more efficient. One such technique is Linear Discriminant Analysis (LDA), which reduces the dimensionality of the data while retaining the features most useful for classification. It works by finding the linear combinations of features that best separate the classes in the dataset.

The figure shows an example where the classes (black and green circles) are not linearly separable. LDA attempts to separate them using the red dashed line. It combines both axes (X and Y) into a new axis, chosen to maximize the distance between the means of the two classes while minimizing the variation within each class. Projecting onto this axis transforms the dataset into a space where the classes are better separated.
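The trade-off described above, large distance between class means and small within-class variation, is what the classical two-class Fisher criterion formalizes. These notes do not state the formula; the sketch below assumes the standard result that the best projection direction is proportional to $S_{W}^{-1} (m_{a} - m_{b})$, where $S_{W}$ is the within-class scatter matrix and $m_{a}$, $m_{b}$ are the class means. It handles 2-D data only, and the points and function names are invented.

```python
def mean(points):
    n = len(points)
    return [sum(p[0] for p in points) / n, sum(p[1] for p in points) / n]


def scatter(points, m):
    """2x2 scatter matrix of one class: sum of (p - m)(p - m)^T."""
    s = [[0.0, 0.0], [0.0, 0.0]]
    for p in points:
        d = [p[0] - m[0], p[1] - m[1]]
        for i in range(2):
            for j in range(2):
                s[i][j] += d[i] * d[j]
    return s


def fisher_direction(class_a, class_b):
    """Direction w proportional to S_W^{-1} (m_a - m_b) for 2-D data."""
    m_a, m_b = mean(class_a), mean(class_b)
    s_a, s_b = scatter(class_a, m_a), scatter(class_b, m_b)
    sw = [[s_a[i][j] + s_b[i][j] for j in range(2)] for i in range(2)]
    det = sw[0][0] * sw[1][1] - sw[0][1] * sw[1][0]
    if abs(det) < 1e-12:
        raise ValueError("within-class scatter matrix is singular")
    diff = [m_a[0] - m_b[0], m_a[1] - m_b[1]]
    # Apply the closed-form inverse of a 2x2 matrix to diff.
    return [
        (sw[1][1] * diff[0] - sw[0][1] * diff[1]) / det,
        (-sw[1][0] * diff[0] + sw[0][0] * diff[1]) / det,
    ]


def project(point, w):
    """Coordinate of a 2-D point along the new axis w."""
    return point[0] * w[0] + point[1] * w[1]


# Made-up 2-D points for two classes.
class_a = [[1.0, 2.0], [2.0, 3.0], [3.0, 3.0], [2.0, 1.0]]
class_b = [[6.0, 5.0], [7.0, 8.0], [8.0, 6.0], [7.0, 6.0]]
w = fisher_direction(class_a, class_b)
proj_a = [project(p, w) for p in class_a]
proj_b = [project(p, w) for p in class_b]
```

For this toy data the projected values of the two classes do not overlap, so a single threshold on the new axis separates them.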

Logistic Regression

We could approach the classification problem ignoring the fact that $y$ is discrete-valued, and use our old linear regression algorithm to try to predict $y$ given $x$. However, it is easy to construct examples where this method performs very poorly. Intuitively, it also doesn't make sense for $h_{θ} (x)$ to take values larger than $1$ or smaller than $0$ when we know that $y \in \{0, 1\}$. To fix this, let's change the form for our hypotheses $h_{θ} (x)$. We will choose

h_{θ} (x) = g (θ^{T} x) = \frac{1}{1 + e ^{- θ^{T} x}},

where

g (z) = \frac{1}{1 + e ^{- z}}

is called the logistic function or the sigmoid function.
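A small sketch of the logistic function and the resulting hypothesis. The two-branch form is a common precaution rather than something these notes prescribe: `math.exp` raises `OverflowError` for large positive arguments, so the sketch never exponentiates a positive number.

```python
import math


def sigmoid(z):
    """Logistic function g(z) = 1 / (1 + exp(-z)), evaluated without overflow."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    # For negative z, exp(-z) can overflow, so use the equivalent form
    # exp(z) / (1 + exp(z)).
    ez = math.exp(z)
    return ez / (1.0 + ez)


def h(theta, x):
    """Hypothesis h_theta(x) = g(theta^T x)."""
    return sigmoid(sum(t * xi for t, xi in zip(theta, x)))
```

For example, `sigmoid(0.0)` is `0.5`, and the output stays between 0 and 1 for every input, which is the property the linear regression hypothesis lacked.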

Let us assume that

\begin{aligned}
P (y = 1 ∣ x; θ) &= h_{θ} (x) \\
P (y = 0 ∣ x; θ) &= 1 - h_{θ} (x)
\end{aligned}

Note that this can be written more compactly as

p (y ∣ x; θ) = (h_{θ} (x))^{y} (1 - h_{θ} (x))^{1 - y}

Assuming that the $n$ training examples were generated independently, we can then write down the likelihood of the parameters as

L (θ) = p (y ∣ X; θ) = \prod_{i = 1}^{n} p (y^{(i)} ∣ x^{(i)}; θ) = \prod_{i = 1}^{n} (h_{θ} (x^{(i)}))^{y^{(i)}} (1 - h_{θ} (x^{(i)}))^{1 - y^{(i)}}

As before, it will be easier to maximize the log likelihood:

ℓ (θ) = \log L (θ) = \sum_{i = 1}^{n} \left[ y^{(i)} \log h_{θ} (x^{(i)}) + (1 - y^{(i)}) \log (1 - h_{θ} (x^{(i)})) \right]
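To make the two quantities concrete, the sketch below evaluates both the likelihood (a product) and the log likelihood (a sum) on a tiny made-up dataset. It assumes each `x` already carries a leading 1 for the intercept and that the values of $θ^{T} x$ are moderate, so the plain sigmoid formula does not overflow.

```python
import math


def h(theta, x):
    """h_theta(x) = 1 / (1 + exp(-theta^T x))."""
    z = sum(t * xi for t, xi in zip(theta, x))
    return 1.0 / (1.0 + math.exp(-z))


def likelihood(theta, data):
    """L(theta): product over examples of h^y * (1 - h)^(1 - y)."""
    value = 1.0
    for x, y in data:
        p = h(theta, x)
        value *= p ** y * (1.0 - p) ** (1 - y)
    return value


def log_likelihood(theta, data):
    """l(theta): sum over examples of y log h + (1 - y) log(1 - h)."""
    total = 0.0
    for x, y in data:
        p = h(theta, x)
        total += y * math.log(p) + (1 - y) * math.log(1.0 - p)
    return total


# Made-up data; each x starts with 1 for the intercept term.
data = [([1.0, -1.0], 0), ([1.0, 0.5], 1), ([1.0, 2.0], 1)]
theta = [0.0, 1.0]
ll = log_likelihood(theta, data)
```

On larger datasets the product underflows toward zero quickly, which is one practical reason to work with the sum of logs.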

Similar to our derivation in the case of linear regression, we can use gradient ascent. Written in vectorial notation, our updates will therefore be given by $θ := θ + α \nabla_{θ} ℓ (θ)$ . (Note the positive rather than negative sign in the update formula, since we're maximizing, rather than minimizing, a function now.) Let's start by working with just one training example $(x, y)$ , and take derivatives to derive the stochastic gradient ascent rule:

\begin{aligned}
\frac{\partial}{\partial θ_{j}} ℓ (θ) &= \left( y \frac{1}{g (θ^{T} x)} - (1 - y) \frac{1}{1 - g (θ^{T} x)} \right) \frac{\partial}{\partial θ_{j}} g (θ^{T} x) \\
&= \left( y \frac{1}{g (θ^{T} x)} - (1 - y) \frac{1}{1 - g (θ^{T} x)} \right) g (θ^{T} x) (1 - g (θ^{T} x)) \frac{\partial}{\partial θ_{j}} θ^{T} x \\
&= \left( y (1 - g (θ^{T} x)) - (1 - y) g (θ^{T} x) \right) x_{j} \\
&= (y - h_{θ} (x)) x_{j}
\end{aligned}

This therefore gives us the stochastic gradient ascent rule

θ_{j} := θ_{j} + α (y^{(i)} - h_{θ} (x^{(i)})) x_{j}^{(i)}
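The update rule translates almost line for line into code. The sketch below is one possible training loop, not a reference implementation: the learning rate `alpha`, the number of `epochs`, the shuffling `seed`, and the toy dataset are all assumptions made for the example.

```python
import math
import random


def sigmoid(z):
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)


def predict_proba(theta, x):
    """h_theta(x) = g(theta^T x)."""
    return sigmoid(sum(t * xi for t, xi in zip(theta, x)))


def fit_logistic_sga(data, alpha=0.1, epochs=200, seed=0):
    """Stochastic gradient ascent on the log likelihood.

    data is a list of (x, y) pairs, where x already includes the leading 1
    for the intercept and y is 0 or 1.
    """
    rng = random.Random(seed)
    theta = [0.0] * len(data[0][0])
    order = list(range(len(data)))
    for _ in range(epochs):
        rng.shuffle(order)  # visit the examples in a new random order
        for i in order:
            x, y = data[i]
            error = y - predict_proba(theta, x)
            # theta_j := theta_j + alpha * (y - h_theta(x)) * x_j
            theta = [t + alpha * error * xj for t, xj in zip(theta, x)]
    return theta


# Toy 1-D data with one overlapping pair near zero (made up for illustration).
data = [([1.0, -2.0], 0), ([1.0, -1.5], 0), ([1.0, -1.0], 0), ([1.0, -0.3], 1),
        ([1.0, 0.3], 0), ([1.0, 1.0], 1), ([1.0, 1.5], 1), ([1.0, 2.0], 1)]
theta = fit_logistic_sga(data)
```

After training, the slope `theta[1]` is positive, so larger inputs receive a higher predicted probability of class 1.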

Perceptron

Consider modifying the logistic regression method to force it to output values that are either $0$ or $1$ exactly. To do so, it seems natural to change the definition of $g$ to be the threshold function

g (z) = \begin{cases} 1 & \text{if } z \geq 0 \\ 0 & \text{if } z < 0 \end{cases}

If we then let $h_{θ} (x) = g (θ^{T} x)$ as before but using this modified definition of $g$ , and if we use the update rule

θ_{j} := θ_{j} + α (y^{(i)} - h_{θ} (x^{(i)})) x_{j}^{(i)}

then we have the perceptron learning algorithm.
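A minimal sketch of the perceptron learning algorithm built from the threshold function and the update rule above. The stopping condition (end after an epoch with no mistakes), the cap `max_epochs`, the step size `alpha`, and the AND dataset are assumptions added for the example.

```python
def threshold(z):
    """Threshold function g(z): 1 if z >= 0, else 0."""
    return 1 if z >= 0 else 0


def predict(theta, x):
    """h_theta(x) = g(theta^T x) with the threshold g."""
    return threshold(sum(t * xi for t, xi in zip(theta, x)))


def fit_perceptron(data, alpha=1.0, max_epochs=100):
    """Apply the perceptron update until an epoch makes no mistakes.

    data is a list of (x, y) pairs with a leading 1 in x and y in {0, 1}.
    """
    theta = [0.0] * len(data[0][0])
    for _ in range(max_epochs):
        mistakes = 0
        for x, y in data:
            error = y - predict(theta, x)  # one of -1, 0, +1
            if error != 0:
                mistakes += 1
                theta = [t + alpha * error * xj for t, xj in zip(theta, x)]
        if mistakes == 0:
            break
    return theta


# Logical AND is linearly separable, so the loop stops well before max_epochs.
and_data = [([1.0, 0.0, 0.0], 0), ([1.0, 0.0, 1.0], 0),
            ([1.0, 1.0, 0.0], 0), ([1.0, 1.0, 1.0], 1)]
theta = fit_perceptron(and_data)
predictions = [predict(theta, x) for x, _ in and_data]
```

Since the error is always $-1$, $0$, or $+1$, the parameters change only on misclassified examples, unlike logistic regression where every example nudges them.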

Multi-Class Classification

We introduce $k$ groups of parameters $θ_{1}, \dots, θ_{k}$, each of them being a vector in $\mathbb{R}^{d}$. Intuitively, we would like to use $θ_{1}^{T} x, \dots, θ_{k}^{T} x$ to represent the probabilities $P (y = 1 ∣ x; θ), \dots, P (y = k ∣ x; θ)$. However, there are two issues with such a direct approach. First, $θ_{j}^{T} x$ is not necessarily within $[0, 1]$. Second, the sum of the $θ_{j}^{T} x$'s is not necessarily $1$. Instead, we will define the softmax function

\mathrm{softmax} (t_{1}, \dots, t_{k}) = \begin{bmatrix} \frac{\exp (t_{1})}{\sum_{j = 1}^{k} \exp (t_{j})} \\ \vdots \\ \frac{\exp (t_{k})}{\sum_{j = 1}^{k} \exp (t_{j})} \end{bmatrix}

The inputs to the softmax function, the vector $t$ here, are often called logits. The output of the softmax function is always a probability vector whose entries are non-negative and sum up to $1$ .
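A short sketch of the softmax function. Subtracting the largest logit before exponentiating is a common numerical precaution that is not part of the definition above; it leaves the result unchanged because the common factor cancels between numerator and denominator. The example logits are arbitrary.

```python
import math


def softmax(t):
    """Map logits t_1..t_k to a probability vector."""
    shift = max(t)  # subtracting a constant from every logit changes nothing
    exps = [math.exp(ti - shift) for ti in t]
    total = sum(exps)
    return [e / total for e in exps]


probs = softmax([2.0, 1.0, 0.1])
large = softmax([1000.0, 1000.0])  # exp(1000) alone would overflow
```

Larger logits receive larger probabilities, and equal logits share the probability mass evenly.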

Letting $(t_{1}, \dots, t_{k}) = (θ_{1}^{T} x, \dots, θ_{k}^{T} x)$, we obtain the following probabilistic model

\begin{bmatrix} P (y = 1 ∣ x; θ) \\ \vdots \\ P (y = k ∣ x; θ) \end{bmatrix} = \mathrm{softmax} (t_{1}, \dots, t_{k})

that is

P (y = i ∣ x; θ) = \frac{\exp (t_{i})}{\sum_{j = 1}^{k} \exp (t_{j})} = \frac{\exp (θ_{i}^{T} x)}{\sum_{j = 1}^{k} \exp (θ_{j}^{T} x)}

The negative log-likelihood of a single example $(x, y)$, $y \in \{1, 2, \dots, k\}$, is

- \log p (y ∣ x, θ) = - \log \left( \frac{\exp (t_{y})}{\sum_{j = 1}^{k} \exp (t_{j})} \right)

Thus the loss function is given as

ℓ (θ) = \sum_{i = 1}^{n} - \log \left( \frac{\exp (θ_{y^{(i)}}^{T} x^{(i)})}{\sum_{j = 1}^{k} \exp (θ_{j}^{T} x^{(i)})} \right)

It's convenient to define the cross-entropy loss $ℓ_{ce} : \mathbb{R}^{k} \times \{1, 2, \dots, k\} \to \mathbb{R}_{\geq 0}$, which modularizes the complex equation above:

ℓ_{ce} ((t_{1}, \dots, t_{k}), y) = - \log \left( \frac{\exp (t_{y})}{\sum_{j = 1}^{k} \exp (t_{j})} \right)

With this notation, we can simply rewrite the total loss as

ℓ (θ) = \sum_{i = 1}^{n} ℓ_{ce} ((θ_{1}^{T} x^{(i)}, \dots, θ_{k}^{T} x^{(i)}), y^{(i)})
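The modular form maps directly onto two small functions: one for the cross-entropy of a single logit vector, and one that sums it over the dataset. The sketch computes $- \log$ of the softmax entry as $\log \sum_{j} \exp (t_{j}) - t_{y}$, which is algebraically the same quantity, and shifts by the largest logit for numerical safety. The parameter vectors and data points are invented.

```python
import math


def cross_entropy(t, y):
    """l_ce((t_1..t_k), y) = -log(exp(t_y) / sum_j exp(t_j)), with y in 1..k."""
    shift = max(t)
    log_sum_exp = shift + math.log(sum(math.exp(tj - shift) for tj in t))
    return log_sum_exp - t[y - 1]  # labels are 1-based, Python lists 0-based


def total_loss(thetas, data):
    """Sum of cross-entropy losses with logits t_j = theta_j^T x."""
    loss = 0.0
    for x, y in data:
        logits = [sum(tj * xj for tj, xj in zip(theta, x)) for theta in thetas]
        loss += cross_entropy(logits, y)
    return loss


# k = 3 classes, d = 2 features; all numbers are made up.
thetas = [[1.0, 0.0], [0.0, 1.0], [-1.0, -1.0]]
data = [([2.0, 0.0], 1), ([0.0, 2.0], 2), ([-1.0, -1.0], 3)]
loss = total_loss(thetas, data)
uniform = cross_entropy([0.0, 0.0, 0.0], 2)  # equal logits give log(3)
```

With equal logits every class has probability $1 / k$, so the loss of a single example is $\log k$ regardless of the label.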

Support Vector Machine

Support Vector Machine - SVM
