Discriminant Function

Binary Classification

Criteria: given a discriminant function $g(\mathbf{x})$, assign $\mathbf{x}$ to class $C_1$ if $g(\mathbf{x}) > 0$ and to class $C_2$ if $g(\mathbf{x}) < 0$.

The separating surface is the set of points where $g(\mathbf{x}) = 0$; for a linear discriminant $g(\mathbf{x}) = \mathbf{w}^T \mathbf{x} + w_0$ this is a hyperplane with normal vector $\mathbf{w}$.

Loss Function

Classification: the 0-1 loss, $\ell(\hat{y}, y) = \mathbb{1}[\hat{y} \neq y]$.

Regression: the squared loss, $\ell(\hat{y}, y) = (\hat{y} - y)^2$.

Least Squares Method

We consider a linear function $f(\mathbf{x}) = \mathbf{w}^T \tilde{\mathbf{x}}$, where

$$\tilde{\mathbf{x}} = \begin{bmatrix} 1 \\ \mathbf{x} \end{bmatrix}$$

is the augmented sample vector.

For samples $(\mathbf{x}_1, y_1), \dots, (\mathbf{x}_N, y_N)$, we define the design matrix

$$X = \begin{bmatrix} \tilde{\mathbf{x}}_1^T \\ \vdots \\ \tilde{\mathbf{x}}_N^T \end{bmatrix} \in \mathbb{R}^{N \times (d+1)}$$

and the target vector

$$\mathbf{y} = \begin{bmatrix} y_1 \\ \vdots \\ y_N \end{bmatrix},$$

so we have the vector of predictions $X\mathbf{w}$ on all samples. We want to minimize the sum-of-squares error

$$E(\mathbf{w}) = \lVert X\mathbf{w} - \mathbf{y} \rVert^2 = (X\mathbf{w} - \mathbf{y})^T (X\mathbf{w} - \mathbf{y}).$$

Setting the gradient $\nabla_{\mathbf{w}} E(\mathbf{w}) = 2 X^T (X\mathbf{w} - \mathbf{y})$ to zero, the solution is

$$\mathbf{w}^{*} = \left( X^T X \right)^{-1} X^T \mathbf{y},$$

i.e. the normal equations, where $\left( X^T X \right)^{-1} X^T$ is the pseudo-inverse of $X$.
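As a quick illustration, here is a minimal NumPy sketch of this solution. The toy data, noise level, and variable names are made up for the example; `np.linalg.lstsq` computes the same least-squares solution more stably than forming the inverse explicitly.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 2))                        # raw samples, N = 100, d = 2
y = X @ np.array([1.5, -2.0]) + 0.5 + 0.1 * rng.normal(size=100)

# Augment each sample with a leading 1 so the bias is absorbed into w.
X_aug = np.hstack([np.ones((100, 1)), X])

# w* = (X^T X)^{-1} X^T y; lstsq solves the least-squares problem directly.
w, *_ = np.linalg.lstsq(X_aug, y, rcond=None)
print(w)                                             # approximately [0.5, 1.5, -2.0]
```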

Linear Discriminant Analysis

When working with high-dimensional datasets, it is often useful to apply dimensionality reduction to make data exploration and modeling more efficient. One such technique is Linear Discriminant Analysis (LDA), which reduces the dimensionality of the data while retaining the information most relevant for classification. It works by finding the linear combinations of features that best separate the classes in the dataset.

The figure shows an example where the two classes (black and green circles) are not linearly separable in the original space; LDA attempts to separate them with the red dashed line. It uses both original axes (X and Y) to construct a new axis that maximizes the distance between the two class means while minimizing the variance within each class. Projecting the data onto this axis transforms the dataset into a space where the classes are better separated.
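As a sketch of the idea (not code from these notes; the toy data and names are illustrative), two-class Fisher LDA can be computed directly from the class means and the within-class scatter matrix:

```python
import numpy as np

rng = np.random.default_rng(0)
X0 = rng.normal(loc=[0.0, 0.0], size=(50, 2))              # class 0 samples
X1 = rng.normal(loc=[3.0, 2.0], size=(50, 2))              # class 1 samples

m0, m1 = X0.mean(axis=0), X1.mean(axis=0)                  # class means
S_w = (X0 - m0).T @ (X0 - m0) + (X1 - m1).T @ (X1 - m1)    # within-class scatter

w = np.linalg.solve(S_w, m1 - m0)                          # Fisher direction S_w^{-1} (m1 - m0)
z0, z1 = X0 @ w, X1 @ w                                    # 1-D projections of each class
threshold = (z0.mean() + z1.mean()) / 2                    # decide class 1 if x @ w > threshold
print((z1 > threshold).mean(), (z0 <= threshold).mean())   # per-class training accuracy
```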

Logistic Regression

We could approach the classification problem ignoring the fact that $y$ is discrete-valued, and use our old linear regression algorithm to try to predict $y$ given $x$. However, it is easy to construct examples where this method performs very poorly. Intuitively, it also doesn't make sense for $h_\theta(x)$ to take values larger than $1$ or smaller than $0$ when we know that $y \in \{0, 1\}$. To fix this, let's change the form for our hypotheses $h_\theta(x)$. We will choose

$$h_\theta(x) = g(\theta^T x) = \frac{1}{1 + e^{-\theta^T x}},$$

where

$$g(z) = \frac{1}{1 + e^{-z}}$$

is called the logistic function or the sigmoid function.

Let us assume that

$$P(y = 1 \mid x; \theta) = h_\theta(x), \qquad P(y = 0 \mid x; \theta) = 1 - h_\theta(x).$$

Note that this can be written more compactly as

$$p(y \mid x; \theta) = \left( h_\theta(x) \right)^{y} \left( 1 - h_\theta(x) \right)^{1 - y}.$$

Assuming that the $n$ training examples were generated independently, we can then write down the likelihood of the parameters as

$$L(\theta) = \prod_{i=1}^{n} p\!\left( y^{(i)} \mid x^{(i)}; \theta \right) = \prod_{i=1}^{n} \left( h_\theta(x^{(i)}) \right)^{y^{(i)}} \left( 1 - h_\theta(x^{(i)}) \right)^{1 - y^{(i)}}.$$

As before, it will be easier to maximize the log likelihood:

$$\ell(\theta) = \log L(\theta) = \sum_{i=1}^{n} y^{(i)} \log h_\theta(x^{(i)}) + \left( 1 - y^{(i)} \right) \log\!\left( 1 - h_\theta(x^{(i)}) \right).$$

Similar to our derivation in the case of linear regression, we can use gradient ascent. Written in vectorial notation, our updates will therefore be given by $\theta := \theta + \alpha \nabla_\theta \ell(\theta)$. (Note the positive rather than negative sign in the update formula, since we're maximizing, rather than minimizing, a function now.) Let's start by working with just one training example $(x, y)$, and take derivatives to derive the stochastic gradient ascent rule:

$$\begin{aligned}
\frac{\partial}{\partial \theta_j} \ell(\theta) &= \left( y \, \frac{1}{g(\theta^T x)} - (1 - y) \, \frac{1}{1 - g(\theta^T x)} \right) \frac{\partial}{\partial \theta_j} g(\theta^T x) \\
&= \left( y \, \frac{1}{g(\theta^T x)} - (1 - y) \, \frac{1}{1 - g(\theta^T x)} \right) g(\theta^T x) \left( 1 - g(\theta^T x) \right) \frac{\partial}{\partial \theta_j} \theta^T x \\
&= \left( y \left( 1 - g(\theta^T x) \right) - (1 - y) \, g(\theta^T x) \right) x_j \\
&= \left( y - h_\theta(x) \right) x_j,
\end{aligned}$$

where we used the fact that $g'(z) = g(z)\left( 1 - g(z) \right)$. This therefore gives us the stochastic gradient ascent rule

$$\theta_j := \theta_j + \alpha \left( y^{(i)} - h_\theta\!\left( x^{(i)} \right) \right) x_j^{(i)}.$$
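A minimal NumPy sketch of this rule run on synthetic data (the step size, epoch count, and variable names are arbitrary choices for the illustration, not part of the notes):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)
n = 200
X = np.hstack([np.ones((n, 1)), rng.normal(size=(n, 2))])    # x^(i), augmented with an intercept
theta_true = np.array([-0.5, 2.0, -1.0])
y = (rng.random(n) < sigmoid(X @ theta_true)).astype(float)  # labels sampled from the model

theta = np.zeros(3)
alpha = 0.1
for epoch in range(50):
    for i in rng.permutation(n):
        # theta_j := theta_j + alpha * (y^(i) - h_theta(x^(i))) * x_j^(i), for all j at once
        theta += alpha * (y[i] - sigmoid(X[i] @ theta)) * X[i]
print(theta)   # should land in the neighborhood of theta_true
```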

Perceptron

Consider modifying the logistic regression method to "force" it to output values that are either $0$ or $1$ exactly. To do so, it seems natural to change the definition of $g$ to be the threshold function:

$$g(z) = \begin{cases} 1 & \text{if } z \geq 0 \\ 0 & \text{if } z < 0 \end{cases}$$

If we then let $h_\theta(x) = g(\theta^T x)$ as before but using this modified definition of $g$, and if we use the update rule

$$\theta_j := \theta_j + \alpha \left( y^{(i)} - h_\theta\!\left( x^{(i)} \right) \right) x_j^{(i)},$$

then we have the perceptron learning algorithm.
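The same update, with $g$ replaced by the hard threshold, gives a compact perceptron sketch (toy separable data; the names and learning rate are illustrative assumptions):

```python
import numpy as np

def threshold(z):
    return (np.asarray(z) >= 0).astype(float)

rng = np.random.default_rng(1)
n = 100
X = np.hstack([np.ones((n, 1)), rng.normal(size=(n, 2))])   # augmented samples
y = threshold(X @ np.array([0.2, 1.0, -1.5]))               # linearly separable labels

theta = np.zeros(3)
alpha = 1.0
for epoch in range(50):
    for i in range(n):
        # the update only changes theta when the prediction is wrong
        theta += alpha * (y[i] - threshold(X[i] @ theta)) * X[i]
print((threshold(X @ theta) == y).mean())   # training accuracy; 1.0 once the algorithm converges
```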

Multi-Class Classification

We introduce $k$ groups of parameters $\theta_1, \dots, \theta_k$, each of them being a vector in $\mathbb{R}^d$. Intuitively, we would like to use $\theta_1^T x, \dots, \theta_k^T x$ to represent the probabilities $P(y = 1 \mid x; \theta), \dots, P(y = k \mid x; \theta)$. However, there are two issues with such a direct approach. First, $\theta_j^T x$ is not necessarily within $[0, 1]$. Second, the summation of the $\theta_j^T x$'s is not necessarily $1$. Instead, we will define the softmax function $\mathrm{softmax} : \mathbb{R}^k \to \mathbb{R}^k$ by

$$\mathrm{softmax}(t_1, \dots, t_k) = \begin{bmatrix} \dfrac{\exp(t_1)}{\sum_{j=1}^{k} \exp(t_j)} \\ \vdots \\ \dfrac{\exp(t_k)}{\sum_{j=1}^{k} \exp(t_j)} \end{bmatrix}.$$

The inputs to the softmax function, the vector $(t_1, \dots, t_k)$ here, are often called logits. The output of the softmax function is always a probability vector whose entries are non-negative and sum up to $1$.

Let $(t_1, \dots, t_k) = (\theta_1^T x, \dots, \theta_k^T x)$; we obtain the following probabilistic model

$$\begin{bmatrix} P(y = 1 \mid x; \theta) \\ \vdots \\ P(y = k \mid x; \theta) \end{bmatrix} = \mathrm{softmax}\!\left( \theta_1^T x, \dots, \theta_k^T x \right),$$

that is,

$$P(y = i \mid x; \theta) = \frac{\exp(\theta_i^T x)}{\sum_{j=1}^{k} \exp(\theta_j^T x)}.$$

The negative log-likelihood of a single example $(x, y)$ is

$$-\log p(y \mid x; \theta) = -\log \frac{\exp(\theta_y^T x)}{\sum_{j=1}^{k} \exp(\theta_j^T x)}.$$

Thus the loss function, the total negative log-likelihood over the $n$ training examples, is given as

$$J(\theta) = \sum_{i=1}^{n} -\log \frac{\exp\!\left( \theta_{y^{(i)}}^T x^{(i)} \right)}{\sum_{j=1}^{k} \exp\!\left( \theta_j^T x^{(i)} \right)}.$$

It's convenient to define the cross-entropy loss $\ell_{\mathrm{ce}} : \mathbb{R}^k \times \{1, \dots, k\} \to \mathbb{R}_{\geq 0}$, which modularizes the complex equation above:

$$\ell_{\mathrm{ce}}\!\left( (t_1, \dots, t_k), y \right) = -\log \frac{\exp(t_y)}{\sum_{j=1}^{k} \exp(t_j)}.$$

With this notation, we can simply rewrite the total loss as

$$J(\theta) = \sum_{i=1}^{n} \ell_{\mathrm{ce}}\!\left( \left( \theta_1^T x^{(i)}, \dots, \theta_k^T x^{(i)} \right), y^{(i)} \right).$$
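A small NumPy sketch of the softmax probabilities and $\ell_{\mathrm{ce}}$ (the array names and toy logits are made up; labels are 0-indexed in the code, whereas the notes use $\{1, \dots, k\}$):

```python
import numpy as np

def softmax(t):
    t = t - t.max(axis=-1, keepdims=True)     # shift by the row max for numerical stability
    e = np.exp(t)
    return e / e.sum(axis=-1, keepdims=True)

def cross_entropy(logits, y):
    # l_ce(t, y) = -log( exp(t_y) / sum_j exp(t_j) )
    return -np.log(softmax(logits)[np.arange(len(y)), y])

logits = np.array([[2.0, 0.5, -1.0],          # rows are (theta_1^T x^(i), ..., theta_k^T x^(i))
                   [0.1, 0.2, 0.3]])
y = np.array([0, 2])                          # true classes for the two examples
print(softmax(logits).sum(axis=1))            # each row of probabilities sums to 1
print(cross_entropy(logits, y).sum())         # total loss J(theta) over the examples
```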

Support Vector Machine
