ERM stands for Empirical Risk Minimization

Empirical Risk Minimization

A learning algorithm receives as input a training set $S = ((x_1, y_1), \ldots, (x_m, y_m))$, sampled from an unknown distribution $\mathcal{D}$ and labeled by some target function $f$, and should output a predictor $h_S : \mathcal{X} \to \mathcal{Y}$ mapping the domain $\mathcal{X}$ to the label set $\mathcal{Y}$ (where the subscript $S$ indicates that the output predictor depends on $S$).

The goal of the algorithm is to find $h_S$ that minimizes the error with respect to the unknown $\mathcal{D}$ and $f$. Since the learner does not know what $\mathcal{D}$ and $f$ are, the true error $L_{\mathcal{D},f}(h_S)$ is not directly available to the learner. Instead, we can calculate the training error over the training sample:

$$L_S(h) = \frac{\lvert \{ i \in [m] : h(x_i) \neq y_i \} \rvert}{m},$$

where $[m] = \{1, \ldots, m\}$.

The terms empirical error and empirical risk are often used interchangeably for this quantity. The learning paradigm of coming up with a predictor $h$ that minimizes $L_S(h)$ is called Empirical Risk Minimization, or ERM for short.
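As a concrete illustration, here is a minimal sketch of the empirical risk under the 0-1 loss; the function name, types, and toy data are assumptions made for the example, not part of the text.

```python
from typing import Callable, Sequence, Tuple

def empirical_risk(h: Callable[[float], int], S: Sequence[Tuple[float, int]]) -> float:
    """L_S(h): the fraction of examples in S that h misclassifies (0-1 loss)."""
    m = len(S)
    return sum(1 for x, y in S if h(x) != y) / m

# Toy usage: a threshold predictor evaluated on a small labeled sample.
S = [(0.1, 0), (0.4, 0), (0.6, 1), (0.9, 1)]
h = lambda x: int(x >= 0.5)
print(empirical_risk(h, S))  # 0.0 on this particular sample
```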

Empirical Risk Minimization with Inductive Bias

However, learning to fit the training sample may lead to overfitting. A common way to avoid this issue is to apply the ERM learning rule over a restricted search space. Formally, the learner chooses in advance (before seeing the data) a set of predictors. This set is called a hypothesis class and is denoted by $\mathcal{H}$. Each $h \in \mathcal{H}$ is a function mapping from $\mathcal{X}$ to $\mathcal{Y}$. For a given class $\mathcal{H}$ and a training sample $S$, the $\mathrm{ERM}_{\mathcal{H}}$ learner uses the ERM rule to choose a predictor $h \in \mathcal{H}$ with the lowest possible error over $S$. Formally,

$$\mathrm{ERM}_{\mathcal{H}}(S) \in \underset{h \in \mathcal{H}}{\operatorname{argmin}}\ L_S(h),$$

where $\operatorname{argmin}$ stands for the set of hypotheses in $\mathcal{H}$ that achieve the minimum value of $L_S(h)$ over $\mathcal{H}$ (namely, the arguments that achieve the minimum value). By restricting the learner to choosing a predictor from $\mathcal{H}$, we bias it toward a particular set of predictors. Such restrictions are often called an inductive bias. Since the choice of such a restriction is determined before the learner sees the training data, it should ideally be based on some prior knowledge about the problem to be learned.
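A minimal sketch of the $\mathrm{ERM}_{\mathcal{H}}$ rule for a finite hypothesis class; the helper names and types are illustrative assumptions, and ties in the argmin are broken arbitrarily by returning the first minimizer.

```python
from typing import Callable, Sequence, Tuple

Hypothesis = Callable[[float], int]
Sample = Sequence[Tuple[float, int]]

def empirical_risk(h: Hypothesis, S: Sample) -> float:
    """L_S(h) under the 0-1 loss."""
    return sum(1 for x, y in S if h(x) != y) / len(S)

def erm(H: Sequence[Hypothesis], S: Sample) -> Hypothesis:
    """ERM_H(S): return some hypothesis in H with minimal empirical risk on S."""
    return min(H, key=lambda h: empirical_risk(h, S))
```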

Finite Hypothesis Classes

The simplest type of restriction on a class is imposing an upper bound on its size (that is, the number of predictors in $\mathcal{H}$). In this section, we show that if $\mathcal{H}$ is a finite class, then $\mathrm{ERM}_{\mathcal{H}}$ will not overfit on a sufficiently large training sample.

Let us now analyze the performance of the $\mathrm{ERM}_{\mathcal{H}}$ learning rule assuming that $\mathcal{H}$ is a finite class.

For a training sample $S$, labeled according to some $f : \mathcal{X} \to \mathcal{Y}$, let $h_S$ denote a result of applying $\mathrm{ERM}_{\mathcal{H}}$ to $S$, namely,

$$h_S \in \underset{h \in \mathcal{H}}{\operatorname{argmin}}\ L_S(h).$$
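For instance, a finite class might be a grid of threshold predictors. A self-contained toy sketch of picking $h_S$ from such a class (the grid, the sample, and the helper are all assumptions made for illustration):

```python
def empirical_risk(h, S):
    """L_S(h) under the 0-1 loss."""
    return sum(1 for x, y in S if h(x) != y) / len(S)

# A finite hypothesis class H: 21 threshold predictors h_t(x) = 1[x >= t].
H = [(lambda t: (lambda x: int(x >= t)))(t) for t in [i / 20 for i in range(21)]]

S = [(0.1, 0), (0.4, 0), (0.6, 1), (0.9, 1)]        # toy training sample
h_S = min(H, key=lambda h: empirical_risk(h, S))    # h_S = ERM_H(S)
print(empirical_risk(h_S, S))                       # 0.0: some threshold in H fits S perfectly
```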

In addition, we make the following realizability assumption.

Realizability Assumption

There exists $h^\star \in \mathcal{H}$ such that $L_{\mathcal{D},f}(h^\star) = 0$, where $L_{\mathcal{D},f}(h^\star)$ is the true risk of $h^\star$. This assumption implies that with probability 1 over random samples $S$, where the instances of $S$ are sampled according to $\mathcal{D}$ and labeled by $f$, we have $L_S(h^\star) = 0$.

Clearly, any guarantee on the error with respect to the underlying distribution $\mathcal{D}$, for an algorithm that has access only to the sample $S$, should depend on the relationship between $\mathcal{D}$ and $S$. The common assumption in statistical machine learning is that the training sample $S$ is generated by sampling points from the distribution $\mathcal{D}$ independently of each other. Formally, we make the following assumption.

i.i.d. Assumption

The examples in the training set are independently and identically distributed (i.i.d.) according to the distribution $\mathcal{D}$. That is, every $x_i$ in $S$ is freshly sampled according to $\mathcal{D}$ and then labeled according to the labeling function $f$. We denote this assumption by $S \sim \mathcal{D}^m$, where $m$ is the size of $S$, and $\mathcal{D}^m$ denotes the probability over $m$-tuples induced by applying $\mathcal{D}$ to pick each element of the tuple independently of the other members of the tuple.
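A sketch of generating such a sample, assuming for illustration that $\mathcal{D}$ is uniform on $[0, 1]$ and $f$ is a threshold function (both are assumptions, not given in the text):

```python
import random
from typing import Callable, List, Tuple

def sample_iid(m: int, f: Callable[[float], int], seed: int = 0) -> List[Tuple[float, int]]:
    """Draw S ~ D^m: m instances sampled i.i.d. from D (here, uniform on [0, 1]),
    each labeled by the target function f."""
    rng = random.Random(seed)
    xs = [rng.random() for _ in range(m)]   # i.i.d. draws from D
    return [(x, f(x)) for x in xs]          # labels are given by f

f = lambda x: int(x >= 0.5)                 # assumed target function
S = sample_iid(m=10, f=f)
```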

Since $h_S$ depends on the training set $S$, and that training set is picked by a random process, there is randomness in the choice of the predictor $h_S$ and, consequently, in the risk $L_{\mathcal{D},f}(h_S)$. Formally, we say that it is a random variable. Since we cannot guarantee perfect label prediction, we introduce another parameter for the quality of prediction, the accuracy parameter, commonly denoted by $\epsilon$. We interpret the event $L_{\mathcal{D},f}(h_S) > \epsilon$ as a failure of the learner, while if $L_{\mathcal{D},f}(h_S) \leq \epsilon$ we view the output of the algorithm as an approximately correct predictor. Therefore, we are interested in upper bounding the probability of sampling an $m$-tuple of instances that will lead to failure of the learner. Formally, let $S|_x = (x_1, \ldots, x_m)$ be the instances of the training set. We would like to upper bound

$$\mathcal{D}^m\big(\{S|_x : L_{\mathcal{D},f}(h_S) > \epsilon\}\big).$$
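To make the failure event concrete, here is a Monte Carlo sketch that estimates this probability for an assumed toy setup (uniform $\mathcal{D}$ on $[0, 1]$, a threshold target $f$, and a finite grid of threshold hypotheses; none of these choices come from the text):

```python
import random

# Assumed toy setup: D uniform on [0, 1], f(x) = 1[x >= 0.5], and
# H a finite grid of threshold predictors h_t(x) = 1[x >= t] (it contains f).
thresholds = [i / 20 for i in range(21)]
f = lambda x: int(x >= 0.5)

def empirical_risk(t, S):
    """L_S(h_t) under the 0-1 loss."""
    return sum(1 for x, y in S if int(x >= t) != y) / len(S)

def true_risk(t):
    """L_{D,f}(h_t): for uniform D, h_t and f disagree on an interval of length |t - 0.5|."""
    return abs(t - 0.5)

def failure_probability(m, eps, trials=1000):
    """Monte Carlo estimate of D^m({S|_x : L_{D,f}(h_S) > eps})."""
    failures = 0
    for trial in range(trials):
        rng = random.Random(trial)
        S = [(x, f(x)) for x in (rng.random() for _ in range(m))]
        t_S = min(thresholds, key=lambda t: empirical_risk(t, S))  # ERM_H(S)
        if true_risk(t_S) > eps:
            failures += 1
    return failures / trials

print(failure_probability(m=5, eps=0.1), failure_probability(m=50, eps=0.1))
```

In this toy setup the estimated failure probability shrinks as $m$ grows, which is the behavior the analysis below quantifies.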

Let $\mathcal{H}_B$ be the set of "bad" hypotheses, that is,

$$\mathcal{H}_B = \{h \in \mathcal{H} : L_{\mathcal{D},f}(h) > \epsilon\}.$$

In addition, let

$$M = \{S|_x : \exists h \in \mathcal{H}_B,\ L_S(h) = 0\}$$

be the set of misleading samples: namely, for every $S|_x \in M$, there is a "bad" hypothesis, $h \in \mathcal{H}_B$, that looks like a "good" hypothesis on $S$. Now, recall that we would like to bound the probability of the event $L_{\mathcal{D},f}(h_S) > \epsilon$. But, since the realizability assumption implies that $L_S(h_S) = 0$ (indeed, $L_S(h^\star) = 0$ and $h_S$ minimizes the empirical risk over $\mathcal{H}$), it follows that the event $L_{\mathcal{D},f}(h_S) > \epsilon$ can only happen if for some $h \in \mathcal{H}_B$ we have $L_S(h) = 0$. In other words, this event will only happen if our sample is in the set of misleading samples, $M$. Formally, we have shown that

$$\{S|_x : L_{\mathcal{D},f}(h_S) > \epsilon\} \subseteq M.$$

Note that we can rewrite $M$ as

$$M = \bigcup_{h \in \mathcal{H}_B} \{S|_x : L_S(h) = 0\}.$$

Hence,

$$\mathcal{D}^m\big(\{S|_x : L_{\mathcal{D},f}(h_S) > \epsilon\}\big) \leq \mathcal{D}^m(M) = \mathcal{D}^m\Big(\bigcup_{h \in \mathcal{H}_B} \{S|_x : L_S(h) = 0\}\Big).$$
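A standard way to continue from here is to apply the union bound (the probability of a union of events is at most the sum of their probabilities), which gives

$$\mathcal{D}^m\big(\{S|_x : L_{\mathcal{D},f}(h_S) > \epsilon\}\big) \leq \sum_{h \in \mathcal{H}_B} \mathcal{D}^m\big(\{S|_x : L_S(h) = 0\}\big).$$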