ERM stands for Empirical Risk Minimization

Empirical Risk Minimization

A learning algorithm receives as input a training set $S = ((x_1, y_1), \ldots, (x_m, y_m))$, sampled from an unknown distribution $\mathcal{D}$ and labeled by some target function $f$, and should output a predictor $h_S : \mathcal{X} \to \mathcal{Y}$ mapping the domain $\mathcal{X}$ to the label set $\mathcal{Y}$ (where the subscript $S$ indicates that the output predictor depends on $S$).

The goal of the algorithm is to find $h_S$ that minimizes the error with respect to the unknown $\mathcal{D}$ and $f$. Since the learner does not know what $\mathcal{D}$ and $f$ are, the true error $L_{\mathcal{D},f}(h_S)$ is not directly available to the learner. Instead, we can calculate the training error over the training sample:

$$L_S(h) = \frac{\lvert \{ i \in [m] : h(x_i) \neq y_i \} \rvert}{m},$$

where $[m] = \{1, \ldots, m\}$.

The terms empirical error and empirical risk are often used interchangeably for this quantity. The learning paradigm of coming up with a predictor $h$ that minimizes $L_S(h)$ is called Empirical Risk Minimization, or ERM for short.
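As a concrete illustration, here is a minimal sketch of the empirical risk under the 0-1 loss; the function name, types, and toy data are assumptions made for the example, not part of the text.

```python
from typing import Callable, Sequence, Tuple

def empirical_risk(h: Callable[[float], int], S: Sequence[Tuple[float, int]]) -> float:
    """L_S(h): the fraction of examples in S that h misclassifies (0-1 loss)."""
    m = len(S)
    return sum(1 for x, y in S if h(x) != y) / m

# Toy usage: a threshold predictor evaluated on a small labeled sample.
S = [(0.1, 0), (0.4, 0), (0.6, 1), (0.9, 1)]
h = lambda x: int(x >= 0.5)
print(empirical_risk(h, S))  # 0.0 on this particular sample
```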

Empirical Risk Minimization with Inductive Bias

However, learning to fit the training sample may lead to overfitting. A common way to avoid this issue is to apply the ERM learning rule over a restricted search space. Formally, the learner chooses in advance (before seeing the data) a set of predictors. This set is called a hypothesis class and is denoted by $\mathcal{H}$. Each $h \in \mathcal{H}$ is a function mapping from $\mathcal{X}$ to $\mathcal{Y}$. For a given class $\mathcal{H}$ and a training sample $S$, the $\mathrm{ERM}_{\mathcal{H}}$ learner uses the ERM rule to choose a predictor $h \in \mathcal{H}$ with the lowest possible error over $S$. Formally,

$$\mathrm{ERM}_{\mathcal{H}}(S) \in \underset{h \in \mathcal{H}}{\operatorname{argmin}}\ L_S(h),$$

where $\operatorname{argmin}$ stands for the set of hypotheses in $\mathcal{H}$ that achieve the minimum value of $L_S(h)$ over $\mathcal{H}$ (namely, the arguments that achieve the minimum value). By restricting the learner to choosing a predictor from $\mathcal{H}$, we bias it toward a particular set of predictors. Such restrictions are often called an inductive bias. Since the choice of such a restriction is determined before the learner sees the training data, it should ideally be based on some prior knowledge about the problem to be learned.
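A minimal sketch of the $\mathrm{ERM}_{\mathcal{H}}$ rule for a finite hypothesis class; the helper names and types are illustrative assumptions, and ties in the argmin are broken arbitrarily by returning the first minimizer.

```python
from typing import Callable, Sequence, Tuple

Hypothesis = Callable[[float], int]
Sample = Sequence[Tuple[float, int]]

def empirical_risk(h: Hypothesis, S: Sample) -> float:
    """L_S(h) under the 0-1 loss."""
    return sum(1 for x, y in S if h(x) != y) / len(S)

def erm(H: Sequence[Hypothesis], S: Sample) -> Hypothesis:
    """ERM_H(S): return some hypothesis in H with minimal empirical risk on S."""
    return min(H, key=lambda h: empirical_risk(h, S))
```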

Finite Hypothesis Classes

The simplest type of restriction on a class is imposing an upper bound on its size (that is, the number of predictors in $\mathcal{H}$). In this section, we show that if $\mathcal{H}$ is a finite class, then $\mathrm{ERM}_{\mathcal{H}}$ will not overfit on a sufficiently large training sample.

Let us now analyze the performance of the $\mathrm{ERM}_{\mathcal{H}}$ learning rule assuming that $\mathcal{H}$ is a finite class.

For a training sample $S$, labeled according to some $f : \mathcal{X} \to \mathcal{Y}$, let $h_S$ denote a result of applying $\mathrm{ERM}_{\mathcal{H}}$ to $S$, namely,

$$h_S \in \underset{h \in \mathcal{H}}{\operatorname{argmin}}\ L_S(h).$$
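For instance, a finite class might be a grid of threshold predictors. A self-contained toy sketch of picking $h_S$ from such a class (the grid, the sample, and the helper are all assumptions made for illustration):

```python
def empirical_risk(h, S):
    """L_S(h) under the 0-1 loss."""
    return sum(1 for x, y in S if h(x) != y) / len(S)

# A finite hypothesis class H: 21 threshold predictors h_t(x) = 1[x >= t].
H = [(lambda t: (lambda x: int(x >= t)))(t) for t in [i / 20 for i in range(21)]]

S = [(0.1, 0), (0.4, 0), (0.6, 1), (0.9, 1)]        # toy training sample
h_S = min(H, key=lambda h: empirical_risk(h, S))    # h_S = ERM_H(S)
print(empirical_risk(h_S, S))                       # 0.0: some threshold in H fits S perfectly
```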

In addition, we make the following realizability assumption.

Realizability Assumption

There exists $h^\star \in \mathcal{H}$ such that $L_{\mathcal{D},f}(h^\star) = 0$, where $L_{\mathcal{D},f}(h^\star)$ is the true risk of $h^\star$. This assumption implies that with probability 1 over random samples $S$, where the instances of $S$ are sampled according to $\mathcal{D}$ and labeled by $f$, we have $L_S(h^\star) = 0$.

Clearly, any guarantee on the error with respect to the underlying distribution $\mathcal{D}$, for an algorithm that has access only to the sample $S$, should depend on the relationship between $\mathcal{D}$ and $S$. The common assumption in statistical machine learning is that the training sample $S$ is generated by sampling points from the distribution $\mathcal{D}$ independently of each other. Formally, we make the following assumption.

i.i.d. Assumption

The examples in the training set are independently and identically distributed (i.i.d.) according to the distribution $\mathcal{D}$. That is, every $x_i$ in $S$ is freshly sampled according to $\mathcal{D}$ and then labeled according to the labeling function $f$. We denote this assumption by $S \sim \mathcal{D}^m$, where $m$ is the size of $S$, and $\mathcal{D}^m$ denotes the probability over $m$-tuples induced by applying $\mathcal{D}$ to pick each element of the tuple independently of the other members of the tuple.
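A sketch of generating such a sample, assuming for illustration that $\mathcal{D}$ is uniform on $[0, 1]$ and $f$ is a threshold function (both are assumptions, not given in the text):

```python
import random
from typing import Callable, List, Tuple

def sample_iid(m: int, f: Callable[[float], int], seed: int = 0) -> List[Tuple[float, int]]:
    """Draw S ~ D^m: m instances sampled i.i.d. from D (here, uniform on [0, 1]),
    each labeled by the target function f."""
    rng = random.Random(seed)
    xs = [rng.random() for _ in range(m)]   # i.i.d. draws from D
    return [(x, f(x)) for x in xs]          # labels are given by f

f = lambda x: int(x >= 0.5)                 # assumed target function
S = sample_iid(m=10, f=f)
```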

Since $h_S$ depends on the training set $S$, and that training set is picked by a random process, there is randomness in the choice of the predictor $h_S$ and, consequently, in the risk $L_{\mathcal{D},f}(h_S)$. Formally, we say that it is a random variable. Since we cannot guarantee perfect label prediction, we introduce another parameter for the quality of prediction, the accuracy parameter, commonly denoted by $\epsilon$. We interpret the event $L_{\mathcal{D},f}(h_S) > \epsilon$ as a failure of the learner, while if $L_{\mathcal{D},f}(h_S) \leq \epsilon$ we view the output of the algorithm as an approximately correct predictor. Therefore, we are interested in upper bounding the probability of sampling an $m$-tuple of instances that will lead to failure of the learner. Formally, let $S|_x = (x_1, \ldots, x_m)$ be the instances of the training set. We would like to upper bound

$$\mathcal{D}^m\big(\{S|_x : L_{\mathcal{D},f}(h_S) > \epsilon\}\big).$$
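To make the failure event concrete, here is a Monte Carlo sketch that estimates this probability for an assumed toy setup (uniform $\mathcal{D}$ on $[0, 1]$, a threshold target $f$, and a finite grid of threshold hypotheses; none of these choices come from the text):

```python
import random

# Assumed toy setup: D uniform on [0, 1], f(x) = 1[x >= 0.5], and
# H a finite grid of threshold predictors h_t(x) = 1[x >= t] (it contains f).
thresholds = [i / 20 for i in range(21)]
f = lambda x: int(x >= 0.5)

def empirical_risk(t, S):
    """L_S(h_t) under the 0-1 loss."""
    return sum(1 for x, y in S if int(x >= t) != y) / len(S)

def true_risk(t):
    """L_{D,f}(h_t): for uniform D, h_t and f disagree on an interval of length |t - 0.5|."""
    return abs(t - 0.5)

def failure_probability(m, eps, trials=1000):
    """Monte Carlo estimate of D^m({S|_x : L_{D,f}(h_S) > eps})."""
    failures = 0
    for trial in range(trials):
        rng = random.Random(trial)
        S = [(x, f(x)) for x in (rng.random() for _ in range(m))]
        t_S = min(thresholds, key=lambda t: empirical_risk(t, S))  # ERM_H(S)
        if true_risk(t_S) > eps:
            failures += 1
    return failures / trials

print(failure_probability(m=5, eps=0.1), failure_probability(m=50, eps=0.1))
```

In this toy setup the estimated failure probability shrinks as $m$ grows, which is the behavior the analysis below quantifies.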

Let $\mathcal{H}_B$ be the set of "bad" hypotheses, that is,

$$\mathcal{H}_B = \{h \in \mathcal{H} : L_{\mathcal{D},f}(h) > \epsilon\}.$$

In addition, let

$$M = \{S|_x : \exists h \in \mathcal{H}_B,\ L_S(h) = 0\}$$

be the set of misleading samples: namely, for every $S|_x \in M$, there is a "bad" hypothesis, $h \in \mathcal{H}_B$, that looks like a "good" hypothesis on $S$. Now, recall that we would like to bound the probability of the event $L_{\mathcal{D},f}(h_S) > \epsilon$. But, since the realizability assumption implies that $L_S(h_S) = 0$ (indeed, $L_S(h^\star) = 0$ and $h_S$ minimizes the empirical risk over $\mathcal{H}$), it follows that the event $L_{\mathcal{D},f}(h_S) > \epsilon$ can only happen if for some $h \in \mathcal{H}_B$ we have $L_S(h) = 0$. In other words, this event will only happen if our sample is in the set of misleading samples, $M$. Formally, we have shown that

$$\{S|_x : L_{\mathcal{D},f}(h_S) > \epsilon\} \subseteq M.$$

Note that we can rewrite $M$ as

$$M = \bigcup_{h \in \mathcal{H}_B} \{S|_x : L_S(h) = 0\}.$$

Hence,

$$\mathcal{D}^m\big(\{S|_x : L_{\mathcal{D},f}(h_S) > \epsilon\}\big) \leq \mathcal{D}^m(M) = \mathcal{D}^m\Big(\bigcup_{h \in \mathcal{H}_B} \{S|_x : L_S(h) = 0\}\Big).$$
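A standard way to continue from here is to apply the union bound (the probability of a union of events is at most the sum of their probabilities), which gives

$$\mathcal{D}^m\big(\{S|_x : L_{\mathcal{D},f}(h_S) > \epsilon\}\big) \leq \sum_{h \in \mathcal{H}_B} \mathcal{D}^m\big(\{S|_x : L_S(h) = 0\}\big).$$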