Classifier

For parameters $w, b$, we write our classifier as

h_{w, b} (x) = g (w^{T} x + b)

Here, $g (z) = 1$ if $z \geq 0$, and $g (z) = - 1$ otherwise.

Note

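A larger value of $|w^{T} x + b|$ indicates a more confident prediction: large positive values strongly favor the label $1$, and large negative values strongly favor the label $- 1$.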
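The classifier above can be sketched in a few lines of plain Python. The helper names (`dot`, `g`, `h`) and the numeric values of `w`, `b`, and the inputs are illustrative assumptions, not part of the notes.

```python
def dot(u, v):
    """Inner product of two equal-length sequences."""
    return sum(ui * vi for ui, vi in zip(u, v))


def g(z):
    """Threshold function: +1 if z >= 0, otherwise -1."""
    return 1 if z >= 0 else -1


def h(w, b, x):
    """Linear classifier h_{w,b}(x) = g(w^T x + b)."""
    return g(dot(w, x) + b)


# Hypothetical parameters and inputs, chosen only for illustration.
w, b = [2.0, -1.0], 0.5
print(h(w, b, [1.0, 1.0]))   # score w^T x + b = 1.5
print(h(w, b, [-1.0, 2.0]))  # score w^T x + b = -3.5
```

Note that the boundary case $z = 0$ is assigned to the positive class, matching the definition of $g$.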

Functional Margin

Given a training example $(x^{(i)}, y^{(i)})$, we define the functional margin of $(w, b)$ with respect to that example as

\hat{γ}^{(i)} = y^{(i)} (w^{T} x^{(i)} + b)

Note that if $y^{(i)} = 1$, then for the functional margin to be large (i.e., for our prediction to be confident and correct), we need $w^{T} x^{(i)} + b$ to be a large positive number. Conversely, if $y^{(i)} = - 1$, then for the functional margin to be large, we need $w^{T} x^{(i)} + b$ to be a large negative number. Moreover, if $y^{(i)} (w^{T} x^{(i)} + b) > 0$, then our prediction on this example is correct. Hence, a large functional margin represents a confident and correct prediction.

Given a training set $S = \{(x^{(i)}, y^{(i)}) ∣ i = 1, 2, \dots, n\}$, we also define the functional margin of $(w, b)$ with respect to $S$ as the smallest of the functional margins of the individual training examples. Denoted by $\hat{γ}$, this can therefore be written:

\hat{γ} = \min_{i = 1, \dots, n} \hat{γ}^{(i)}
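As a minimal sketch, the two definitions translate directly into code. The function names, the toy data `X`, `Y`, and the parameters `w`, `b` are assumptions made up for this illustration.

```python
def dot(u, v):
    """Inner product of two equal-length sequences."""
    return sum(ui * vi for ui, vi in zip(u, v))


def functional_margin(w, b, x, y):
    """Functional margin of (w, b) with respect to one example (x, y)."""
    return y * (dot(w, x) + b)


def functional_margin_set(w, b, X, Y):
    """Functional margin with respect to a training set: the smallest per-example margin."""
    return min(functional_margin(w, b, x, y) for x, y in zip(X, Y))


# Hypothetical toy data with labels in {-1, +1}.
X = [[2.0, 2.0], [1.0, 3.0], [-1.0, -1.0], [-2.0, 0.0]]
Y = [1, 1, -1, -1]
w, b = [1.0, 0.5], -0.5

margins = [functional_margin(w, b, x, y) for x, y in zip(X, Y)]
print(margins)                             # per-example margins
print(functional_margin_set(w, b, X, Y))   # minimum over the set
```

All per-example margins are positive here, so every toy example is classified correctly by this choice of $(w, b)$.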

Issues with this form

For a linear classifier with the choice of $g$ given above (taking values in $\{- 1, 1\}$), there's one property of the functional margin that makes it not a very good measure of confidence. Given our choice of $g$, we note that if we replace $w$ with $2 w$ and $b$ with $2 b$, then since $g (w^{T} x + b) = g (2 w^{T} x + 2 b)$, this would not change $h_{w, b} (x)$ at all. However, replacing $(w, b)$ with $(2 w, 2 b)$ also results in multiplying our functional margin by a factor of $2$. Thus, it seems that by exploiting our freedom to scale $w$ and $b$, we can make the functional margin arbitrarily large without really changing anything meaningful. Intuitively, it might therefore make sense to impose some sort of normalization condition such as $∥ w ∥_{2} = 1$; i.e., we might replace $(w, b)$ with $(w / ∥ w ∥_{2}, b / ∥ w ∥_{2})$, and instead consider the functional margin of $(w / ∥ w ∥_{2}, b / ∥ w ∥_{2})$.
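The scaling issue is easy to see numerically. The sketch below uses made-up data and parameters (all names and values are assumptions) to show that doubling $(w, b)$ doubles the functional margin while leaving every prediction unchanged, and that normalizing by $∥ w ∥_{2}$ removes the ambiguity.

```python
import math


def dot(u, v):
    """Inner product of two equal-length sequences."""
    return sum(ui * vi for ui, vi in zip(u, v))


def predict(w, b, x):
    """h_{w,b}(x) with g(z) = 1 if z >= 0 else -1."""
    return 1 if dot(w, x) + b >= 0 else -1


def functional_margin_set(w, b, X, Y):
    """Smallest functional margin over the training set."""
    return min(y * (dot(w, x) + b) for x, y in zip(X, Y))


def scale(w, b, c):
    """Replace (w, b) with (c w, c b)."""
    return [c * wi for wi in w], c * b


def normalize(w, b):
    """Replace (w, b) with (w / ||w||_2, b / ||w||_2)."""
    norm = math.sqrt(dot(w, w))
    return [wi / norm for wi in w], b / norm


# Hypothetical toy data and parameters.
X = [[2.0, 2.0], [1.0, 3.0], [-1.0, -1.0], [-2.0, 0.0]]
Y = [1, 1, -1, -1]
w, b = [3.0, 4.0], -1.0
w2, b2 = scale(w, b, 2.0)

same_predictions = all(predict(w, b, x) == predict(w2, b2, x) for x in X)
margin = functional_margin_set(w, b, X, Y)
margin_doubled = functional_margin_set(w2, b2, X, Y)
margin_normalized = functional_margin_set(*normalize(w, b), X, Y)
margin_normalized_2 = functional_margin_set(*normalize(w2, b2), X, Y)
print(same_predictions, margin, margin_doubled, margin_normalized, margin_normalized_2)
```

After normalization, $(w, b)$ and $(2 w, 2 b)$ give the same margin, which is the motivation for the geometric margin.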

Geometric Margin

Consider the picture below

The decision boundary corresponding to $(w, b)$ is shown, along with the vector $w$. Note that $w$ is orthogonal (at 90°) to the separating hyperplane. Consider the point at A, which represents the input $x^{(i)}$ of some training example with label $y^{(i)} = 1$. Its distance to the decision boundary, $γ^{(i)}$, is given by the line segment AB.

Since $w / ∥ w ∥$ is a unit-length vector pointing in the same direction as $w$, the point $B$ is given by $x^{(i)} - γ^{(i)} w / ∥ w ∥$. Since $B$ is on the decision boundary $w^{T} x + b = 0$, we have

w^{T} (x^{(i)} - γ^{(i)} \frac{w}{∥ w ∥}) + b = 0

Solving for $γ^{(i)}$ yields

γ^{(i)} = (\frac{w}{∥ w ∥})^{T} x^{(i)} + \frac{b}{∥ w ∥}

More generally, we define the geometric margin of $(w, b)$ with respect to a training example $(x^{(i)}, y^{(i)})$ to be

γ^{(i)} = y^{(i)} [(\frac{w}{∥ w ∥})^{T} x^{(i)} + \frac{b}{∥ w ∥}]

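Note that if $∥ w ∥ = 1$, then the functional margin equals the geometric margin. Unlike the functional margin, the geometric margin is invariant to rescaling of the parameters: replacing $(w, b)$ with $(2 w, 2 b)$ leaves it unchanged.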
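A small sketch of the geometric margin, under the assumption of a made-up two-dimensional example: for the line $3 x_{1} + 4 x_{2} - 5 = 0$, the geometric margin of a correctly classified point is its Euclidean distance to the line, and it does not change when $(w, b)$ is rescaled. Function names and data are illustrative.

```python
import math


def dot(u, v):
    """Inner product of two equal-length sequences."""
    return sum(ui * vi for ui, vi in zip(u, v))


def geometric_margin(w, b, x, y):
    """Geometric margin y * ((w / ||w||)^T x + b / ||w||) for one example."""
    norm = math.sqrt(dot(w, w))
    return y * (dot(w, x) + b) / norm


# Hypothetical parameters and examples.
w, b = [3.0, 4.0], -5.0
x_pos, y_pos = [3.0, 4.0], 1
x_neg, y_neg = [0.0, 0.0], -1

gm_pos = geometric_margin(w, b, x_pos, y_pos)
gm_neg = geometric_margin(w, b, x_neg, y_neg)

# Rescaling (w, b) by any positive constant leaves the geometric margin unchanged.
w10, b10 = [10.0 * wi for wi in w], 10.0 * b
gm_pos_scaled = geometric_margin(w10, b10, x_pos, y_pos)

print(gm_pos, gm_neg, gm_pos_scaled)
```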

The Optimal Margin Classifier

Note

For now, we will assume that we are given a training set that is linearly separable.

Form 1

To maximize the geometric margin, we could pose the following optimization problem:

\begin{aligned} \max_{γ, w, b} \quad & γ \\ \text{s.t.} \quad & y^{(i)} (w^{T} x^{(i)} + b) \geq γ, \quad i = 1, 2, \dots, n \\ & ∥ w ∥ = 1 \end{aligned}

If we could solve the optimization problem above, we'd be done. But the "$∥ w ∥ = 1$" constraint is a nasty (non-convex) one, and this problem certainly isn't in any format that we can plug into standard optimization software to solve.

Form 2

So, let's try transforming the problem into a nicer one. Consider:

\begin{aligned} \max_{\hat{γ}, w, b} \quad & \frac{\hat{γ}}{∥ w ∥} \\ \text{s.t.} \quad & y^{(i)} (w^{T} x^{(i)} + b) \geq \hat{γ}, \quad i = 1, 2, \dots, n \end{aligned}

Here, we're going to maximize $\hat{γ} / ∥ w ∥$, subject to the functional margins all being at least $\hat{γ}$. Since the geometric and functional margins are related by $γ = \hat{γ} / ∥ w ∥$, this will give us the answer we want. Moreover, we've gotten rid of the constraint $∥ w ∥ = 1$ that we didn't like. The downside is that we now have a nasty (again, non-convex) objective function $\frac{\hat{γ}}{∥ w ∥}$.

Form 3

Recall our earlier discussion that we can add an arbitrary scaling constraint on $w$ and $b$ without changing anything. Therefore, we introduce the scaling constraint that the functional margin of $w, b$ with respect to the training set must be $1$ :

\hat{γ} = 1

This restriction can be satisfied by rescaling $w, b$. Noting that maximizing $\hat{γ} / ∥ w ∥ = 1 / ∥ w ∥$ is the same thing as minimizing $∥ w ∥^{2}$, we now have the following optimization problem:

\begin{aligned} \min_{w, b} \quad & \frac{1}{2} ∥ w ∥^{2} \\ \text{s.t.} \quad & y^{(i)} (w^{T} x^{(i)} + b) \geq 1, \quad i = 1, 2, \dots, n \end{aligned}

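We've now transformed the problem into a form that can be efficiently solved: the objective is a convex quadratic and the constraints are linear, so this is a quadratic program (QP) that standard QP software can handle.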
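The following is not a solver, only a sketch of the bookkeeping behind Form 3 on a made-up dataset. It rescales two hypothetical separating hyperplanes so that their functional margin is $1$, checks the constraints, and compares the objective $\frac{1}{2} ∥ w ∥^{2}$ with the resulting geometric margin $1 / ∥ w ∥$. All names and values are assumptions for illustration.

```python
import math


def dot(u, v):
    """Inner product of two equal-length sequences."""
    return sum(ui * vi for ui, vi in zip(u, v))


def rescale_to_unit_margin(w, b, X, Y):
    """Rescale a separating (w, b) so that its functional margin on the set equals 1."""
    m = min(y * (dot(w, x) + b) for x, y in zip(X, Y))
    if m <= 0:
        raise ValueError("(w, b) does not separate the data")
    return [wi / m for wi in w], b / m


def is_feasible(w, b, X, Y, tol=1e-9):
    """Check the Form 3 constraints y_i (w^T x_i + b) >= 1 for all i."""
    return all(y * (dot(w, x) + b) >= 1.0 - tol for x, y in zip(X, Y))


def objective(w):
    """Form 3 objective (1/2) ||w||^2."""
    return 0.5 * dot(w, w)


# Hypothetical linearly separable toy data.
X = [[1.0, 1.0], [3.0, 3.0], [-1.0, -1.0]]
Y = [1, 1, -1]

# Two hypothetical separating hyperplanes, both rescaled to functional margin 1.
wa, ba = rescale_to_unit_margin([1.0, 1.0], 0.0, X, Y)
wb, bb = rescale_to_unit_margin([1.0, 0.0], 0.0, X, Y)

for w, b in [(wa, ba), (wb, bb)]:
    print(w, b, is_feasible(w, b, X, Y), objective(w), 1.0 / math.sqrt(dot(w, w)))
```

Among feasible candidates, the one with the smaller objective has the larger geometric margin, which is exactly why minimizing $∥ w ∥^{2}$ under these constraints maximizes the margin.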

Lagrange Duality

Primal Problem

Consider the following, which we'll call the primal optimization problem:

\begin{aligned} \min_{w} \quad & f (w) \\ \text{s.t.} \quad & g_{i} (w) \leq 0, \quad i = 1, \dots, k \\ & h_{i} (w) = 0, \quad i = 1, \dots, l \end{aligned}

To solve it, we start by defining the generalized Lagrangian

L (w, α, β) = f (w) + \sum_{i = 1}^{k} α_{i} g_{i} (w) + \sum_{i = 1}^{l} β_{i} h_{i} (w)

Here, the $α_{i}$'s and $β_{i}$'s are the Lagrange multipliers. Consider the quantity

θ_{P} (w) = \max_{α, β : α_{i} \geq 0} L (w, α, β) .

Here, the "$P$" subscript stands for "primal." Let some $w$ be given. If $w$ violates any of the primal constraints (i.e., if either $g_{i} (w) > 0$ or $h_{i} (w) \neq 0$ for some $i$), then you should be able to verify that

θ_{P} (w) = \max_{α, β : α_{i} \geq 0} \left[ f (w) + \sum_{i = 1}^{k} α_{i} g_{i} (w) + \sum_{i = 1}^{l} β_{i} h_{i} (w) \right] = \infty

Conversely, if the constraints are indeed satisfied for a particular value of $w$, then $θ_{P} (w) = f (w)$. Hence,

θ_{P} (w) = \begin{cases} f (w) & \text{if } w \text{ satisfies the primal constraints} \\ \infty & \text{otherwise} \end{cases}

Thus, $θ_{P}$ takes the same value as the objective in our problem for all values of $w$ that satisfy the primal constraints, and is positive infinity if the constraints are violated. Hence, if we consider the minimization problem

\min_{w} θ_{P} (w) = \min_{w} \max_{α, β : α_{i} \geq 0} L (w, α, β),

we see that it is the same problem as (i.e., has the same solutions as) our original, primal problem. For later use, we also define the optimal value of the objective to be $p^{*} = \min_{w} θ_{P} (w)$; we call this the value of the primal problem.

Dual Problem

Now, let's look at a slightly different problem. We define

θ_{D} (α, β) = \min_{w} L (w, α, β) .

Here, the "$D$" subscript stands for "dual." Note also that whereas in the definition of $θ_{P}$ we were optimizing (maximizing) with respect to $α, β$, here we are minimizing with respect to $w$.

We can now pose the dual optimization problem:

\max_{α, β : α_{i} \geq 0} θ_{D} (α, β) = \max_{α, β : α_{i} \geq 0} \min_{w} L (w, α, β) .

This is exactly the same as our primal problem shown above, except that the order of the "max" and the "min" is now exchanged. We also define the optimal value of the dual problem's objective to be $d^{*} = \max_{α, β : α_{i} \geq 0} θ_{D} (α, β)$.

How are the primal and the dual problems related? It can easily be shown that $d^{*} = \max_{α, β : α_{i} \geq 0} \min_{w} L (w, α, β) \leq \min_{w} \max_{α, β : α_{i} \geq 0} L (w, α, β) = p^{*}$. This follows from the "max min" of a function always being less than or equal to the "min max." However, under certain conditions, we will have

d^{*} = p^{*},

so that we can solve the dual problem in lieu of the primal problem.
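To make the primal and dual values concrete, here is a sketch on a one-dimensional toy problem that is an assumption of this example, not part of the notes: minimize $f (w) = w^{2}$ subject to $g (w) = 1 - w \leq 0$. For this problem $θ_{D} (α)$ has a closed form because the Lagrangian is a convex quadratic in $w$, and both optimal values are found by a coarse grid search.

```python
import math


def f(w):
    """Toy objective f(w) = w^2 (hypothetical example)."""
    return w * w


def g(w):
    """Toy inequality constraint g(w) = 1 - w <= 0, i.e. w >= 1."""
    return 1.0 - w


def lagrangian(w, alpha):
    """L(w, alpha) = f(w) + alpha * g(w); there are no equality constraints here."""
    return f(w) + alpha * g(w)


def theta_P(w):
    """max over alpha >= 0 of L: f(w) if w is feasible, +infinity otherwise."""
    return f(w) if g(w) <= 0 else math.inf


def theta_D(alpha):
    """min over w of L: setting dL/dw = 2w - alpha = 0 gives the minimizer w = alpha / 2."""
    return lagrangian(alpha / 2.0, alpha)


grid = [i / 100.0 for i in range(-300, 301)]
p_star = min(theta_P(w) for w in grid)                # primal value on the grid
d_star = max(theta_D(a) for a in grid if a >= 0)      # dual value on the grid
print(p_star, d_star)
```

On this convex toy problem the two values coincide, attained at $w = 1$ and $α = 2$; the inequality $d^{*} \leq p^{*}$ holds in general.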

Karush-Kuhn-Tucker (KKT) conditions

Assumptions: Suppose $f$ and the $g_{i}$'s are convex, and the $h_{i}$'s are affine (i.e., there exist $a_{i}, b_{i}$ so that $h_{i} (w) = a_{i}^{T} w + b_{i}$). Suppose further that the constraints $g_{i}$ are (strictly) feasible (i.e., there exists some $w$ so that $g_{i} (w) < 0$ for all $i$).

Claim: There must exist $w^{*}, α^{*}, β^{*}$ so that $w^{*}$ is the solution to the primal problem, $α^{*}, β^{*}$ are the solution to the dual problem, and moreover $p^{*} = d^{*} = L (w^{*}, α^{*}, β^{*})$. In addition, $w^{*}, α^{*}, β^{*}$ satisfy the Karush-Kuhn-Tucker (KKT) conditions:

\begin{aligned} \frac{\partial}{\partial w_{i}} L (w^{*}, α^{*}, β^{*}) &= 0, \quad i = 1, \dots, d \\ \frac{\partial}{\partial β_{i}} L (w^{*}, α^{*}, β^{*}) &= 0, \quad i = 1, \dots, l \\ α_{i}^{*} g_{i} (w^{*}) &= 0, \quad i = 1, \dots, k \\ g_{i} (w^{*}) &\leq 0, \quad i = 1, \dots, k \\ α_{i}^{*} &\geq 0, \quad i = 1, \dots, k \end{aligned}

The third condition, $α_{i}^{*} g_{i} (w^{*}) = 0$, is called the dual complementarity condition. It implies that if $α_{i}^{*} > 0$, then $g_{i} (w^{*}) = 0$, i.e., the constraint holds with equality.
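The KKT conditions can be checked mechanically for a candidate point. The sketch below does this for a hypothetical two-variable problem chosen to resemble the SVM objective: minimize $\frac{1}{2} ∥ w ∥^{2}$ subject to $g (w) = 1 - w_{1} - w_{2} \leq 0$. The problem, the function name `kkt_report`, and the candidate points are assumptions; since there are no equality constraints, the condition on $β$ is vacuous here.

```python
def kkt_report(w, alpha, tol=1e-9):
    """Check the KKT conditions for: min 0.5 * ||w||^2  s.t.  g(w) = 1 - w1 - w2 <= 0."""
    # Gradient of L(w, alpha) = 0.5 * ||w||^2 + alpha * (1 - w1 - w2) with respect to w.
    grad_L = [w[0] - alpha, w[1] - alpha]
    g_val = 1.0 - w[0] - w[1]
    return {
        "stationarity": all(abs(d) <= tol for d in grad_L),
        "dual_complementarity": abs(alpha * g_val) <= tol,
        "primal_feasibility": g_val <= tol,
        "dual_feasibility": alpha >= -tol,
    }


# Candidate optimum: w* = (0.5, 0.5) with multiplier alpha* = 0.5.
optimum = kkt_report([0.5, 0.5], 0.5)
# A feasible but non-optimal point: stationarity fails.
non_optimum = kkt_report([1.0, 1.0], 0.0)

print(optimum)
print(non_optimum)
```

At the candidate optimum the constraint is active ($g (w^{*}) = 0$) and its multiplier is positive, which is the pattern the dual complementarity condition describes.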

The Dual Form of SVM

Support Vectors

Previously, we posed the following (primal) optimization problem for finding the optimal margin classifier

\begin{aligned} \min_{w, b} \quad & \frac{1}{2} ∥ w ∥^{2} \\ \text{s.t.} \quad & y^{(i)} (w^{T} x^{(i)} + b) \geq 1, \quad i = 1, 2, \dots, n \end{aligned}

We can write the constraints as

g_{i} (w) = - y^{(i)} (w^{T} x^{(i)} + b) + 1 \leq 0

We have one such constraint for each training example. Note that from the KKT dual complementarity condition, we will have $α_{i} > 0$ only for the training examples that have functional margin exactly equal to one (i.e., the ones corresponding to constraints that hold with equality, $g_{i} (w) = 0$). Consider the figure below, in which a maximum margin separating hyperplane is shown by the solid line. The points with the smallest margins are exactly the ones closest to the decision boundary; here, these are the three points (one negative and two positive examples) that lie on the dashed lines parallel to the decision boundary. Thus, only three of the $α_{i}$'s (namely, the ones corresponding to these three training examples) will be non-zero at the optimal solution to our optimization problem. These three points are called the support vectors in this problem.
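As a sketch, the candidates for support vectors can be read off from a given $(w, b)$ by looking for examples whose functional margin equals $1$. The toy data and the parameters below are assumptions; $(w, b) = ((0.5, 0.5), 0)$ is taken to be the optimal margin classifier for this particular made-up set.

```python
def dot(u, v):
    """Inner product of two equal-length sequences."""
    return sum(ui * vi for ui, vi in zip(u, v))


def support_vector_indices(w, b, X, Y, tol=1e-9):
    """Indices of examples whose constraint holds with equality: y_i (w^T x_i + b) = 1."""
    return [
        i
        for i, (x, y) in enumerate(zip(X, Y))
        if abs(y * (dot(w, x) + b) - 1.0) <= tol
    ]


# Hypothetical toy data and the (assumed) optimal parameters for it.
X = [[1.0, 1.0], [3.0, 3.0], [-1.0, -1.0], [-2.0, -3.0]]
Y = [1, 1, -1, -1]
w, b = [0.5, 0.5], 0.0

margins = [y * (dot(w, x) + b) for x, y in zip(X, Y)]
sv = support_vector_indices(w, b, X, Y)
print(margins, sv)
```

Only the examples at indices `0` and `2` sit exactly on the margin; by dual complementarity, every other example must have $α_{i} = 0$.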

Find the dual problem

Constructing the Lagrangian for the SVM optimization problem, we have

L (w, b, α) = \frac{1}{2} ∥ w ∥^{2} - \sum_{i = 1}^{n} α_{i} [y^{(i)} (w^{T} x^{(i)} + b) - 1]

Note that there are only "$α_{i}$" but no "$β_{i}$" Lagrange multipliers, since the problem has only inequality constraints. To find the dual form of the problem, we first need to minimize $L (w, b, α)$ with respect to $w$ and $b$ (for fixed $α$) to get $θ_{D}$, which we do by setting the derivatives of $L$ with respect to $w$ and $b$ to zero:

\begin{cases} \nabla_{w} L (w, b, α) = w - \sum_{i = 1}^{n} α_{i} y^{(i)} x^{(i)} = 0 \implies w = \sum_{i = 1}^{n} α_{i} y^{(i)} x^{(i)} \\ \frac{\partial}{\partial b} L (w, b, α) = - \sum_{i = 1}^{n} α_{i} y^{(i)} = 0 \implies \sum_{i = 1}^{n} α_{i} y^{(i)} = 0 \end{cases}

Substituting these results back into the Lagrangian, we have

∥ w ∥^{2} = w^{T} w = \left( \sum_{i = 1}^{n} α_{i} y^{(i)} x^{(i)} \right)^{T} \left( \sum_{j = 1}^{n} α_{j} y^{(j)} x^{(j)} \right) = \sum_{i, j = 1}^{n} α_{i} α_{j} y^{(i)} y^{(j)} (x^{(i)})^{T} x^{(j)}

and

\begin{aligned} \sum_{i = 1}^{n} α_{i} y^{(i)} (w^{T} x^{(i)} + b) &= \sum_{i = 1}^{n} α_{i} y^{(i)} \left[ \left( \sum_{j = 1}^{n} α_{j} y^{(j)} x^{(j)} \right)^{T} x^{(i)} + b \right] \\ &= \sum_{i = 1}^{n} α_{i} y^{(i)} \left( \sum_{j = 1}^{n} α_{j} y^{(j)} (x^{(j)})^{T} x^{(i)} + b \right) \\ &= \sum_{i, j = 1}^{n} α_{i} α_{j} y^{(i)} y^{(j)} (x^{(i)})^{T} x^{(j)} + b \sum_{i = 1}^{n} α_{i} y^{(i)} \end{aligned}

With $\sum_{i = 1}^{n} α_{i} y^{(i)} = 0$, the term involving $b$ vanishes. Since $L (w, b, α) = \frac{1}{2} ∥ w ∥^{2} - \sum_{i = 1}^{n} α_{i} y^{(i)} (w^{T} x^{(i)} + b) + \sum_{i = 1}^{n} α_{i}$, we get the final result

L (w, b, α) = \sum_{i = 1}^{n} α_{i} - \frac{1}{2} \sum_{i, j = 1}^{n} α_{i} α_{j} y^{(i)} y^{(j)} (x^{(i)})^{T} x^{(j)}

Putting this together with the constraints $α_{i} \geq 0$ (that we always had) and the constraint $\sum_{i = 1}^{n} α_{i} y^{(i)} = 0$, we obtain the following dual optimization problem:

\begin{aligned} \max_{α} \quad & L (α) = \sum_{i = 1}^{n} α_{i} - \frac{1}{2} \sum_{i, j = 1}^{n} α_{i} α_{j} y^{(i)} y^{(j)} (x^{(i)})^{T} x^{(j)} \\ \text{s.t.} \quad & α_{i} \geq 0, \quad i = 1, 2, \dots, n \\ & \sum_{i = 1}^{n} α_{i} y^{(i)} = 0 \end{aligned}
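The dual objective and the recovery of $w$ from $α$ can be sketched as follows. This is not a dual solver: the multipliers `alpha` are a hand-picked candidate for a made-up three-point dataset, and the function names are assumptions. The recovery of $b$ is also an assumption of this sketch; it uses the fact that an example with $α_{i} > 0$ has functional margin exactly $1$, so $b = y^{(i)} - w^{T} x^{(i)}$.

```python
def dot(u, v):
    """Inner product of two equal-length sequences."""
    return sum(ui * vi for ui, vi in zip(u, v))


def dual_objective(alpha, X, Y):
    """L(alpha) = sum_i alpha_i - 0.5 * sum_{i,j} alpha_i alpha_j y_i y_j <x_i, x_j>."""
    n = len(X)
    quad = sum(
        alpha[i] * alpha[j] * Y[i] * Y[j] * dot(X[i], X[j])
        for i in range(n)
        for j in range(n)
    )
    return sum(alpha) - 0.5 * quad


def is_dual_feasible(alpha, Y, tol=1e-9):
    """Check alpha_i >= 0 for all i and sum_i alpha_i y_i = 0."""
    return all(a >= -tol for a in alpha) and abs(sum(a * y for a, y in zip(alpha, Y))) <= tol


def recover_w(alpha, X, Y):
    """w = sum_i alpha_i y_i x_i, from setting the gradient of L with respect to w to zero."""
    return [sum(alpha[i] * Y[i] * X[i][k] for i in range(len(X))) for k in range(len(X[0]))]


def recover_b(w, alpha, X, Y, tol=1e-9):
    """b = y_i - w^T x_i for any example with alpha_i > 0 (assumed to lie on the margin)."""
    i = next(i for i, a in enumerate(alpha) if a > tol)
    return Y[i] - dot(w, X[i])


# Hypothetical toy data and a hand-picked candidate for the dual optimum.
X = [[1.0, 1.0], [3.0, 3.0], [-1.0, -1.0]]
Y = [1, 1, -1]
alpha = [0.25, 0.0, 0.25]

w = recover_w(alpha, X, Y)
b = recover_b(w, alpha, X, Y)
dual_value = dual_objective(alpha, X, Y)
primal_value = 0.5 * dot(w, w)
print(is_dual_feasible(alpha, Y), w, b, dual_value, primal_value)
```

For this candidate the dual value matches the primal objective $\frac{1}{2} ∥ w ∥^{2}$ of the recovered $w$, and nearby feasible choices such as `[0.2, 0.0, 0.2]` or `[0.3, 0.0, 0.3]` give a smaller dual value.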

Regularization and the Non-separable Case

The derivation of the SVM as presented so far assumed that the data is linearly separable. While mapping data to a high dimensional feature space via $ϕ$ does generally increase the likelihood that the data is separable, we can't guarantee that it always will be so. Also, in some cases it is not clear that finding a separating hyperplane is exactly what we'd want to do, since that might be susceptible to outliers. For instance, the left figure below shows an optimal margin classifier, and when a single outlier is added in the upper-left region (right figure), it causes the decision boundary to make a dramatic swing, and the resulting classifier has a much smaller margin.

To make the algorithm work for non-linearly separable datasets as well as be less sensitive to outliers, we reformulate our optimization (using $ℓ_{1}$ regularization) as follows:

\begin{aligned} \min_{ξ, w, b} \quad & \frac{1}{2} ∥ w ∥^{2} + C \sum_{i = 1}^{n} ξ_{i} \\ \text{s.t.} \quad & y^{(i)} (w^{T} x^{(i)} + b) \geq 1 - ξ_{i}, \quad i = 1, 2, \dots, n \\ & ξ_{i} \geq 0, \quad i = 1, 2, \dots, n \end{aligned}

Thus, examples are now permitted to have (functional) margin less than $1$, and if an example has functional margin $1 - ξ_{i}$ (with $ξ_{i} > 0$), we would pay a cost of the objective function being increased by $C ξ_{i}$. The parameter $C$ controls the relative weighting between the twin goals of making the $∥ w ∥^{2}$ small (which we saw earlier makes the margin large) and of ensuring that most examples have functional margin at least $1$.
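For a fixed $(w, b)$, the smallest slack that satisfies both constraints is $ξ_{i} = \max (0, 1 - y^{(i)} (w^{T} x^{(i)} + b))$. The sketch below evaluates the slacks and the soft-margin objective on made-up data containing one outlier. The names, data, and values of $C$ are assumptions, and $(w, b)$ is simply fixed rather than optimized.

```python
def dot(u, v):
    """Inner product of two equal-length sequences."""
    return sum(ui * vi for ui, vi in zip(u, v))


def slacks(w, b, X, Y):
    """Smallest feasible slack per example: xi_i = max(0, 1 - y_i (w^T x_i + b))."""
    return [max(0.0, 1.0 - y * (dot(w, x) + b)) for x, y in zip(X, Y)]


def soft_margin_objective(w, b, X, Y, C):
    """0.5 * ||w||^2 + C * sum_i xi_i, using the smallest feasible slacks."""
    return 0.5 * dot(w, w) + C * sum(slacks(w, b, X, Y))


# Hypothetical toy data; the last negative example is an outlier on the positive side.
X = [[1.0, 1.0], [3.0, 3.0], [-1.0, -1.0], [0.5, 0.0]]
Y = [1, 1, -1, -1]
w, b = [0.5, 0.5], 0.0

xi = slacks(w, b, X, Y)
print(xi)
print(soft_margin_objective(w, b, X, Y, C=1.0))
print(soft_margin_objective(w, b, X, Y, C=10.0))
```

Only the outlier has non-zero slack, and raising $C$ makes that single violation dominate the objective, which illustrates how $C$ trades margin size against margin violations.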
