Feature Maps

Consider fitting a cubic function $y = θ_{3} x^{3} + θ_{2} x^{2} + θ_{1} x + θ_{0}$. We can view the cubic function as a linear function over a different set of feature variables. Concretely, let the function $ϕ : R \to R^{4}$ be defined as

ϕ (x) = \begin{bmatrix} 1 \\ x \\ x^{2} \\ x^{3} \end{bmatrix} \in R^{4}

Let $θ \in R^{4}$ be the vector containing $θ_{0}, \dots, θ_{3}$ as entries. Then we can rewrite the cubic function in $x$ as

y = θ^{T} ϕ (x)

Thus, a cubic function of the variable $x$ can be viewed as a linear function over the variables $ϕ (x)$. To distinguish between these two sets of variables, in the context of kernel methods, we will call the “original” input value the input attributes of a problem. When the original input is mapped to some new set of quantities $ϕ (x)$, we will call those new quantities the feature variables. We will call $ϕ$ a feature map, which maps the attributes to the features.
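As a quick numerical check of this view, the following Python sketch evaluates a cubic both directly and as $θ^{T} ϕ (x)$. The coefficient values and the helper names `phi` and `dot` are illustrative assumptions, not part of the original notes.

```python
def phi(x):
    """Feature map from a scalar attribute x to the features [1, x, x^2, x^3]."""
    return [1.0, x, x ** 2, x ** 3]


def dot(u, v):
    """Inner product of two equal-length sequences."""
    return sum(a * b for a, b in zip(u, v))


# Hypothetical coefficients theta_0, ..., theta_3, chosen only for illustration.
theta = [2.0, -1.0, 0.5, 3.0]

x = 1.5
# The cubic written out in the attribute x.
cubic = theta[3] * x ** 3 + theta[2] * x ** 2 + theta[1] * x + theta[0]
# The same value written as a linear function of the features phi(x).
linear_in_features = dot(theta, phi(x))
```

Both expressions give the same number, which is the sense in which the cubic is linear in $ϕ (x)$.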

LMS with the kernel trick

The batch gradient descent update rule for least mean squares (LMS) in the linear case is

θ \leftarrow θ + α \sum_{i = 1}^{n} (y^{(i)} - θ^{T} x^{(i)}) x^{(i)}

If we use a feature map $ϕ$, the update becomes

θ \leftarrow θ + α \sum_{i = 1}^{n} (y^{(i)} - θ^{T} ϕ (x^{(i)})) ϕ (x^{(i)})

This update becomes computationally expensive when the feature vector $ϕ (x)$ is high-dimensional. For example, suppose $x \in R^{d}$ and let $ϕ (x)$ be the vector that contains all the monomials of $x_{1}, \dots, x_{d}$ with degree $\leq 3$; then the dimension of $ϕ (x)$ is on the order of $d^{3}$.

It may appear at first that such $d^{3}$ runtime per update and memory usage are inevitable, because the vector $θ$ itself is of dimension $p \approx d^{3}$ , and we may need to update every entry of $θ$ and store it. However, we will introduce the kernel trick with which we will not need to store $θ$ explicitly, and the runtime can be significantly improved.
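To make the cost of the explicit update concrete, here is a minimal Python sketch of the batch step with a feature map. To stay small it assumes the four-dimensional cubic feature map from the Feature Maps section and a made-up data set; `lms_step`, `loss`, the step size, and the iteration count are hypothetical choices for illustration. Each step touches every entry of $θ$, which is what becomes expensive when $p \approx d^{3}$.

```python
def phi(x):
    """Cubic feature map for a scalar input: [1, x, x^2, x^3]."""
    return [1.0, x, x ** 2, x ** 3]


def dot(u, v):
    """Inner product of two equal-length sequences."""
    return sum(a * b for a, b in zip(u, v))


def lms_step(theta, xs, ys, alpha):
    """theta <- theta + alpha * sum_i (y_i - theta^T phi(x_i)) * phi(x_i)."""
    feats = [phi(x) for x in xs]
    residuals = [y - dot(theta, f) for y, f in zip(ys, feats)]
    # Every one of the p entries of theta is updated: O(n * p) work per step.
    return [
        t + alpha * sum(r * f[k] for r, f in zip(residuals, feats))
        for k, t in enumerate(theta)
    ]


def loss(theta, xs, ys):
    """Half the sum of squared residuals."""
    return 0.5 * sum((y - dot(theta, phi(x))) ** 2 for x, y in zip(xs, ys))


# Toy data (an assumption): targets generated by the cubic y = x^3 - x.
xs = [-1.0, -0.5, 0.0, 0.5, 1.0]
ys = [x ** 3 - x for x in xs]

theta = [0.0, 0.0, 0.0, 0.0]
loss_before = loss(theta, xs, ys)
for _ in range(200):
    theta = lms_step(theta, xs, ys, alpha=0.1)
loss_after = loss(theta, xs, ys)
```

With this step size the loss on the toy data decreases from its initial value; a different data set may need a smaller `alpha`.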

For simplicity, we assume the initial value $θ = 0$. At initialization, $θ = 0 = \sum_{i = 1}^{n} 0 \cdot ϕ (x^{(i)})$. Assume that at some point, $θ$ can be represented as

θ = \sum_{i = 1}^{n} β_{i} ϕ (x^{(i)})

for some $β_{1}, \dots, β_{n} \in R$ . Then we claim that in the next round, $θ$ is still a linear combination of $ϕ (x^{(1)}), \dots, ϕ (x^{(n)})$ because

θ := θ + α \sum_{i = 1}^{n} (y^{(i)} - θ^{T} ϕ (x^{(i)})) ϕ (x^{(i)}) = \sum_{i = 1}^{n} (β_{i} + α (y^{(i)} - θ^{T} ϕ (x^{(i)}))) ϕ (x^{(i)})

Our general strategy is to implicitly represent the $p$ -dimensional vector $θ$ by a set of coefficients $β_{1}, \dots, β_{n}$ . Towards doing this, we derive the update rule of the coefficients $β_{1}, \dots, β_{n}$ . Using the equation above, we see that the new $β_{i}$ depends on the old one via

β_{i} \leftarrow β_{i} + α (y^{(i)} - θ^{T} ϕ (x^{(i)}))

Here we still have the old $θ$ on the RHS of the equation. Replacing $θ$ by $θ = \sum_{i = 1}^{n} β_{i} ϕ (x^{(i)})$ gives

\forall i \in \{1, \dots, n\}, \quad β_{i} := β_{i} + α (y^{(i)} - \sum_{j = 1}^{n} β_{j} ϕ (x^{(j)})^{T} ϕ (x^{(i)}))

We often rewrite $ϕ (x^{(j)})^{T} ϕ (x^{(i)})$ as $⟨ ϕ (x^{(j)}), ϕ (x^{(i)}) ⟩$ to emphasize that it is the inner product of the two feature vectors. It may appear that at every iteration we still need to compute the values of $⟨ ϕ (x^{(j)}), ϕ (x^{(i)}) ⟩$ for all pairs $i, j$, each of which may take roughly $O (p)$ operations. However, two important properties come to the rescue:

- We can pre-compute the pairwise inner products $⟨ ϕ (x^{(j)}), ϕ (x^{(i)}) ⟩$ for all pairs $i, j$ before the loop starts.
- Computing the inner product can be efficient and does not necessarily require computing $ϕ (x^{(i)})$ explicitly. This is because

⟨ ϕ (x), ϕ (z) ⟩ = 1 + \sum_{i = 1}^{d} x_{i} z_{i} + \sum_{i, j = 1}^{d} x_{i} x_{j} z_{i} z_{j} + \sum_{i, j, k = 1}^{d} x_{i} x_{j} x_{k} z_{i} z_{j} z_{k} = 1 + \sum_{i = 1}^{d} x_{i} z_{i} + (\sum_{i = 1}^{d} x_{i} z_{i})^{2} + (\sum_{i = 1}^{d} x_{i} z_{i})^{3} = 1 + ⟨ x, z ⟩ + ⟨ x, z ⟩^{2} + ⟨ x, z ⟩^{3}

Therefore, to compute $⟨ ϕ (x), ϕ (z) ⟩$, we can first compute $⟨ x, z ⟩$ in $O (d)$ time and then take another constant number of operations to compute $1 + ⟨ x, z ⟩ + ⟨ x, z ⟩^{2} + ⟨ x, z ⟩^{3}$.
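The identity can be checked numerically on a small input. The sketch below assumes that $ϕ (x)$ lists every ordered monomial of degree at most 3 (so $x_{1} x_{2}$ and $x_{2} x_{1}$ are separate entries), which is the convention under which the identity holds exactly; the names `phi`, `kernel`, and the sample vectors are illustrative.

```python
from itertools import product


def dot(u, v):
    """Inner product of two equal-length sequences."""
    return sum(a * b for a, b in zip(u, v))


def phi(x):
    """All ordered monomials of degree <= 3: length 1 + d + d^2 + d^3."""
    idx = range(len(x))
    feats = [1.0]
    feats += list(x)
    feats += [x[i] * x[j] for i, j in product(idx, repeat=2)]
    feats += [x[i] * x[j] * x[k] for i, j, k in product(idx, repeat=3)]
    return feats


def kernel(x, z):
    """1 + <x, z> + <x, z>^2 + <x, z>^3, using only one O(d) inner product."""
    s = dot(x, z)
    return 1.0 + s + s ** 2 + s ** 3


x = [0.5, -1.0, 2.0]
z = [1.5, 0.25, -0.5]

explicit = dot(phi(x), phi(z))  # O(d^3) work through the feature vectors
implicit = kernel(x, z)         # O(d) work, no feature vectors built
```

For $d = 3$ the explicit feature vector already has 40 entries, while the kernel evaluation needs a single three-term inner product.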

We define the kernel corresponding to the feature map $ϕ$ as a function $K : X \times X \to R$, where $X$ is the space of input attributes, satisfying:

K (x, z) := ⟨ ϕ (x), ϕ (z) ⟩

Now we can pre-compute the values $K (x^{(i)}, x^{(j)})$ and define the matrix $K \in R^{n \times n}$ by $K_{ij} = K (x^{(i)}, x^{(j)})$. Writing $β$ and $y$ for the vectors with entries $β_{i}$ and $y^{(i)}$, we have the new update rule

β \leftarrow β + α (y - K β)
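Putting the pieces together, the kernelized update can be sketched in a few lines of Python. This is a minimal illustration, not a tuned implementation: the toy data, the step size, and the function names are assumptions. The `predict` helper relies on the identity $θ^{T} ϕ (x) = \sum_{i = 1}^{n} β_{i} K (x^{(i)}, x)$, which follows from the representation of $θ$ above but is not spelled out in this section.

```python
def dot(u, v):
    """Inner product of two equal-length sequences."""
    return sum(a * b for a, b in zip(u, v))


def kernel(x, z):
    """K(x, z) = 1 + <x, z> + <x, z>^2 + <x, z>^3, computed in O(d) time."""
    s = dot(x, z)
    return 1.0 + s + s ** 2 + s ** 3


def kernel_lms(xs, ys, alpha, num_iters):
    """Run beta <- beta + alpha * (y - K beta), starting from beta = 0."""
    K = [[kernel(xi, xj) for xj in xs] for xi in xs]  # computed once, before the loop
    beta = [0.0] * len(xs)  # theta = 0 corresponds to beta = 0
    for _ in range(num_iters):
        K_beta = [dot(row, beta) for row in K]
        beta = [b + alpha * (y - kb) for b, y, kb in zip(beta, ys, K_beta)]
    return beta, K


def predict(beta, xs, x_new):
    """theta^T phi(x_new) = sum_i beta_i * K(x^(i), x_new); theta is never formed."""
    return sum(b * kernel(xi, x_new) for b, xi in zip(beta, xs))


def squared_residual(beta, K, ys):
    """Sum of squared training residuals, y - K beta."""
    return sum((y - dot(row, beta)) ** 2 for row, y in zip(K, ys))


# Toy data (an assumption): four points in R^2 with XOR-like targets.
xs = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
ys = [0.0, 1.0, 1.0, 0.0]

beta, K = kernel_lms(xs, ys, alpha=0.01, num_iters=500)
residual_before = squared_residual([0.0] * len(xs), K, ys)
residual_after = squared_residual(beta, K, ys)
```

Each iteration costs $O (n^{2})$ regardless of the feature dimension $p$, because only the $n \times n$ matrix $K$ and the $n$ coefficients $β_{i}$ are ever stored.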

Properties of Kernels

AI Summary

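The core idea of this passage is to introduce kernel methods in machine learning, in particular the concept and motivation of the kernel trick.

Starting point: feature maps and kernel functions
- We usually start from an explicit feature map $ϕ$. This function transforms the original input $x$ into a (possibly higher-dimensional) feature space, giving $ϕ (x)$.
- Based on this feature map, we can define a kernel function $K (x, z)$ that computes the inner product of two original inputs $x$ and $z$ in the feature space: $K (x, z) ≜ ⟨ ϕ (x), ϕ (z) ⟩$. In a real vector space this is usually just the dot product $ϕ (x)^{T} ϕ (z)$.
The key to the kernel trick
- Many machine learning algorithms (for example, training and prediction with support vector machines, SVMs) only need the inner products between feature vectors, not the feature vectors $ϕ (x)$ themselves.
- This means that as long as we can compute the value of the kernel function $K (x, z)$, we can express the whole algorithm in terms of $K$, without explicitly computing or even knowing the feature map $ϕ$.
Motivation: defining the kernel function directly
- Since the algorithm only needs $K$, can we go the other way: instead of defining $ϕ$ first, directly choose or design a function $K (x, z)$ to use?
- The benefit is large: we may choose a $K$ whose corresponding $ϕ$ is very complicated, even infinite-dimensional, so that we cannot write down or compute $ϕ (x)$ explicitly. As long as we can compute $K (x, z)$, the algorithm still runs.
- There is one precondition, however: we must make sure that the chosen $K (x, z)$ really is the inner product of some feature map $ϕ$ in its feature space. In other words, such a $ϕ$ must exist, even if we do not know what it is.
Question: which functions K are valid kernels?
- This is the core question: which functions $K (x, z)$ guarantee that there exists a feature map $ϕ$ such that $K (x, z) = ⟨ ϕ (x), ϕ (z) ⟩$?
- If we can answer this question, we can safely pick a $K$ that satisfies the condition and use it directly in the algorithm, enjoying the benefits of the kernel trick (such as computational efficiency and the ability to handle nonlinear problems).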
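Example: polynomial kernels
- $K (x, z) = (x^{T} z)^{2}$: expanding shows that it equals $\sum_{i, j = 1}^{d} (x_{i} x_{j}) (z_{i} z_{j})$. This makes explicit that it corresponds to a feature map $ϕ (x)$ whose components are all the products $x_{i} x_{j}$. Computing $K (x, z)$ takes only $O (d)$ time, whereas computing $ϕ (x)$ takes $O (d^{2})$ time.
- $K (x, z) = (x^{T} z + c)^{2}$: similarly, expanding shows that the corresponding $ϕ (x)$ contains the terms $x_{i} x_{j}$, the terms $\sqrt{2 c} \, x_{i}$, and the constant term $c$.
- $K (x, z) = (x^{T} z + c)^{k}$: the generalization to a degree-$k$ polynomial kernel. The corresponding feature space has dimension $O (d^{k})$, but computing $K (x, z)$ still takes only $O (d)$ time. This clearly shows the computational advantage of the kernel trick.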
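The second polynomial kernel can be checked against its explicit feature map. The sketch below assumes the feature ordering "all $x_{i} x_{j}$, then $\sqrt{2 c} \, x_{i}$, then $c$" and uses made-up vectors; `phi` and `poly_kernel` are illustrative names.

```python
import math


def dot(u, v):
    """Inner product of two equal-length sequences."""
    return sum(a * b for a, b in zip(u, v))


def phi(x, c):
    """Explicit features for (x^T z + c)^2: all x_i x_j, then sqrt(2c) x_i, then c."""
    idx = range(len(x))
    quadratic = [x[i] * x[j] for i in idx for j in idx]
    linear = [math.sqrt(2.0 * c) * xi for xi in x]
    return quadratic + linear + [c]


def poly_kernel(x, z, c):
    """(x^T z + c)^2, computed in O(d) time."""
    return (dot(x, z) + c) ** 2


x = [1.0, 2.0, 3.0]
z = [0.5, -1.0, 2.0]
c = 2.0

explicit = dot(phi(x, c), phi(z, c))  # builds d^2 + d + 1 features per input
implicit = poly_kernel(x, z, c)       # never builds the features
```

Here $x^{T} z = 4.5$, so both routes give $(4.5 + 2)^{2} = 42.25$; the cross term $2 c \, x^{T} z$ is exactly what the $\sqrt{2 c} \, x_{i}$ features contribute.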
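Intuition for kernels: a similarity measure
- Intuitively, $K (x, z) = ϕ (x)^{T} ϕ (z)$ can be seen as a way of measuring how similar $ϕ (x)$ and $ϕ (z)$ are (the larger the inner product, the closer or more aligned the vectors). Hence $K (x, z)$ can also be seen as a measure of similarity between the original inputs $x$ and $z$ (after mapping through $ϕ$).
- This suggests designing kernel functions based on our understanding of what "similarity" means for the problem.
Example: the Gaussian kernel
- $K (x, z) = \exp (- \frac{\| x - z \|^{2}}{2 σ^{2}})$. When $x$ and $z$ are close, $K$ is close to 1; when they are far apart, $K$ is close to 0. This matches the intuitive notion of similarity.
- The text notes that the Gaussian kernel is indeed a valid kernel, but its corresponding feature map $ϕ$ is infinite-dimensional. This again emphasizes that we cannot use $ϕ$ explicitly and must rely on the kernel function $K$ itself.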
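A minimal Python sketch of the Gaussian kernel illustrates the similarity behaviour. The function name, the default bandwidth `sigma=1.0`, and the sample points are assumptions made for illustration.

```python
import math


def gaussian_kernel(x, z, sigma=1.0):
    """exp(-||x - z||^2 / (2 sigma^2)) for two equal-length sequences."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, z))
    return math.exp(-sq_dist / (2.0 * sigma ** 2))


same = gaussian_kernel([1.0, 2.0], [1.0, 2.0])    # identical points
near = gaussian_kernel([0.0, 0.0], [0.1, 0.0])    # close together
far = gaussian_kernel([0.0, 0.0], [10.0, 10.0])   # far apart
```

Identical points give exactly 1, nearby points give a value just below 1, and distant points give a value that is numerically close to 0. A larger `sigma` makes the kernel decay more slowly with distance.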
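Necessary conditions for a valid kernel
- Suppose $K$ is a valid kernel, that is, $K (x, z) = ⟨ ϕ (x), ϕ (z) ⟩$. What properties must it satisfy?
- Consider any $n$ data points $\{x^{(1)}, \dots, x^{(n)}\}$ and build the $n \times n$ kernel matrix $K$ with $K_{ij} = K (x^{(i)}, x^{(j)})$.
- Symmetry: because the inner product is symmetric, $⟨ u, v ⟩ = ⟨ v, u ⟩$, we have $K_{ij} = ⟨ ϕ (x^{(i)}), ϕ (x^{(j)}) ⟩ = ⟨ ϕ (x^{(j)}), ϕ (x^{(i)}) ⟩ = K_{ji}$. Therefore the kernel matrix $K$ must be symmetric.
- Positive semidefiniteness (PSD): for any vector $z \in R^{n}$, the derivation shows that $z^{T} K z = \sum_{k} (\sum_{i} z_{i} ϕ_{k} (x^{(i)}))^{2} \geq 0$. This means the kernel matrix $K$ must be positive semidefinite.
- Conclusion: if a function $K (x, z)$ is a valid kernel, then for any finite set of points $\{x^{(1)}, \dots, x^{(n)}\}$, the kernel matrix it generates must be symmetric and positive semidefinite. This is a necessary condition for being a valid kernel.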
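The two necessary conditions can be sanity-checked numerically. The sketch below builds a kernel matrix from the Gaussian kernel on a few random points and evaluates the quadratic form $z^{T} K z$ for many random vectors $z$. The point set, the seed, and the helper names are assumptions; sampling random $z$ is only a spot check, not a proof of positive semidefiniteness (computing the eigenvalues of $K$ would be a stronger test).

```python
import math
import random


def gaussian_kernel(x, z, sigma=1.0):
    """exp(-||x - z||^2 / (2 sigma^2)) for two equal-length sequences."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, z))
    return math.exp(-sq_dist / (2.0 * sigma ** 2))


def kernel_matrix(points, kernel):
    """K_ij = kernel(points[i], points[j])."""
    return [[kernel(p, q) for q in points] for p in points]


def quadratic_form(K, z):
    """z^T K z for a square matrix K given as a list of rows."""
    n = len(z)
    return sum(z[i] * K[i][j] * z[j] for i in range(n) for j in range(n))


random.seed(0)  # fixed seed so the run is repeatable
points = [[random.uniform(-1.0, 1.0) for _ in range(2)] for _ in range(6)]
K = kernel_matrix(points, gaussian_kernel)
n = len(points)

# Symmetry: K_ij should equal K_ji.
is_symmetric = all(abs(K[i][j] - K[j][i]) < 1e-12 for i in range(n) for j in range(n))

# PSD spot check: the smallest sampled value of z^T K z should not be negative.
min_form = min(
    quadratic_form(K, [random.gauss(0.0, 1.0) for _ in range(n)])
    for _ in range(1000)
)
```

A function that fails either check on some point set cannot be a valid kernel; passing the checks on one point set does not by itself establish validity.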

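Mercer's Theorem
This material derives necessary conditions for a valid kernel (symmetry and positive semidefiniteness), but does not say whether they are also sufficient. Mercer's theorem is the key result that answers this question.

In short, Mercer's theorem states:

Let $X$ be a compact metric space (for example, a closed and bounded subset of $R^{d}$), and let $K : X \times X \to R$ be a continuous, symmetric function. Then $K$ is a positive definite kernel (that is, for any finite set of points $\{x^{(1)}, \dots, x^{(n)}\} \subset X$ and any real numbers $c_{1}, \dots, c_{n}$, we have $\sum_{i = 1}^{n} \sum_{j = 1}^{n} c_{i} c_{j} K (x^{(i)}, x^{(j)}) \geq 0$; note that "positive definite" is used here in the broad sense that includes positive semidefinite) if and only if there exist a Hilbert space $H$ and a map $ϕ : X \to H$ such that for all $x, z \in X$ we have $K (x, z) = ⟨ ϕ (x), ϕ (z) ⟩_{H}$, and $K$ can be expanded as a uniformly convergent series $K (x, z) = \sum_{k = 1}^{\infty} λ_{k} ψ_{k} (x) ψ_{k} (z)$, where the $λ_{k} \geq 0$ are the eigenvalues of the integral operator associated with $K$ and the $ψ_{k} (x)$ are the corresponding eigenfunctions.

Key implications of Mercer's theorem:

- Sufficiency: it tells us that if a function $K (x, z)$ (under suitable conditions, such as continuity and symmetry) yields a positive semidefinite kernel matrix for every set of data points, then it is guaranteed to be a valid kernel. That is, there must exist a feature space and a map $ϕ$ such that $K (x, z)$ is the inner product in that space.
- Theoretical guarantee: this gives the kernel trick a solid mathematical foundation. As long as the function $K$ we choose or design satisfies the conditions of Mercer's theorem (mainly symmetry and positive semidefinite kernel matrices), we can safely use it in an algorithm even if we do not know what $ϕ$ is.
- Connection: the material derives "if $K$ is a kernel function $⟹$ the kernel matrix $K$ is symmetric and positive semidefinite" (necessity). Mercer's theorem shows "if the kernel matrix $K$ is always symmetric and positive semidefinite $⟹$ $K$ is a kernel function" (sufficiency, under certain conditions).

Summary: the material explains why we want to use the kernel function $K$ directly instead of the feature map $ϕ$ (the kernel trick), and derives the conditions a valid kernel must satisfy (symmetry and positive semidefiniteness). Mercer's theorem in turn provides a sufficient condition for deciding whether a function is a valid kernel: check whether the kernel matrices it generates are always positive semidefinite. This lets us design and use kernels that satisfy the condition directly, which greatly extends the ability of linear algorithms to handle nonlinear problems through an implicit mapping to a high-dimensional space.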
