Kernel Methods

Support Vector Machines


Kernel is a generalized dot product

Kernel Methods

  • data items are mapped into some high-dimensional space where we use Inner Product for constructing models
  • kernel acts as an "interface" between the model and the high-dimensional space
  • by defining (often implicitly) a mapping function that transform the input space to some (possibly very high dimensional) feature space

Just dot product is enough for many algorithms:



Use of Kernels

Classification

Support Vector Machines

Text Mining

Latent Semantic Kernels: Use Latent Semantic Analysis

Clustering Analysis

Regression

Support Vector Regression

  • Smola, Alex J., and Bernhard Schölkopf. "A tutorial on support vector regression." 2004. [1]


Anomaly Detection

Gaussian Processes

Kernel Density Estimation

Kernels

Data Span Solution

Choosing a kernel $\equiv$ choosing a feature space

  • $k(\mathbf x, \mathbf z) = \langle \varphi(\mathbf x), \varphi(\mathbf z) \rangle$
  • given a dataset, apply $k$ to each pair and get a Kernel Matrix (also called Gram Matrix)

$$K = \begin{bmatrix} k(\mathbf x_1, \mathbf x_1) & k(\mathbf x_1, \mathbf x_2) & \cdots & k(\mathbf x_1, \mathbf x_n) \\ k(\mathbf x_2, \mathbf x_1) & k(\mathbf x_2, \mathbf x_2) & \cdots & k(\mathbf x_2, \mathbf x_n) \\ \vdots & \vdots & \ddots & \vdots\\ k(\mathbf x_n, \mathbf x_1) & k(\mathbf x_n, \mathbf x_2) & \cdots & k(\mathbf x_n, \mathbf x_n) \\ \end{bmatrix}$$


If we have a linear learning machine with parameters $\mathbf w$,

  • then the function we try to learn is modeled by $f(\mathbf x) = \mathbf w^T \varphi(\mathbf x)$
  • we can express $\mathbf w$ as a combination of training points: $\mathbf w = \sum_{i = 1}^{n} \alpha_i \, \varphi(\mathbf x_i)$ (i.e. the solution lies in the span of the data)
  • then $f(\mathbf x) = \sum_{i = 1}^{n} \alpha_i \, k(\mathbf x_i, \mathbf x)$


Mercer Kernels

Types

We can combine Kernels:

  • given base kernel $k(\mathbf x, \mathbf z)$ (can be a dot product $k(\mathbf x, \mathbf z) = \langle \mathbf x, \mathbf z \rangle)$)
  • Polynomial Kernel: $k'(\mathbf x, \mathbf z) = \big( k(\mathbf x, \mathbf z) + D \big)^p$
  • Gaussian Kernel: $k'(\mathbf x, \mathbf z) = \exp \left( \cfrac{k(\mathbf x, \mathbf x) - 2 \, k(\mathbf x, \mathbf z) + k(\mathbf z, \mathbf z)}{\sigma^2} \right)$


References

  • Hofmann, Thomas, Bernhard Schölkopf, and Alexander J. Smola. "Kernel methods in machine learning." 2008. [2]


Sources

  • Cristianini, Nello, John Shawe-Taylor, and Huma Lodhi. "Latent semantic kernels." 2002. [3]