# ML Wiki

$\require{cancel}$

## Latent Semantic Kernels

In Information Retrieval via Vector Space Model, retrieval is based on inner product as well

• so can also use Kernels
• You can already use SVM for text data and get very good performance
• but also can incorporate additional information by using a kernel
• In traditional Vector Space Model semantic relationships are not taken into account
• Goal: design a "Semantic Kernel": a kernel which creates a map that captures semantic information

Building the Kernel:

• Use Semantic Network (e.g. WordNet) to capture similarity between words and encode this information into kernel
• use Latent Semantic Analysis to learn the semantic space and a mapping function that brings input space to the semantic spaces
• in semantic space documents that don't share any terms still can be close if they terms are semantically related
• this semantic similarity is inferred by analyzing co-occurrence patterns: terms that co-occur often in the same documents are considered related
• this statistical co-occurrence is extracted via SVD

## Vector Space Model

Suppose we have a term-document matrix $D$

• then $G = D^T D$ is a doc-by-doc matrix and $T = D D^T$ matrix
• can define a base kernel as $k(\mathbf d_1, \mathbf d_2) = \mathbf d_1^T \mathbf d_2$
• suppose we apply some Linear Transformation $\phi$: to documents: $\phi(\mathbf d) = P \, \mathbf d$
• (usual VSM: $P = I$)
• then kernel becomes $k(\mathbf d_1, \mathbf d_2) = \mathbf d_1^T P^T P \mathbf d_2$ and the kernel matrix $K = D^T P^T \, P \, D$
• now can build a new kernel $k'$ using the base kernel $k$ (e.g. polynomial or gaussian) - to increase the expressive power of the VSM model

## Semantic Kernels

But in reality usual VSM models have problems with synonymy and polysemy - these kernels can't handle that

how to enrich kernels with semantic information?

• document expansion: add all synonyms to the document
• or replace words by concepts (can be taken from a semantic network or learned)
• use information about term-term correlation!

So $K = D^T P^T \, P \, D$

• let $P_{ij}$ denote semantic proximity between terms $i$ and $j$
• then is a square symmetric matrix
• so have $k(\mathbf d_1, \mathbf d_2) = \mathbf d_1^T P^T P \mathbf d_2 = \mathbf d_1^T P^2 \, \mathbf d_2$
• also note that $\| P\, \mathbf d_1 - P\, \mathbf d_2 \|^2 = \| P \, (\mathbf d_1 - \mathbf d_2) \|^2 = (\mathbf d_1 - \mathbf d_2)^T P^2 \, (\mathbf d_1 - \mathbf d_2)$: can use this to apply Gaussian Kernel

$P$ can also be concept-term similarity matrix, but then $P$ will not be symmetric

### LSI Kernel

LSI: document feature vector $\mathbf d$ is projected onto the subspace spanned by first $k$ singular vectors if the feature space

• apply SVD to $D$: $D = U \Sigma V^T$ where $U$ contains the singular vectors of the feature space
• the projection on $k$ first singular values gives $U_k$, so let $P = U_k^T$
• $U_k$ identifies terms that co-occur most often
• this have $k(\mathbf d_1, \mathbf d_2) = (U_k^T \mathbf d_1)^T U_k^T \mathbf d_2 = \mathbf d_1^T U_k \, U_k^T \mathbf d_2$

Suppose we have our kernel classifier

• $f(\mathbf d) = \sum_{i = 1}^{n} \alpha_i \, k(\mathbf d_i, \mathbf d)$
• then by using this kernel, we have
• $f(\mathbf d) = \sum_{i = 1}^{n} \alpha_i \, \big( \mathbf d_i^T U_k \, U_k^T \mathbf d \big)$
• now consider a vector $\boldsymbol \alpha = (\alpha_1, \ ... \ , \alpha_n)$
• we can replace $\big( \mathbf d_i^T U_k \, U_k^T \mathbf d \big)$ with a vector $D^T U_k \, U_k^T \mathbf d$ where $D$ has documents $\mathbf d_1, \, ... \, \mathbf d_n$ as columns
• so $f(\mathbf d) = \boldsymbol \alpha^T D^T U_k \, U_k^T \mathbf d = \ ...$ let's replace $D$ with its SVD
• $... \ = \boldsymbol \alpha^T V \, \Sigma U^T U_k \, U_k^T \mathbf d = \ ...$ replace $U_k = U \, I_k$
• $... \ = \boldsymbol \alpha^T V \, \Sigma \cancel{U^T U} I_k \, U_k^T \mathbf d = \ ...$
• $... \ = \boldsymbol \alpha^T V \, \Sigma \, I_k \, U_k^T \mathbf d = \ ...$ note that $\Sigma \, I_k = \Sigma_k$
• $... \ = \boldsymbol \alpha^T V \, \Sigma_k \, U_k^T \mathbf d = \ ...$

Can we avoid working on the feature space? We still have $\Sigma_k$ and $U_k$

• $\Sigma_k \, U_k^T \mathbf d$ - need to express in terms of the input space
• $\Sigma_k \, U_k^T \mathbf d = I_k \Sigma \, U^T \mathbf d = \ ...$
• $... \ = \Sigma_k \, U_k^T \mathbf d = I_k \underbrace{V^T V}_{I} \Sigma \, U^T \mathbf d = \ ...$
• $... \ = \Sigma_k \, U_k^T \mathbf d = I_k V^T \underbrace{V \Sigma \, U^T}_{D^T} \mathbf d = \ ...$
• $... \ = \Sigma_k \, U_k^T \mathbf d = I_k V^T D^T \mathbf d$
• let's plug it to $f(\mathbf d)$:
• $f(\mathbf d) = \boldsymbol \alpha^T V \, I_k V^T D^T \mathbf d$
• so we avoid working on the feature space directly:
• $V$ can be computed by Eigendecomposition of the kernel matrix $K$

Bottleneck:

• decomposing $K$
• approximate: Smola, Alex J., and Bernhard Schölkopf. "Sparse greedy matrix approximation for machine learning." 2000. [1]

## Sources

• Cristianini, Nello, John Shawe-Taylor, and Huma Lodhi. "Latent semantic kernels." 2002. [2]