Dimensionality Reduction

Dimensionality reduction (DR) is a family of techniques for reducing the number of dimensions of a data set:

  • we have a data set $\{ \mathbf x_i \}$ with $\mathbf x_i \in \mathbb R^D$ and very large $D$
  • the goal is to find a mapping $f: \mathbb R^D \to \mathbb R^d$ s.t. $d \ll D$
  • for Visualization the target dimension is usually small, e.g. $d = 2$ or $d =3$
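
As a minimal sketch of such a mapping $f$, the snippet below projects synthetic data from $D = 50$ to $d = 2$ with PCA computed via the SVD. PCA is only one possible choice of $f$; the function name `pca_project` and the random data are assumptions made for illustration.

```python
import numpy as np

def pca_project(X, d=2):
    """Map the rows of X from R^D to R^d using the top-d principal directions."""
    X_centered = X - X.mean(axis=0)           # PCA works on zero-mean data
    # Rows of Vt are the principal directions, ordered by decreasing singular value
    _, _, Vt = np.linalg.svd(X_centered, full_matrices=False)
    return X_centered @ Vt[:d].T              # shape (N, d)

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 50))                # N = 100 synthetic points, D = 50
Z = pca_project(X, d=2)                       # 2-D representation, e.g. for a scatter plot
```

Each row of `Z` is the image $f(\mathbf x_i)$ of the corresponding row of `X`.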


Overfitting

  • DR techniques tend to reduce Overfitting:
  • if the dimensionality of the data is $D$ and there are $N$ examples in the training set,
  • then a model can fit noise once $D$ is comparable to or larger than $N$, so it's good to keep $D$ well below $N$ to avoid overfitting
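
The effect of $D$ relative to $N$ can be sketched with a small experiment. The setup is an assumption chosen for illustration: a least-squares model is fit to pure-noise targets using random features, so any training fit it achieves is overfitting by construction.

```python
import numpy as np

rng = np.random.default_rng(0)
N = 20
y = rng.normal(size=N)                        # pure-noise targets: nothing to learn

def train_mse(D):
    """Training error of a least-squares fit on N examples with D random features."""
    X = rng.normal(size=(N, D))
    w, *_ = np.linalg.lstsq(X, y, rcond=None)
    return float(np.mean((X @ w - y) ** 2))

mse_small_D = train_mse(2)                    # D well below N: the noise cannot be fit
mse_large_D = train_mse(40)                   # D above N: the noise is fit almost exactly
```

With $D > N$ the training error drops to essentially zero even though the targets are random, which is the memorization that reducing $D$ is meant to prevent.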


Aggressiveness

  • Note that DR techniques may remove important information
  • Aggressiveness of reduction is the ratio $D / d$, e.g. reducing $D = 1000$ to $d = 10$ gives an aggressiveness of 100


Feature Selection

Information Retrieval and Text Mining

In IR these techniques are usually called "Term Selection" rather than "Feature Selection".


Usual IR and indexing techniques for reducing dimensionality are

Term Clustering


General Techniques


Feature Extraction

Factor Analysis

Generate new features based on the original ones
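
A minimal sketch of what "new features based on the original ones" means, contrasted with selecting a subset of the original features. The column indices and the random weight matrix `W` are hypothetical and chosen only for illustration; a real method would learn `W` from the data.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(5, 6))                   # 5 examples, D = 6 original features

# Feature selection: keep a subset of the original columns unchanged
selected = X[:, [0, 3]]                       # hypothetical choice of columns 0 and 3

# Feature extraction: each new feature is a function of all original features;
# here a linear map with an arbitrary weight matrix W
W = rng.normal(size=(6, 2))
extracted = X @ W                             # shape (5, 2), d = 2 new features
```

Both results have $d = 2$ columns, but only `selected` still contains original feature values.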


Linear


Non-Linear


Links


Sources