Dimensionality Reduction

Dimensionality reduction (DR) is a family of techniques for reducing the number of dimensions (features) of a data set.

  • we have a data set $\{ \mathbf x_i \}$ with $\mathbf x_i \in \mathbb R^D$ and very large $D$
  • the goal is to find a mapping $f: \mathbb R^D \to \mathbb R^d$ s.t. $d \ll D$
  • for Visualization the target dimension is usually small, e.g. $d = 2$ or $d =3$


  • DR techniques tend to reduce Overfitting:
  • if the dimensionality of the data is $D$ and there are $N$ examples in the training set,
  • then it's good to have $D$ no larger than $N$ (ideally $D \ll N$) to avoid overfitting


  • Note that DR techniques may sometimes remove important information
  • The aggressiveness of the reduction is measured by the ratio $D / d$
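
The mapping $f$ and the ratio $D / d$ can be illustrated with a minimal sketch. It assumes $f$ is a Gaussian random projection, which is only one possible choice and is not prescribed by these notes; the function name `reduce_random` and the sizes $N = 100$, $D = 1000$ are hypothetical.

```python
import numpy as np

def reduce_random(X, d, seed=0):
    """Map the rows of X from R^D to R^d with a Gaussian random projection."""
    rng = np.random.default_rng(seed)
    D = X.shape[1]
    # Scaling by 1/sqrt(d) preserves squared norms in expectation.
    W = rng.normal(size=(D, d)) / np.sqrt(d)
    return X @ W

rng = np.random.default_rng(42)
X = rng.normal(size=(100, 1000))           # N = 100 examples, D = 1000 (made-up data)
Z = reduce_random(X, d=2)                  # d = 2, e.g. for visualization
aggressiveness = X.shape[1] / Z.shape[1]   # D / d
```

Here every example keeps its row in `Z`, but is described by 2 numbers instead of 1000, so the reduction is quite aggressive and may well discard important information.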

Feature Selection

Information Retrieval and Text Mining

In IR these techniques are usually called "Term Selection" rather than "Feature Selection".
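
As a rough sketch of what term selection might look like, the snippet below keeps only terms whose document frequency falls within a chosen range. The document-frequency criterion, the function name `select_terms`, the thresholds, and the toy documents are all assumptions made for illustration; these notes do not name a specific criterion.

```python
from collections import Counter

def select_terms(docs, min_df=2, max_df_ratio=0.9):
    """Keep terms that appear in at least min_df documents
    and in at most max_df_ratio * N documents."""
    n_docs = len(docs)
    df = Counter()
    for doc in docs:
        # Count each term at most once per document.
        df.update(set(doc.lower().split()))
    return sorted(
        term for term, count in df.items()
        if min_df <= count <= max_df_ratio * n_docs
    )

docs = [
    "the cat sat on the mat",
    "the dog sat on the log",
    "the cat chased the dog",
]
vocab = select_terms(docs)
```

The full vocabulary of the toy corpus has 8 distinct terms; the selected subset drops both the term that occurs in every document and the terms that occur in only one.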

Usual IR and indexing techniques for reducing dimensionality are

Term Clustering

General Techniques

Feature Extraction

Factor Analysis

Generate new features based on the original ones
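
To make "new features based on the original ones" concrete, here is a minimal sketch that uses PCA computed via SVD. PCA is chosen only as a familiar example of feature extraction (it is not factor analysis itself); the function name `pca_extract` and the random data are hypothetical.

```python
import numpy as np

def pca_extract(X, d):
    """Project the rows of X onto the top-d principal directions."""
    Xc = X - X.mean(axis=0)   # center each original feature
    # Rows of Vt are orthonormal directions, sorted by decreasing variance.
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    # Each new feature is a linear combination of all original features.
    return Xc @ Vt[:d].T

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 10))   # N = 50 examples, D = 10 (made-up data)
Z = pca_extract(X, d=3)         # d = 3 extracted features
```

Unlike feature selection, none of the columns of `Z` is one of the original columns of `X`; each is a weighted mix of all of them.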