Feature Filtering

Given $D$ features $f_1, \dots, f_D$ and an outcome $Y$:

  • rank the features according to some criterion of "importance"
  • keep only the important ones


Which ones to keep:

  • the top $d$ features
  • the features whose scores are above some threshold
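
The two selection rules above can be sketched in a few lines of Python. This is only an illustration: the function names `select_top_d` and `select_above_threshold` and the score values are hypothetical, and the scores are assumed to come from some importance criterion where larger means more important.

```python
def select_top_d(scores, d):
    """Return the indices of the d highest-scoring features, best first."""
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return ranked[:d]


def select_above_threshold(scores, threshold):
    """Return the indices of features whose score is strictly above the threshold."""
    return [i for i, s in enumerate(scores) if s > threshold]


# Hypothetical importance scores for D = 5 features (made-up numbers).
scores = [0.10, 2.50, 0.75, 3.20, 0.05]

top2 = select_top_d(scores, d=2)                        # features 3 and 1
above = select_above_threshold(scores, threshold=0.5)   # features 1, 2 and 3
print(top2, above)
```

The top-$d$ rule fixes the number of features in advance, while the threshold rule lets that number depend on how the scores are distributed.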

Criteria of usefulness:

  • All scoring functions capture the intuition that the best features for predicting the outcome $Y$ are those whose distributions differ most across the values of $Y$
  • Usually these functions measure the (in)dependence between $f_i$ and $Y$
  • the more a feature depends on $Y$, the more useful it is for classification

E.g. $\chi^2$ measures how much the observed results differ from the results expected under the null hypothesis (here, that $f_i$ and $Y$ are independent)

  • lower values indicate less dependency
  • so for $\chi^2$ we keep the features with the largest values
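
As a rough sketch of how such a score can be computed, the code below uses the standard Pearson statistic $\chi^2 = \sum (O - E)^2 / E$, where $O$ are the observed counts of each (feature value, class) pair and $E$ are the counts expected if the feature and $Y$ were independent. It assumes discrete features and labels. The function name `chi2_score` and the toy data are hypothetical.

```python
def chi2_score(feature, labels):
    """Pearson chi-squared statistic between a discrete feature and discrete labels."""
    n = len(feature)
    f_values = sorted(set(feature))
    y_values = sorted(set(labels))

    # Observed counts for every (feature value, label) pair.
    observed = {(a, b): 0 for a in f_values for b in y_values}
    for a, b in zip(feature, labels):
        observed[(a, b)] += 1

    # Marginal totals, used to compute counts expected under independence.
    f_totals = {a: sum(observed[(a, b)] for b in y_values) for a in f_values}
    y_totals = {b: sum(observed[(a, b)] for a in f_values) for b in y_values}

    stat = 0.0
    for a in f_values:
        for b in y_values:
            expected = f_totals[a] * y_totals[b] / n
            stat += (observed[(a, b)] - expected) ** 2 / expected
    return stat


# Toy data (made up): 8 observations with a binary outcome Y.
labels = [1, 1, 1, 1, 0, 0, 0, 0]
f_dependent = [1, 1, 1, 1, 0, 0, 0, 0]    # identical to Y
f_independent = [1, 0, 1, 0, 1, 0, 1, 0]  # same distribution in both classes

dependent_score = chi2_score(f_dependent, labels)
independent_score = chi2_score(f_independent, labels)
print(dependent_score, independent_score)
```

The feature that mirrors $Y$ gets a large score and the feature distributed identically in both classes gets a score of zero, so ranking by $\chi^2$ and keeping the largest values retains the dependent feature.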


  • Sebastiani, Fabrizio. "Machine learning in automated text categorization." (2002). [1]