Given $D$ features $f_1, \ldots, f_D$ and an outcome $Y$:
- rank these features according to some criterion of "importance"
- keep only the important ones, either
  - the top $d$ features, or
  - the features whose scores exceed some threshold
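
The two selection rules above can be sketched in a few lines of Python. This is a minimal illustration, not a reference implementation: it assumes the importance scores have already been computed and stored in a list, and the function names `select_top_d` and `select_above_threshold` are hypothetical.

```python
def select_top_d(scores, d):
    """Return the indices of the d highest-scoring features, best first."""
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return ranked[:d]


def select_above_threshold(scores, threshold):
    """Return the indices of all features whose score exceeds the threshold."""
    return [i for i, s in enumerate(scores) if s > threshold]


# Made-up importance scores for D = 4 features (illustrative only).
scores = [0.1, 2.5, 0.7, 3.2]

top_2 = select_top_d(scores, 2)                 # indices of the two best features
kept = select_above_threshold(scores, 0.5)      # indices with score > 0.5
```

With the top-$d$ rule the number of kept features is fixed in advance, whereas with a threshold it depends on how the scores are distributed.
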
Criteria of usefulness:
- All such scoring functions capture the intuition that the best features for predicting the outcome $Y$ are the ones whose distributions differ most across the values of $Y$
- Usually these functions measure (in)dependence between $f_i$ and $Y$
- the more strongly a feature depends on $Y$, the better it is for classification
E.g. $\chi^2$ measures how much the observed results differ from the results expected under the null hypothesis (here, that $f_i$ and $Y$ are independent)
- lower values indicate less dependence
- so for $\chi^2$ we want to keep the features with the largest values
- Sebastiani, Fabrizio. "Machine learning in automated text categorization." (2002).
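
As an illustration of the $\chi^2$ criterion, the sketch below computes the Pearson statistic $\chi^2 = \sum (O - E)^2 / E$ over the contingency table of one discrete feature against $Y$, where $O$ is an observed cell count and $E$ is the count expected if the feature and $Y$ were independent. The function name `chi2_score` and the toy data are assumptions made for this example; the code assumes discrete (e.g. binary) features and is not taken from the cited paper.

```python
def chi2_score(feature, labels):
    """Pearson chi-squared statistic between a discrete feature and the labels."""
    n = len(feature)
    f_vals = sorted(set(feature))
    y_vals = sorted(set(labels))

    # Observed counts for every (feature value, label) cell.
    observed = {(f, y): 0 for f in f_vals for y in y_vals}
    for f, y in zip(feature, labels):
        observed[(f, y)] += 1

    # Marginal totals, used to compute the counts expected under independence.
    f_total = {f: sum(observed[(f, y)] for y in y_vals) for f in f_vals}
    y_total = {y: sum(observed[(f, y)] for f in f_vals) for y in y_vals}

    score = 0.0
    for f in f_vals:
        for y in y_vals:
            expected = f_total[f] * y_total[y] / n
            score += (observed[(f, y)] - expected) ** 2 / expected
    return score


# Toy data: 8 observations with a binary outcome (illustrative only).
labels = [1, 1, 1, 1, 0, 0, 0, 0]
informative = [1, 1, 1, 0, 0, 0, 0, 1]    # mostly agrees with the labels
uninformative = [1, 0, 1, 0, 1, 0, 1, 0]  # split evenly within each class

score_informative = chi2_score(informative, labels)
score_uninformative = chi2_score(uninformative, labels)
```

The feature that agrees with the labels gets the larger score, while the feature that is split evenly within each class scores zero, so ranking by $\chi^2$ and keeping the largest values retains the more dependent feature.
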