Data Discretization

What if we want to transform a continuous attribute into a categorical one?


Equal-Width Partitioning

Also called distance partitioning

  • want to divide $X = (x_1, ..., x_m)$ into $N$ equal intervals
  • let $A = \min X$ and $B = \max X$
  • width: $W = \cfrac{B - A}{N}$
  • discretization-equal-width.png
  • suppose that almost all your data ends up in one such interval (e.g. because a single outlier stretches the range)
  • then you lose a lot of information
  • so this method is sensitive to outliers
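
The steps above can be sketched in a few lines of plain Python. This is only a minimal illustration: the function name `equal_width_bins` and the sample data are made up for the example, and the largest value is clamped into the last interval so that $B$ does not fall outside the range.

```python
def equal_width_bins(xs, n):
    """Assign each value in xs to one of n equal-width intervals (0-based index)."""
    a, b = min(xs), max(xs)
    w = (b - a) / n  # W = (B - A) / N
    if w == 0:
        # All values are identical, so everything goes into the first interval.
        return [0] * len(xs)
    # The maximum B sits exactly on the right edge, so clamp it into the last bin.
    return [min(int((x - a) / w), n - 1) for x in xs]


# Hypothetical data with one outlier (100).
data = [1, 2, 3, 4, 5, 100]
bins = equal_width_bins(data, 3)
print(bins)  # [0, 0, 0, 0, 0, 2]
```

The single outlier forces every other value into the first interval and leaves the middle one empty, which is the information loss described above.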


Equal-Depth Partitioning

Also called frequency partitioning

  • divides $X$ into $N$ intervals,
  • with each interval containing approximately the same number of samples
  • not sensitive to outliers
  • distribution of values is taken into account
  • discretization-equal-depth.png
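
A rough sketch of frequency partitioning, again in plain Python. The name `equal_depth_bins` and the data are assumptions made for the example; the interval is chosen from the rank of each value, and for simplicity tied values may end up in different intervals.

```python
def equal_depth_bins(xs, n):
    """Assign each value in xs to one of n intervals of roughly equal size."""
    # Indices of xs ordered by their values, i.e. the rank of every sample.
    order = sorted(range(len(xs)), key=lambda i: xs[i])
    bins = [0] * len(xs)
    for rank, i in enumerate(order):
        # Spread the ranks 0..m-1 evenly over the n intervals.
        bins[i] = rank * n // len(xs)
    return bins


# Same hypothetical data with one outlier (100).
data = [1, 2, 3, 4, 5, 100]
depth_bins = equal_depth_bins(data, 3)
print(depth_bins)  # [0, 0, 1, 1, 2, 2]
```

Each interval receives two samples, and the outlier simply lands in the last interval without distorting the others.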


Entropy-Based Discretization

Uses entropy to find the best way to split your data

  • find the value $\alpha$ that maximizes the Information Gain
  • split by $\alpha$
  • repeat recursively until there are $N$ intervals or no further information gain is possible
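
The procedure can be sketched as follows. Information gain needs class labels, so this sketch assumes every sample comes with one. The function names, the choice of midpoints as candidate values for $\alpha$, and the greedy rule of always splitting the segment with the highest gain are all assumptions of this example, not something the notes prescribe.

```python
from collections import Counter
from math import log2


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum(c / total * log2(c / total) for c in Counter(labels).values())


def best_split(pairs):
    """pairs: (value, label) tuples sorted by value. Return (gain, alpha) or None."""
    labels = [y for _, y in pairs]
    base = entropy(labels)
    best = None
    for i in range(1, len(pairs)):
        if pairs[i - 1][0] == pairs[i][0]:
            continue  # cannot place a cut between two equal values
        left, right = labels[:i], labels[i:]
        weighted = (len(left) * entropy(left) + len(right) * entropy(right)) / len(labels)
        gain = base - weighted
        # Small threshold so floating-point noise does not count as a gain.
        if gain > 1e-12 and (best is None or gain > best[0]):
            alpha = (pairs[i - 1][0] + pairs[i][0]) / 2  # midpoint as cut value
            best = (gain, alpha)
    return best


def entropy_cut_points(values, labels, n):
    """Return sorted cut points that produce at most n intervals."""
    segments = [sorted(zip(values, labels))]
    cuts = []
    while len(segments) < n:
        candidates = [(best_split(s), k) for k, s in enumerate(segments)]
        candidates = [(b, k) for b, k in candidates if b is not None]
        if not candidates:
            break  # no split yields any information gain
        (gain, alpha), k = max(candidates)
        seg = segments.pop(k)
        # Replace the segment by its two halves, keeping value order.
        segments.insert(k, [p for p in seg if p[0] > alpha])
        segments.insert(k, [p for p in seg if p[0] <= alpha])
        cuts.append(alpha)
    return sorted(cuts)


# Hypothetical labelled data: two well-separated classes.
values = [1, 2, 3, 10, 11, 12]
labels = ["a", "a", "a", "b", "b", "b"]
cuts = entropy_cut_points(values, labels, 3)
print(cuts)  # [6.5]
```

Although three intervals were requested, only one cut is produced: after splitting at 6.5 both segments are pure, so no further information gain is possible and the recursion stops early.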

