Consider each feature $F_i$ as a random variable taking a value $f_i$

Entropy is

- $H(F_1, \ ... \ , F_M) = - \sum_{f_1} \ ... \ \sum_{f_M} p(f_1, \ ... \ , f_M) \log p(f_1, \ ... \ , f_M)$
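
As a purely illustrative sketch of this definition (not from the cited papers), a small Python function that estimates the joint entropy from the observed value combinations; `joint_entropy` and the toy array are hypothetical names, and with more than a few features the joint distribution becomes far too sparse to estimate this way:

```python
import numpy as np
from collections import Counter

def joint_entropy(X):
    """Empirical joint entropy H(F_1, ..., F_M) over rows of discrete feature values.

    X: (n_samples, M) array of discrete feature values (illustrative only).
    """
    counts = Counter(map(tuple, np.asarray(X)))            # count each combination (f_1, ..., f_M)
    p = np.array(list(counts.values()), dtype=float) / len(X)
    return -np.sum(p * np.log2(p))                         # H = -sum p(f) log p(f)

# Tight clusters -> few distinct value combinations -> low entropy.
X = np.array([[0, 0], [0, 0], [0, 1], [5, 5], [5, 5], [5, 4]])
print(joint_entropy(X))   # ~1.92 bits; a uniform spread over 6 combinations would give ~2.58
```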

When the data has well-formed clusters, the uncertainty is low, and so is the entropy.

- In real-world data, clusters are rarely perfectly well-formed.
- Two points belonging to the same cluster, or to two clearly different clusters, contribute less to the total entropy than if they were uniformly spread out.
- Similarity $S_{ij}$ between two instances $X_i$ and $X_j$ is high if the two instances are very close, and low if they are far apart. The pairwise entropy $H_{ij}$ is low if $S_{ij}$ is either high or low, and high otherwise (see the sketch after this list).
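
To see the last point numerically: the per-pair entropy $-\big(S_{ij}\log S_{ij} + (1 - S_{ij})\log(1 - S_{ij})\big)$ is near zero when $S_{ij}$ is close to 0 or 1, and peaks at $S_{ij} = 0.5$. A minimal sketch (hypothetical function name, illustrative values):

```python
import numpy as np

def pair_entropy(S):
    """Entropy contribution of one pair of points with similarity S in (0, 1)."""
    return -(S * np.log2(S) + (1 - S) * np.log2(1 - S))

# Near-duplicate or clearly separated pairs contribute little;
# "ambiguous" pairs (S around 0.5) contribute the most.
for S in (0.01, 0.5, 0.99):
    print(S, round(float(pair_entropy(S)), 3))   # 0.081, 1.0, 0.081
```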

We can measure the quality of a term $t$ by the entropy of the data when $t$ is pruned (removed):

$$H(t) = - \sum_{i = 1}^n \sum_{j = 1}^n \Big[ S_{ij} \log S_{ij} + (1 - S_{ij}) \log (1 - S_{ij}) \Big]$$

where $S_{ij}$ is the similarity between documents $i$ and $j$ computed with term $t$ removed:

$$S_{ij} = 2^{-\frac{\text{dist}(i, j)}{\text{avg.dist}}}$$

- $\text{dist}(i, j)$ - the distance between documents $i$ and $j$ when $t$ is removed
- $\text{avg.dist}$ - the average pairwise distance when $t$ is removed
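
Putting the pieces together, a minimal sketch of scoring a single term this way. It assumes a dense document-term matrix `X`, Euclidean distance, and base-2 logs (all assumptions, not fixed by the papers); `term_entropy` is a hypothetical name, and the condensed `pdist` output counts each unordered pair once rather than twice as in the double sum above, which only rescales the score:

```python
import numpy as np
from scipy.spatial.distance import pdist

def term_entropy(X, t):
    """Entropy H(t) of the data with term (column) t removed.

    X: (n_docs, n_terms) document-term matrix. A low H(t) means the data is
    still well-clustered without t, so t is a candidate for pruning.
    """
    X_wo = np.delete(X, t, axis=1)         # drop column t
    d = pdist(X_wo)                        # all pairwise Euclidean distances, O(n^2)
    S = 2.0 ** (-d / d.mean())             # S_ij = 2^(-dist(i, j) / avg.dist)
    eps = 1e-12                            # guard log(0) when S_ij hits 0 or 1
    return -np.sum(S * np.log2(S + eps) + (1 - S) * np.log2(1 - S + eps))
```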

But this is very slow: computing $H(t)$ costs $O(n^2)$ pairwise comparisons for each term $t$.
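
A hypothetical ranking loop on top of the `term_entropy` sketch above makes the cost explicit: one full $O(n^2)$ pairwise pass per term, i.e. $O(T \cdot n^2)$ for $T$ terms.

```python
import numpy as np

def rank_terms(X):
    """Rank terms by H(t): least important (lowest entropy after removal) first."""
    scores = np.array([term_entropy(X, t) for t in range(X.shape[1])])
    return np.argsort(scores)              # prune terms from the front of this ordering
```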

- [1] Dash, Manoranjan, and Huan Liu. "Feature Selection for Clustering." 2000.
- [2] Aggarwal, Charu C., and ChengXiang Zhai. "A Survey of Text Clustering Algorithms." Mining Text Data. Springer US, 2012.
- http://cs.gmu.edu/~carlotta/teaching/INFS-795-s05/readings/INFS795_MCayci.ppt