Term Contribution

feature-selection

Idea: the result of clustering depends heavily on how similar the documents are,

so the contribution of a term $t$ is how much it contributes to the similarity of pairs of documents

Text clustering is highly dependent on document similarity.

Suppose we use a dot-product-based similarity:
$\text{similarity}(d_i, d_j) = \sum_{t \in V} f(t, d_i) \times f(t, d_j)$
- where $f(t, d)$ is the weight of term $t$ in document $d$, and $V$ is the vocabulary (the set of all terms)
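A minimal sketch of the dot-product similarity, assuming each document is stored as a dict mapping terms to weights. The function name `dot_similarity` and the toy weights are hypothetical, chosen only for illustration:

```python
def dot_similarity(doc_i, doc_j):
    """Dot-product similarity of two documents given as {term: weight} dicts."""
    # A term absent from a document has weight 0, so only terms that
    # occur in both documents add anything to the sum.
    shared_terms = doc_i.keys() & doc_j.keys()
    return sum(doc_i[t] * doc_j[t] for t in shared_terms)


# Toy documents with made-up weights (assumption, not from the sources).
d1 = {"cluster": 2.0, "text": 1.0}
d2 = {"cluster": 1.0, "term": 3.0}
sim = dot_similarity(d1, d2)
```

Here only "cluster" is shared, so it is the only term that contributes to the similarity of this pair.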

The contribution of each term is its overall contribution to the documents' similarities, summed over all pairs of distinct documents:
$\text{TC}(t) = \sum_{i, j,\ i \ne j} f(t, d_i) \times f(t, d_j)$

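It's slow: evaluated directly from the definition over all document pairs, it takes $O(n^2)$ time in the number of documents $n$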
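A rough sketch of the computation, assuming the term contribution of $t$ is the sum of $f(t, d_i) \times f(t, d_j)$ over all ordered pairs of distinct documents, and that documents are dicts of term weights. The function names and toy data are hypothetical. The first function follows the definition pair by pair, which is where the quadratic cost comes from. The second uses the algebraic identity $\sum_{i \ne j} w_i w_j = (\sum_i w_i)^2 - \sum_i w_i^2$ purely as a cross-check; that identity is a standard fact, not something taken from the sources.

```python
from itertools import permutations


def term_contribution_naive(term, docs):
    """Sum f(t, d_i) * f(t, d_j) over all ordered document pairs with i != j."""
    # Weight of the term in each document; 0.0 when the term is absent.
    weights = [doc.get(term, 0.0) for doc in docs]
    # permutations(..., 2) yields every ordered pair of distinct positions,
    # so this sum is quadratic in the number of documents.
    return sum(w_i * w_j for w_i, w_j in permutations(weights, 2))


def term_contribution_identity(term, docs):
    """Same quantity via (sum of weights)^2 - (sum of squared weights)."""
    weights = [doc.get(term, 0.0) for doc in docs]
    total = sum(weights)
    return total * total - sum(w * w for w in weights)


# Toy corpus with made-up weights (assumption, not from the sources).
docs = [
    {"cluster": 2.0, "text": 1.0},
    {"cluster": 1.0, "term": 3.0},
    {"cluster": 3.0},
]
vocabulary = set().union(*docs)
scores = {t: term_contribution_naive(t, docs) for t in sorted(vocabulary)}
```

A term that occurs in only one document scores zero, since it never contributes to the similarity of any pair. For feature selection, one would rank terms by this score and keep the highest-scoring ones.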

Sources

Liu, Tao, et al. “An evaluation on feature selection for text clustering.” ICML. Vol. 3. 2003.
Aggarwal, Charu C., and ChengXiang Zhai. “A survey of text clustering algorithms.” Mining Text Data. Springer US, 2012.
http://cs.gmu.edu/~carlotta/teaching/INFS-795-s05/readings/INFS795_MCayci.ppt
