Term Contribution

Idea: the result of clustering depends heavily on how similar the documents are

so the contribution of a term $t$ is how much it contributes to the similarity of pairs of documents

Text clustering is highly dependent on document similarity.

  • Suppose we use a dot-product-based similarity:
  • $\text{similarity}(d_i, d_j) = \sum_{t \in V} f(t, d_i) \times f(t, d_j)$
    • where $f(t, d)$ represents the weight of term $t$ in document $d$
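A minimal sketch of this similarity, assuming raw term counts stand in for the weight $f(t, d)$ (the notes do not fix a weighting scheme, so TF-IDF or any other weight could be substituted). The names `term_weights` and `dot_similarity` are illustrative only.

```python
from collections import Counter


def term_weights(tokens):
    """Assumption: use raw term counts as the weight f(t, d)."""
    return Counter(tokens)


def dot_similarity(weights_i, weights_j):
    """Sum f(t, d_i) * f(t, d_j) over terms; absent terms have weight 0,
    so only terms shared by both documents can contribute."""
    shared = weights_i.keys() & weights_j.keys()
    return sum(weights_i[t] * weights_j[t] for t in shared)


d1 = term_weights("apple banana apple cherry".split())
d2 = term_weights("apple cherry cherry date".split())
sim = dot_similarity(d1, d2)  # apple: 2 * 1, cherry: 1 * 2
```

Here only "apple" and "cherry" add to the similarity, which is what makes it possible to attribute the similarity to individual terms.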


The contribution of a term is its overall contribution to the documents' similarities, given by the following equation:

  • $\text{TC}(t) = \sum_{i,j} f(t, d_i) \times f(t, d_j)$
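The sum can be sketched as a direct double loop over documents. Two assumptions here are not stated in the notes: each document is a dict mapping terms to weights, and self-pairs ($i = j$) are skipped so that only the similarity between two different documents is counted. The function name `term_contribution` is hypothetical.

```python
def term_contribution(docs, term):
    """Naive TC(t): sum f(t, d_i) * f(t, d_j) over ordered pairs (i, j).

    Assumption: pairs with i == j are excluded.
    """
    weights = [doc.get(term, 0.0) for doc in docs]
    total = 0.0
    for i, w_i in enumerate(weights):
        for j, w_j in enumerate(weights):
            if i != j:
                total += w_i * w_j
    return total


docs = [
    {"apple": 2, "banana": 1},
    {"apple": 1, "cherry": 2},
    {"apple": 3},
]
tc_apple = term_contribution(docs, "apple")
tc_banana = term_contribution(docs, "banana")
```

With these toy documents, "banana" occurs in a single document, so it never contributes to the similarity of a pair and its score is zero, while "apple" occurs in all three and scores highest.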


Computing it directly is slow: $O(n^2)$ in the number of documents $n$, because the sum runs over pairs of documents

  • sample to speed it up
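One possible reading of the sampling idea, sketched under assumptions: draw a random subset of documents once and compute the pairwise sum on that subset for every term, so the scores stay comparable across terms. The notes do not say what is sampled or how, so the helper names, the `sample_size` parameter, the fixed seed, and the choice to skip self-pairs are all illustrative.

```python
import random


def pairwise_tc(docs, term):
    """Pairwise sum of weight products for one term (self-pairs skipped)."""
    weights = [doc.get(term, 0.0) for doc in docs]
    return sum(
        w_i * w_j
        for i, w_i in enumerate(weights)
        for j, w_j in enumerate(weights)
        if i != j
    )


def sampled_tc_scores(docs, terms, sample_size, seed=0):
    """Score every term on the same random sample of documents."""
    rng = random.Random(seed)  # fixed seed only to make the sketch repeatable
    sample = rng.sample(docs, min(sample_size, len(docs)))
    return {t: pairwise_tc(sample, t) for t in terms}


docs = [
    {"apple": 2, "banana": 1},
    {"apple": 1, "cherry": 2},
    {"apple": 3},
]
scores = sampled_tc_scores(docs, ["apple", "banana", "cherry"], sample_size=2)
```

The scores are computed on the sample only, so they are smaller than the full values and are meant for ranking terms against each other rather than as estimates of the exact sum.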


Sources