Two-Phase Document Clustering

How can we use Term Clustering for Document Clustering?


Two Phases

  • term clustering
  • document clustering


Notation

  • let $D$ be term-document matrix: rows are documents and columns are terms
  • $X = \{ \mathbf x_1 , \ ... \ , \mathbf x_n \}$ - random vectors: rows of $D$
  • $Y = \{ \mathbf y_1 , \ ... \ , \mathbf y_d \}$ - random vectors: columnts of $D$


We want to partition:

  • $Y$ into $l$ clusters $\hat Y = \{ \hat Y_1 , \ ... \ , \hat Y_l \}$
  • $X$ into $k$ clusters $\hat X = \{ \hat X_1 , \ ... \ , \hat X_k \}$


Problem Statement

Formally, we want to find mappings:

  • $C_Y : Y \to \hat Y$ or $\mathbf y_1 , \ ... \ , \mathbf y_d \to \hat Y_1, \ ... \ , \hat Y_l $
  • $C_X : X \to \hat X$, or $\mathbf x_1 , \ ... \ , \mathbf x_n \to \hat X_1, \ ... \ , \hat X_k $


Phase One: Term Clustering

Find word clustering s.t.

  • most of Mutual Information between words and documents is preserved
  • when we go from representing docs in terms of words to representing docs in terms of word clusters


Cluster:

  • find clustering of $Y$ into $\hat Y$ s.t. information from $I(X; Y)$ is preserved in $I(X; \hat Y)$ as good as possible


How?

  • $P(X, Y)$ always has more information than $P(X, \hat Y)$, so $I(X; \hat Y) \leqslant I(X; Y)$
  • thus find such mapping $C_Y$ that minimizes $I(X; Y) - I(X; \hat Y)$
  • this loss function is called "Mutual information loss"


Phase Two: Document Clustering

Using term clusters:

  • perform document clustering
  • cluster $X$ into $\hat X$ s.t. we preserve as much of $I(X; \hat Y)$ as possible in $I(\hat X ; \hat Y)$


Same here:

  • minimize $I(X; \hat Y) - I(\hat X; \hat Y)$


For details, see Slonim2000


References

  • Slonim, Noam, and Naftali Tishby. "Document clustering using word clusters via the information bottleneck method." 2000. [1]

Sources

  • Aggarwal, Charu C., and ChengXiang Zhai. "A survey of text clustering algorithms." Mining Text Data. Springer US, 2012. [2]