Two-Phase Document Clustering
How can we use Term Clustering for Document Clustering?
Two Phases
- term clustering
- document clustering
Notation
- let $D$ be term-document matrix: rows are documents and columns are terms
- $X = \{ \mathbf x_1 , \ ... \ , \mathbf x_n \}$ - random vectors: rows of $D$
- $Y = \{ \mathbf y_1 , \ ... \ , \mathbf y_d \}$ - random vectors: columnts of $D$
We want to partition:
- $Y$ into $l$ clusters $\hat Y = \{ \hat Y_1 , \ ... \ , \hat Y_l \}$
- $X$ into $k$ clusters $\hat X = \{ \hat X_1 , \ ... \ , \hat X_k \}$
Problem Statement
Formally, we want to find mappings:
- $C_Y : Y \to \hat Y$ or $\mathbf y_1 , \ ... \ , \mathbf y_d \to \hat Y_1, \ ... \ , \hat Y_l $
- $C_X : X \to \hat X$, or $\mathbf x_1 , \ ... \ , \mathbf x_n \to \hat X_1, \ ... \ , \hat X_k $
Phase One: Term Clustering
Find word clustering s.t.
- most of Mutual Information between words and documents is preserved
- when we go from representing docs in terms of words to representing docs in terms of word clusters
Cluster:
- find clustering of $Y$ into $\hat Y$ s.t. information from $I(X; Y)$ is preserved in $I(X; \hat Y)$ as good as possible
How?
- $P(X, Y)$ always has more information than $P(X, \hat Y)$, so $I(X; \hat Y) \leqslant I(X; Y)$
- thus find such mapping $C_Y$ that minimizes $I(X; Y) - I(X; \hat Y)$
- this loss function is called "Mutual information loss"
Phase Two: Document Clustering
Using term clusters:
- perform document clustering
- cluster $X$ into $\hat X$ s.t. we preserve as much of $I(X; \hat Y)$ as possible in $I(\hat X ; \hat Y)$
Same here:
- minimize $I(X; \hat Y) - I(\hat X; \hat Y)$
For details, see Slonim2000
References
- Slonim, Noam, and Naftali Tishby. "Document clustering using word clusters via the information bottleneck method." 2000. [1]
Sources
- Aggarwal, Charu C., and ChengXiang Zhai. "A survey of text clustering algorithms." Mining Text Data. Springer US, 2012. [2]