# ML Wiki

## Two-Phase Document Clustering

How can we use Term Clustering for Document Clustering?

Two Phases

• term clustering
• document clustering

### Notation

• let $D$ be term-document matrix: rows are documents and columns are terms
• $X = \{ \mathbf x_1 , \ ... \ , \mathbf x_n \}$ - random vectors: rows of $D$
• $Y = \{ \mathbf y_1 , \ ... \ , \mathbf y_d \}$ - random vectors: columnts of $D$

We want to partition:

• $Y$ into $l$ clusters $\hat Y = \{ \hat Y_1 , \ ... \ , \hat Y_l \}$
• $X$ into $k$ clusters $\hat X = \{ \hat X_1 , \ ... \ , \hat X_k \}$

### Problem Statement

Formally, we want to find mappings:

• $C_Y : Y \to \hat Y$ or $\mathbf y_1 , \ ... \ , \mathbf y_d \to \hat Y_1, \ ... \ , \hat Y_l$
• $C_X : X \to \hat X$, or $\mathbf x_1 , \ ... \ , \mathbf x_n \to \hat X_1, \ ... \ , \hat X_k$

### Phase One: Term Clustering

Find word clustering s.t.

• most of Mutual Information between words and documents is preserved
• when we go from representing docs in terms of words to representing docs in terms of word clusters

Cluster:

• find clustering of $Y$ into $\hat Y$ s.t. information from $I(X; Y)$ is preserved in $I(X; \hat Y)$ as good as possible

How?

• $P(X, Y)$ always has more information than $P(X, \hat Y)$, so $I(X; \hat Y) \leqslant I(X; Y)$
• thus find such mapping $C_Y$ that minimizes $I(X; Y) - I(X; \hat Y)$
• this loss function is called "Mutual information loss"

### Phase Two: Document Clustering

Using term clusters:

• perform document clustering
• cluster $X$ into $\hat X$ s.t. we preserve as much of $I(X; \hat Y)$ as possible in $I(\hat X ; \hat Y)$

Same here:

• minimize $I(X; \hat Y) - I(\hat X; \hat Y)$

For details, see Slonim2000

## References

• Slonim, Noam, and Naftali Tishby. "Document clustering using word clusters via the information bottleneck method." 2000. [1]

## Sources

• Aggarwal, Charu C., and ChengXiang Zhai. "A survey of text clustering algorithms." Mining Text Data. Springer US, 2012. [2]