# ML Wiki

## Term Clustering

Term clustering is the dual problem of Document Clustering.

Duality:

• when we use Vector Space Models, e.g. Bag of Words, we have a term-document matrix $D$
• rows of $D$ are documents, columns of $D$ are terms
• so we can cluster columns instead of rows
• clustering rows and clustering columns are closely related problems
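To make the duality concrete, here is a minimal sketch that builds a tiny term-document matrix and transposes it. The toy corpus and the names `docs`, `vocab`, `D` and `D_T` are illustrative assumptions, not part of the original notes.

```python
from collections import Counter

# Toy corpus (hypothetical); each string is one document.
docs = [
    "cat dog cat",
    "dog bird",
    "stock market stock",
]

# Sorted vocabulary, so column j of D corresponds to vocab[j].
vocab = sorted({w for d in docs for w in d.split()})

# D[i][j] = count of term vocab[j] in document i
# (rows are documents, columns are terms).
D = [[Counter(d.split())[w] for w in vocab] for d in docs]

# Transposing gives one row per term:
# D_T[j] is the document profile of vocab[j].
D_T = [list(col) for col in zip(*D)]

for term, profile in zip(vocab, D_T):
    print(term, profile)
```

Any method that clusters the rows of `D` (documents) can then be run unchanged on the rows of `D_T` (terms).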

Term Clustering groups words with a high degree of semantic relatedness

• so we can use clusters (centroids of terms) to represent terms

Li and Jain (1998):

• view semantic relatedness between words in terms of their co-occurrence and co-absence in the corpus
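One simple way to turn "co-occurrence and co-absence" into a number is a matching coefficient: the fraction of documents in which two terms are either both present or both absent. This is only an illustrative sketch; the function `agreement` and the toy corpus are assumptions, and it is not necessarily the measure used by Li and Jain.

```python
def agreement(term_a, term_b, docs):
    """Fraction of documents where both terms are present or both are absent."""
    agree = 0
    for d in docs:
        words = set(d.split())
        # True == True (co-occurrence) or False == False (co-absence)
        if (term_a in words) == (term_b in words):
            agree += 1
    return agree / len(docs)

# Toy corpus (hypothetical)
corpus = ["cat dog", "cat dog bird", "stock market", "market stock bank"]

print(agreement("cat", "dog", corpus))    # always appear together
print(agreement("cat", "stock", corpus))  # never agree
print(agreement("cat", "bird", corpus))   # agree in some documents
```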

## Clustering of Terms

How to do this?

• try applying usual row clustering techniques on $D^T$
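As a sketch of this idea, the snippet below runs a plain k-means (Lloyd's algorithm) on the rows of $D^T$. K-means is just one possible choice of row clustering technique; the toy matrix, the farthest-point initialisation and the names `kmeans` and `sqdist` are assumptions made for the example.

```python
def sqdist(a, b):
    """Squared Euclidean distance between two vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def kmeans(rows, k, iters=10):
    """Plain Lloyd's k-means with deterministic farthest-point initialisation."""
    centroids = [list(rows[0])]
    while len(centroids) < k:
        far = max(rows, key=lambda r: min(sqdist(r, c) for c in centroids))
        centroids.append(list(far))
    labels = [0] * len(rows)
    for _ in range(iters):
        # Assignment step: each row goes to its nearest centroid.
        for i, r in enumerate(rows):
            labels[i] = min(range(k), key=lambda c: sqdist(r, centroids[c]))
        # Update step: each centroid becomes the mean of its members.
        for c in range(k):
            members = [r for r, l in zip(rows, labels) if l == c]
            if members:
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels

terms = ["cat", "dog", "stock", "market"]
# Rows are documents, columns are terms (toy counts).
D = [
    [2, 1, 0, 0],
    [1, 2, 0, 0],
    [0, 0, 3, 1],
    [0, 0, 1, 2],
]
D_T = [list(col) for col in zip(*D)]  # rows are now terms

labels = kmeans(D_T, k=2)
print(dict(zip(terms, labels)))
```

On this toy data, terms that appear in the same documents end up in the same cluster.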

### Frequent Termset

Apply Local Pattern Discovery and Frequent Patterns Mining techniques for terms:

• can see a document as a transaction and its words as items
• we want to find frequent itemsets of words in these documents
• such itemsets are called Frequent Word Patterns
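The transaction view can be sketched with a brute-force counter of small termsets. This is a minimal illustration, not Apriori or any other pruning-based miner; `frequent_termsets`, `min_support` and `max_size` are hypothetical names chosen for the example.

```python
from collections import Counter
from itertools import combinations

def frequent_termsets(docs, min_support, max_size=2):
    """Return {termset: support} for termsets found in at least min_support documents."""
    counts = Counter()
    for d in docs:
        transaction = sorted(set(d.split()))  # a document is a set of word "items"
        for size in range(1, max_size + 1):
            for items in combinations(transaction, size):
                counts[items] += 1
    return {s: c for s, c in counts.items() if c >= min_support}

# Toy corpus (hypothetical)
corpus = ["cat dog bird", "cat dog", "dog bird", "stock market"]
frequent = frequent_termsets(corpus, min_support=2)

for termset, support in sorted(frequent.items()):
    print(termset, support)
```

Enumerating all subsets is exponential in `max_size`, so a real corpus would need a proper frequent pattern mining algorithm.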

### Two-Phase Document Clustering

Main idea:

• use Mutual Information to find best term clustering
• and then use mutual information to find best document clustering
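The sketch below shows only how mutual information can score a given term clustering: terms in the same cluster are merged into one column, and the mutual information between documents and term clusters is computed from the merged counts. It does not implement the search for the best clustering (Slonim and Tishby use the information bottleneck method for that); the helper names and the toy matrix are assumptions.

```python
from math import log2

def mutual_information(counts):
    """I(X;Y) in bits, from a table of joint counts (rows = X, columns = Y)."""
    total = sum(sum(row) for row in counts)
    row_p = [sum(row) / total for row in counts]
    col_p = [sum(col) / total for col in zip(*counts)]
    mi = 0.0
    for i, row in enumerate(counts):
        for j, n in enumerate(row):
            if n > 0:
                p = n / total
                mi += p * log2(p / (row_p[i] * col_p[j]))
    return mi

def merge_columns(counts, labels, k):
    """Sum the columns of `counts` that share a cluster label."""
    merged = [[0] * k for _ in counts]
    for i, row in enumerate(counts):
        for j, n in enumerate(row):
            merged[i][labels[j]] += n
    return merged

# Rows are documents, columns are the terms cat, dog, stock, market (toy counts).
D = [
    [2, 1, 0, 0],
    [1, 2, 0, 0],
    [0, 0, 3, 1],
    [0, 0, 1, 2],
]

mi_full = mutual_information(D)
mi_good = mutual_information(merge_columns(D, [0, 0, 1, 1], 2))  # {cat, dog}, {stock, market}
mi_bad = mutual_information(merge_columns(D, [0, 1, 0, 1], 2))   # {cat, stock}, {dog, market}

print(mi_full, mi_good, mi_bad)
```

Merging terms can never increase the mutual information with documents, so a good term clustering is one that loses as little of it as possible. The same scoring can then be applied to document clusters against the chosen term clusters.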

## Simultaneous Term/Document Clustering

Simultaneous clustering of rows and columns is called Co-Clustering.

## References

• Slonim, Noam, and Naftali Tishby. "Document clustering using word clusters via the information bottleneck method." 2000. [1]

## Sources

• Li, Yong H., and Anil K. Jain. "Classification of text documents." (1998) [2]
• Sebastiani, Fabrizio. "Machine learning in automated text categorization." (2002). [3]