Term Clustering
Term clustering is the dual problem of Document Clustering
Duality:
- when we use Vector Space Models, e.g. Bag of Words, we have a term-document matrix $D$
- rows of $D$ are documents, columns of $D$ are terms
- so we can cluster the columns instead of the rows
- clustering rows and clustering columns are closely related problems
Term Clustering groups words with a high degree of semantic relatedness
- so we can use the clusters (or their centroids) in place of the individual terms
Li and Jain (1998) [2]:
- view semantic relatedness between words in terms of their co-occurrence and co-absence in the corpus
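A minimal sketch of the co-occurrence/co-absence idea. The measure used here, the simple matching coefficient over binary presence vectors, is an assumption chosen for illustration; these notes do not state the exact formula used by Li and Jain. The helper names and the toy corpus are hypothetical.

```python
def term_presence(docs, vocab):
    """Map each term to a 0/1 vector: does the term occur in each document?"""
    token_sets = [set(doc.lower().split()) for doc in docs]
    return {t: [int(t in s) for s in token_sets] for t in vocab}


def simple_matching(u, v):
    """Fraction of documents where two terms co-occur or are both absent."""
    agree = sum(1 for a, b in zip(u, v) if a == b)
    return agree / len(u)


# Toy corpus (illustrative only)
docs = [
    "cat dog pet",
    "cat dog",
    "stock market",
    "stock market price",
]
vocab = ["cat", "dog", "stock", "market"]

presence = term_presence(docs, vocab)
sim_cat_dog = simple_matching(presence["cat"], presence["dog"])
sim_cat_stock = simple_matching(presence["cat"], presence["stock"])
```

Terms that always appear and disappear together ("cat" and "dog") get the maximum score, while terms that never share a document ("cat" and "stock") get the minimum.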
Clustering of Terms
How to do this?
- try applying usual row clustering techniques on $D^T$
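As a rough illustration of clustering the rows of $D^T$, here is a small pure-Python k-means run on the transposed matrix. The matrix, the vocabulary, and the farthest-first initialisation are all assumptions made for this sketch; any ordinary row clustering method could be substituted.

```python
def transpose(matrix):
    """Turn a documents-by-terms matrix into a terms-by-documents matrix."""
    return [list(col) for col in zip(*matrix)]


def sq_dist(u, v):
    """Squared Euclidean distance between two vectors."""
    return sum((a - b) ** 2 for a, b in zip(u, v))


def kmeans(rows, k, n_iter=20):
    """Plain k-means; returns one cluster label per row."""
    # Farthest-first initialisation: deterministic, chosen only for illustration
    centroids = [list(rows[0])]
    while len(centroids) < k:
        far = max(rows, key=lambda r: min(sq_dist(r, c) for c in centroids))
        centroids.append(list(far))
    labels = None
    for _ in range(n_iter):
        new_labels = [
            min(range(k), key=lambda j: sq_dist(r, centroids[j])) for r in rows
        ]
        if new_labels == labels:  # assignments stopped changing
            break
        labels = new_labels
        for j in range(k):
            members = [r for r, lab in zip(rows, labels) if lab == j]
            if members:
                centroids[j] = [sum(col) / len(members) for col in zip(*members)]
    return labels


# Rows are documents, columns are terms (toy counts)
vocab = ["cat", "dog", "stock", "market"]
D = [
    [1, 1, 0, 0],
    [2, 1, 0, 0],
    [0, 0, 1, 2],
    [0, 0, 1, 1],
]

term_labels = kmeans(transpose(D), k=2)  # cluster terms, not documents
term_clusters = {}
for term, lab in zip(vocab, term_labels):
    term_clusters.setdefault(lab, []).append(term)
```

Calling `kmeans(D, k=2)` on the untransposed matrix would cluster the documents instead, which is the duality in practice.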
Frequent Termset
Apply Local Pattern Discovery and Frequent Patterns Mining techniques for terms:
- view each document as a transaction and its words as items
- we want to find frequent itemsets of words in these documents
- these itemsets are called Frequent Word Patterns
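The transaction view above can be sketched with a small level-wise (Apriori-style) search. This is an illustrative sketch, not a tuned implementation: the function name `frequent_termsets`, the `min_support` and `max_size` parameters, and the toy documents are all assumptions.

```python
from itertools import combinations


def frequent_termsets(docs, min_support, max_size=3):
    """Return {termset: support} for all termsets with support >= min_support."""
    transactions = [frozenset(doc.lower().split()) for doc in docs]
    n = len(transactions)

    def support(itemset):
        # Fraction of documents that contain every word of the itemset
        return sum(1 for t in transactions if itemset <= t) / n

    words = {w for t in transactions for w in t}
    current = {frozenset([w]) for w in words if support(frozenset([w])) >= min_support}
    result = {}
    size = 1
    while current and size <= max_size:
        for s in current:
            result[s] = support(s)
        size += 1
        # Candidates: unions of two frequent sets that have exactly `size` words
        candidates = {a | b for a in current for b in current if len(a | b) == size}
        # Apriori pruning: every (size - 1)-subset must itself be frequent
        current = {
            c for c in candidates
            if all(frozenset(sub) in current for sub in combinations(c, size - 1))
            and support(c) >= min_support
        }
    return result


docs = [
    "machine learning text",
    "machine learning model",
    "text clustering",
    "machine learning text clustering",
]
patterns = frequent_termsets(docs, min_support=0.5)
```

With this toy corpus, `{"machine", "learning"}` appears in three of four documents, while `{"machine", "clustering"}` appears in only one and is dropped.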
Main idea:
- first use Mutual Information to find the best term clustering
- then use Mutual Information again to find the best document clustering
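To make the mutual information criterion concrete, the sketch below computes $I(T; D)$ from a term-by-document count table and compares two candidate term clusterings by how much of that information they keep. It only evaluates the criterion; it does not implement a full clustering algorithm. The count table, the two labelings, and the helper names are hypothetical.

```python
from math import log2


def mutual_information(joint):
    """I(X;Y) in bits for a table of joint counts joint[x][y]."""
    total = sum(sum(row) for row in joint)
    px = [sum(row) / total for row in joint]
    py = [sum(col) / total for col in zip(*joint)]
    mi = 0.0
    for i, row in enumerate(joint):
        for j, count in enumerate(row):
            if count > 0:
                p = count / total
                mi += p * log2(p / (px[i] * py[j]))
    return mi


def merge_rows(joint, labels, k):
    """Sum the rows (terms) that share a cluster label."""
    merged = [[0] * len(joint[0]) for _ in range(k)]
    for row, lab in zip(joint, labels):
        for j, count in enumerate(row):
            merged[lab][j] += count
    return merged


# Rows are terms (cat, dog, stock, market), columns are documents: this is D^T
term_doc = [
    [1, 2, 0, 0],
    [1, 1, 0, 0],
    [0, 0, 1, 1],
    [0, 0, 2, 1],
]

mi_full = mutual_information(term_doc)
mi_good = mutual_information(merge_rows(term_doc, [0, 0, 1, 1], k=2))
mi_bad = mutual_information(merge_rows(term_doc, [0, 1, 0, 1], k=2))
```

Merging terms can never increase the mutual information with the documents, so a term clustering is better the closer its value stays to `mi_full`. Here the labeling that groups "cat" with "dog" keeps far more information than the one that groups "cat" with "stock".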
Simultaneous Term/Document Clustering
Simultaneous clustering of rows and columns is called Co-Clustering
References
- Slonim, Noam, and Naftali Tishby. "Document clustering using word clusters via the information bottleneck method." (2000). [1]
Sources
- Li, Yong H., and Anil K. Jain. "Classification of text documents." (1998). [2]
- Sebastiani, Fabrizio. "Machine learning in automated text categorization." (2002). [3]