Term Clustering
Term clustering is the dual problem of Document Clustering
- it is also an unsupervised Text Mining technique, but applied to terms instead of documents
- term clustering can be a good technique for Dimensionality Reduction
Duality:
- when we use Vector Space Models, e.g. Bag of Words, then we have a term-document matrix $D$
- rows of $D$ are documents, columns of $D$ are terms
- so we can cluster columns instead of rows: clustering rows and clustering columns are closely related problems
Term Clustering groups words with a high degree of semantic relatedness
- so we can use clusters (centroids of terms) to represent terms
Li and Jain (1998):
- view semantic relatedness between words in terms of their co-occurrence and co-absence in the corpus
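To make the co-occurrence / co-absence idea concrete, here is a minimal sketch. It scores two terms by the fraction of documents in which both appear or both are missing (the simple matching coefficient). This is only an illustration and not necessarily the exact measure used by Li and Jain; the function name `presence_agreement` and the toy corpus are assumptions.

```python
def presence_agreement(docs, term_a, term_b):
    """Fraction of documents where both terms are present or both are absent."""
    agree = 0
    for doc in docs:
        words = set(doc.lower().split())
        # True == True (co-occurrence) or False == False (co-absence)
        if (term_a in words) == (term_b in words):
            agree += 1
    return agree / len(docs)


# Toy corpus (hypothetical data, whitespace tokenisation)
docs = [
    "the cat chased the mouse",
    "a cat and a mouse",
    "stocks fell as the market dropped",
    "the market rallied and stocks rose",
]

cat_mouse = presence_agreement(docs, "cat", "mouse")
cat_market = presence_agreement(docs, "cat", "market")
```

Terms with a high score behave alike across the corpus and are candidates for the same cluster.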
Clustering of Terms
How to do this?
- try applying the usual row clustering techniques to $D^T$
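A minimal sketch of this idea, assuming k-means as the "usual" row clustering technique: transpose $D$ so that every row is a term, then cluster the rows. The matrix, the term list and the deterministic farthest-first initialisation are illustrative assumptions, not part of the original notes.

```python
def transpose(matrix):
    return [list(col) for col in zip(*matrix)]


def sq_dist(u, v):
    return sum((a - b) ** 2 for a, b in zip(u, v))


def kmeans(rows, k, iterations=20):
    """Plain k-means with a deterministic farthest-first initialisation."""
    centroids = [list(rows[0])]
    while len(centroids) < k:
        # pick the row that is farthest from its nearest existing centroid
        farthest = max(rows, key=lambda r: min(sq_dist(r, c) for c in centroids))
        centroids.append(list(farthest))
    labels = [0] * len(rows)
    for _ in range(iterations):
        # assignment step: nearest centroid for every row
        labels = [min(range(k), key=lambda j: sq_dist(r, centroids[j])) for r in rows]
        # update step: mean of the rows assigned to each centroid
        for j in range(k):
            members = [r for r, lab in zip(rows, labels) if lab == j]
            if members:
                centroids[j] = [sum(vals) / len(members) for vals in zip(*members)]
    return labels


# Rows of D are documents, columns are terms (toy counts)
terms = ["cat", "mouse", "market", "stocks"]
D = [
    [2, 1, 0, 0],
    [1, 1, 0, 0],
    [0, 0, 1, 2],
    [0, 0, 2, 1],
]

# Clustering the rows of D^T clusters the terms
term_labels = kmeans(transpose(D), k=2)
```

Calling `kmeans(D, k=2)` on the untransposed matrix would cluster documents with exactly the same code, which is the duality in practice.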
Frequent Termset
Apply Local Pattern Discovery and Frequent Patterns Mining techniques to terms:
- we can treat a document as a transaction and its words as items
- we want to find frequent itemsets of words in these documents
- these are called Frequent Word Patterns
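The transaction view can be sketched as follows. This is a brute-force counter over word sets of size 1 and 2, not a real Apriori or FP-Growth implementation (it does no candidate pruning); the function name `frequent_word_sets`, the `min_support` threshold and the toy documents are assumptions made for illustration.

```python
from collections import Counter
from itertools import combinations


def frequent_word_sets(docs, min_support, max_size=2):
    """Count word sets up to max_size and keep those in >= min_support documents."""
    # each document becomes a transaction: the set of its distinct words
    transactions = [set(doc.lower().split()) for doc in docs]
    counts = Counter()
    for words in transactions:
        for size in range(1, max_size + 1):
            # sorted() gives every itemset a canonical tuple form
            for itemset in combinations(sorted(words), size):
                counts[itemset] += 1
    return {itemset: n for itemset, n in counts.items() if n >= min_support}


docs = [
    "cat mouse cheese",
    "cat mouse",
    "market stocks",
    "market stocks cat",
]

frequent = frequent_word_sets(docs, min_support=2)
```

Here the pairs `("cat", "mouse")` and `("market", "stocks")` survive the support threshold, while `("cat", "market")` occurs in only one document and is dropped.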
Two-Phase Document Clustering
Main idea:
- first use Mutual Information to find the best term clustering
- then use Mutual Information again to find the best document clustering on top of the term clusters
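For reference, the standard definition of mutual information between terms $T$ and documents $D$, where $p(t, d)$ can be estimated from the normalized counts of the term-document matrix:

$$I(T; D) = \sum_{t} \sum_{d} p(t, d) \log \frac{p(t, d)}{p(t)\, p(d)}$$

Under this reading (the notation $\tilde{T}$, $\tilde{D}$ for term and document clusters is an assumption made here), the first phase looks for term clusters $\tilde{T}$ that keep $I(\tilde{T}; D)$ as high as possible, and the second phase looks for document clusters $\tilde{D}$ that keep $I(\tilde{D}; \tilde{T})$ as high as possible.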
Simultaneous Term/Document Clustering
Simultaneous clustering of rows and columns is called Co-Clustering
- the simplest way is to use Non-Negative Matrix Factorization
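A minimal sketch of co-clustering via NMF, assuming the classic multiplicative updates for the squared-error objective $\|D - WH\|^2$. Each document is assigned to the component with the largest weight in its row of $W$, each term to the component with the largest weight in its column of $H$. The helper names, the random initialisation and the toy matrix are assumptions; a library implementation would normally be used in practice.

```python
import random


def matmul(A, B):
    cols = list(zip(*B))
    return [[sum(a * b for a, b in zip(row, col)) for col in cols] for row in A]


def transpose(A):
    return [list(col) for col in zip(*A)]


def frobenius_error(V, W, H):
    WH = matmul(W, H)
    return sum((v - x) ** 2 for vr, xr in zip(V, WH) for v, x in zip(vr, xr))


def nmf(V, k, iterations=200, seed=0, eps=1e-9):
    """Factor V (n x m) into non-negative W (n x k) and H (k x m)."""
    rng = random.Random(seed)
    n, m = len(V), len(V[0])
    W = [[rng.random() + 0.1 for _ in range(k)] for _ in range(n)]
    H = [[rng.random() + 0.1 for _ in range(m)] for _ in range(k)]
    initial_error = frobenius_error(V, W, H)
    for _ in range(iterations):
        # H <- H * (W^T V) / (W^T W H), element-wise
        Wt = transpose(W)
        num, den = matmul(Wt, V), matmul(matmul(Wt, W), H)
        H = [[H[i][j] * num[i][j] / (den[i][j] + eps) for j in range(m)] for i in range(k)]
        # W <- W * (V H^T) / (W H H^T), element-wise
        Ht = transpose(H)
        num, den = matmul(V, Ht), matmul(W, matmul(H, Ht))
        W = [[W[i][j] * num[i][j] / (den[i][j] + eps) for j in range(k)] for i in range(n)]
    return W, H, initial_error


# Rows are documents, columns are terms (toy counts)
D = [
    [2, 1, 0, 0],
    [1, 1, 0, 0],
    [0, 0, 1, 2],
    [0, 0, 2, 1],
]

W, H, initial_error = nmf(D, k=2)
final_error = frobenius_error(D, W, H)

# Row clusters (documents) from W, column clusters (terms) from H
doc_clusters = [max(range(2), key=lambda c: row[c]) for row in W]
term_clusters = [max(range(2), key=lambda c: H[c][j]) for j in range(len(D[0]))]
```

Because the same components index both `W` and `H`, documents and terms that share a component end up in the same co-cluster. The result depends on the random initialisation, so the component numbering is arbitrary.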
References
- Slonim, Noam, and Naftali Tishby. “Document clustering using word clusters via the information bottleneck method.” (2000). [http://lsa3.colorado.edu/LexicalSemantics/slonim00document.pdf]
Sources
- Li, Yong H., and Anil K. Jain. “Classification of text documents.” (1998) [http://julio.staff.ipb.ac.id/files/2014/09/LiJ98.pdf]
- Sebastiani, Fabrizio. “Machine learning in automated text categorization.” (2002). [http://arxiv.org/pdf/cs/0110053.pdf]