ML Wiki
Machine Learning Wiki - A collection of ML concepts, algorithms, and resources.

Term Clustering

Term clustering is the dual problem of Document Clustering.

Duality:

  • when we use Vector Space Models, e.g. Bag of Words, we have a document-term matrix $D$
  • rows of $D$ are documents, columns of $D$ are terms
  • we can cluster the columns instead of the rows
  • clustering rows and clustering columns are closely related problems
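
The duality can be illustrated with a small sketch. The toy corpus and the helper names `build_matrix` and `transpose` are assumptions made for this example only: it builds a Bag of Words count matrix with documents as rows, then transposes it so that each row is a term vector.

```python
def build_matrix(docs):
    """Return (vocab, D) where D[i][j] counts occurrences of term j in document i."""
    vocab = sorted({w for doc in docs for w in doc.split()})
    index = {w: j for j, w in enumerate(vocab)}
    D = [[0] * len(vocab) for _ in docs]
    for i, doc in enumerate(docs):
        for w in doc.split():
            D[i][index[w]] += 1
    return vocab, D


def transpose(M):
    """Swap rows and columns, so the rows of the result are term vectors."""
    return [list(col) for col in zip(*M)]


# Toy corpus (hypothetical): two documents about pets, two about finance.
docs = ["cat dog cat", "dog bone", "stock market", "market stock stock"]
vocab, D = build_matrix(docs)
Dt = transpose(D)

print(vocab)   # column labels of D, row labels of Dt
print(D[0])    # document 0 as a vector over terms
print(Dt[4])   # the term "stock" as a vector over documents
```

Any algorithm that clusters the rows of `D` groups documents; the same algorithm applied to the rows of `Dt` groups terms.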

Term Clustering groups words with a high degree of semantic relatedness

  • so we can use clusters (centroids of terms) to represent terms

Li and Jain (1998):

  • view semantic relatedness between words in terms of their co-occurrence and co-absence in the corpus
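
One simple way to turn co-occurrence and co-absence into a number is sketched below. This is an illustrative assumption, not necessarily the measure used by Li and Jain: the hypothetical function `agreement` returns the fraction of documents in which two terms are either both present or both absent.

```python
def presence(docs, term):
    """One boolean per document: does the term occur in it?"""
    return [term in doc.split() for doc in docs]


def agreement(docs, a, b):
    """Fraction of documents where a and b are both present or both absent."""
    pa, pb = presence(docs, a), presence(docs, b)
    same = sum(1 for x, y in zip(pa, pb) if x == y)
    return same / len(docs)


# Toy corpus (hypothetical).
docs = ["cat dog cat", "dog bone", "stock market", "market stock stock"]

print(agreement(docs, "stock", "market"))  # always appear or vanish together
print(agreement(docs, "cat", "dog"))
print(agreement(docs, "cat", "stock"))     # rarely agree
```

Terms with a high agreement score are candidates for the same cluster.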

Clustering of Terms

How to do this?

  • apply the usual row clustering techniques to $D^T$, whose rows are term vectors
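
As a minimal sketch of this idea, the block below runs a plain Lloyd-style K-Means on the rows of $D^T$. The choice of K-Means, the toy matrix, and the fixed initial centroids are assumptions made for illustration; any row clustering method could be substituted.

```python
def kmeans(rows, init, iters=10):
    """Plain Lloyd's k-means; `init` lists the row indices used as starting centroids."""
    centroids = [[float(x) for x in rows[i]] for i in init]
    labels = [0] * len(rows)
    for _ in range(iters):
        # Assignment step: each row goes to the nearest centroid (squared Euclidean).
        labels = [
            min(range(len(centroids)),
                key=lambda c: sum((x - y) ** 2 for x, y in zip(row, centroids[c])))
            for row in rows
        ]
        # Update step: each centroid becomes the mean of its members.
        for c in range(len(centroids)):
            members = [row for row, lab in zip(rows, labels) if lab == c]
            if members:
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels, centroids


# Hypothetical toy data: rows of D are documents, columns are terms.
vocab = ["bone", "cat", "dog", "market", "stock"]
D = [
    [0, 2, 1, 0, 0],
    [1, 0, 1, 0, 0],
    [0, 0, 0, 1, 1],
    [0, 0, 0, 1, 2],
]
Dt = [list(col) for col in zip(*D)]  # rows of Dt are terms

labels, centroids = kmeans(Dt, init=[0, 3])
clusters = {}
for term, lab in zip(vocab, labels):
    clusters.setdefault(lab, []).append(term)
print(clusters)
```

The resulting centroids are vectors over documents, so each one can stand in for all terms of its cluster.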

Frequent Termset

Apply Local Pattern Discovery and Frequent Patterns Mining techniques to terms:

  • view each document as a transaction and its words as items
  • we want to find frequent itemsets of words in these documents
  • these are called Frequent Word Patterns
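
The transaction view can be sketched as follows. For brevity this is a brute-force enumeration rather than Apriori or another efficient miner, and the function name `frequent_termsets` and the parameters `min_support` and `max_size` are hypothetical.

```python
from itertools import combinations


def frequent_termsets(docs, min_support, max_size=2):
    """Return {termset: support} for termsets found in at least min_support documents."""
    # Each document becomes a transaction: the set of distinct words it contains.
    transactions = [set(doc.split()) for doc in docs]
    vocab = sorted(set().union(*transactions))
    result = {}
    for size in range(1, max_size + 1):
        for termset in combinations(vocab, size):
            support = sum(1 for t in transactions if t.issuperset(termset))
            if support >= min_support:
                result[termset] = support
    return result


# Toy corpus (hypothetical).
docs = ["cat dog cat", "dog bone", "stock market", "market stock stock"]
patterns = frequent_termsets(docs, min_support=2)
for termset, support in patterns.items():
    print(termset, support)
```

Brute force is exponential in `max_size`; on a real vocabulary a miner that prunes infrequent subsets first would be needed.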

Two-Phase Document Clustering

Main idea (Slonim and Tishby, 2000):

  • first use Mutual Information to find the best term clustering
  • then use Mutual Information to find the best document clustering, with documents represented over the term clusters
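
The quantity being optimized can be made concrete with a small sketch. It only evaluates the Mutual Information between term clusters and documents for two candidate clusterings; it does not implement the search procedure from the paper. The toy counts and the helper names `mutual_information` and `merge_rows` are assumptions.

```python
from math import log2


def mutual_information(counts):
    """I(X;Y) in bits for a joint count table counts[x][y]."""
    total = sum(sum(row) for row in counts)
    px = [sum(row) / total for row in counts]
    py = [sum(col) / total for col in zip(*counts)]
    mi = 0.0
    for x, row in enumerate(counts):
        for y, n in enumerate(row):
            if n:  # zero cells contribute nothing
                p = n / total
                mi += p * log2(p / (px[x] * py[y]))
    return mi


def merge_rows(counts, labels, k):
    """Sum the rows that share a cluster label, giving a k-row count table."""
    merged = [[0] * len(counts[0]) for _ in range(k)]
    for row, c in zip(counts, labels):
        for j, n in enumerate(row):
            merged[c][j] += n
    return merged


# Hypothetical term-by-document counts: bone, cat, dog, market, stock.
Dt = [
    [0, 1, 0, 0],
    [2, 0, 0, 0],
    [1, 1, 0, 0],
    [0, 0, 1, 1],
    [0, 0, 1, 2],
]

mi_full = mutual_information(Dt)
mi_good = mutual_information(merge_rows(Dt, [0, 0, 0, 1, 1], 2))  # pets vs finance
mi_bad = mutual_information(merge_rows(Dt, [0, 1, 0, 1, 0], 2))   # mixed clusters
print(mi_full, mi_good, mi_bad)
```

Merging terms can only lose information about the documents, so a good term clustering is one that keeps the merged value close to the unmerged one. The same criterion can then be applied to cluster documents over the resulting term clusters.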

Simultaneous Term/Document Clustering

Simultaneous clustering of rows and columns is called Co-Clustering

References

  • Slonim, Noam, and Naftali Tishby. “Document clustering using word clusters via the information bottleneck method.” (2000) [http://lsa3.colorado.edu/LexicalSemantics/slonim00document.pdf]

Sources

  • Li, Yong H., and Anil K. Jain. “Classification of text documents.” (1998) [http://julio.staff.ipb.ac.id/files/2014/09/LiJ98.pdf]
  • Sebastiani, Fabrizio. “Machine learning in automated text categorization.” (2002). [http://arxiv.org/pdf/cs/0110053.pdf]