Term Clustering

Term clustering is the dual problem of Document Clustering.


Duality:

  • when we use Vector Space Models, e.g. Bag of Words, we have a document-term matrix $D$
  • rows of $D$ are documents, columns of $D$ are terms
  • so we can cluster the columns instead of the rows
  • clustering rows and clustering columns are closely related problems
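
The duality can be made concrete with a minimal sketch. The toy corpus and the names `docs`, `vocab`, `D` and `D_T` are assumptions made for illustration only; the point is that transposing $D$ turns every term into a vector over documents.

```python
# Toy corpus (hypothetical data, only for illustration).
docs = [
    "apple banana apple",
    "banana cherry",
    "car engine wheel",
    "engine wheel wheel",
]

# Sorted vocabulary, so that column indices are deterministic.
vocab = sorted({w for d in docs for w in d.split()})

# D[i][j] = count of term j in document i (rows: documents, columns: terms).
D = [[d.split().count(t) for t in vocab] for d in docs]

# Transposing gives one row per term: each term is a vector over documents.
D_T = [list(col) for col in zip(*D)]

for term, row in zip(vocab, D_T):
    print(term, row)
```

Any technique that clusters the rows of `D` (documents) can then be pointed at the rows of `D_T` (terms) without modification.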


Term Clustering groups words with a high degree of semantic relatedness

  • so we can use the clusters (e.g. the centroids of their terms) to represent the terms they contain
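
As a rough sketch of what representing terms by their clusters could look like: the assignment `term_to_cluster` below is hypothetical (it is given by hand, not computed), and summing counts per cluster is only one possible way to re-express documents over clusters.

```python
vocab = ["apple", "banana", "car", "cherry", "engine", "wheel"]
# Document-term counts (rows: documents, columns: terms); toy data.
D = [
    [2, 1, 0, 0, 0, 0],
    [0, 1, 0, 1, 0, 0],
    [0, 0, 1, 0, 1, 1],
    [0, 0, 0, 0, 1, 2],
]
# Hypothetical term clustering, assumed to be given.
term_to_cluster = {"apple": 0, "banana": 0, "cherry": 0,
                   "car": 1, "engine": 1, "wheel": 1}
n_clusters = 2

# Centroid of a cluster: the mean of its term vectors (columns of D).
columns = list(zip(*D))
centroids = []
for c in range(n_clusters):
    members = [columns[j] for j, t in enumerate(vocab) if term_to_cluster[t] == c]
    centroids.append([sum(vals) / len(members) for vals in zip(*members)])

# Each document re-expressed over clusters instead of individual terms:
# the counts of all terms in the same cluster are added up.
D_reduced = [
    [sum(row[j] for j, t in enumerate(vocab) if term_to_cluster[t] == c)
     for c in range(n_clusters)]
    for row in D
]
print(D_reduced)  # [[3, 0], [2, 0], [0, 3], [0, 3]]
```

The reduced matrix has one column per cluster, so six term features shrink to two here.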


Li and Jain (1998):

  • they view semantic relatedness between words in terms of their co-occurrence and co-absence in the corpus
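
One simple way to quantify co-occurrence and co-absence is the fraction of documents in which two words are either both present or both absent (the simple matching coefficient). This is an assumption chosen for illustration, not necessarily the measure used by Li and Jain; the function name and corpus are hypothetical.

```python
def cooccurrence_coabsence(term_a, term_b, docs):
    """Fraction of documents in which both terms occur or both are absent."""
    agree = 0
    for doc in docs:
        words = set(doc.split())
        # True == True (co-occurrence) or False == False (co-absence).
        if (term_a in words) == (term_b in words):
            agree += 1
    return agree / len(docs)

docs = [
    "apple banana apple",
    "banana cherry",
    "car engine wheel",
    "engine wheel wheel",
]

sim_related = cooccurrence_coabsence("engine", "wheel", docs)
sim_unrelated = cooccurrence_coabsence("apple", "wheel", docs)
print(sim_related, sim_unrelated)  # 1.0 0.25
```

Here "engine" and "wheel" always appear or are missing together, so they score 1.0, while "apple" and "wheel" agree only on the one document that contains neither.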


Clustering of Terms

How to do this?

  • apply the usual row clustering techniques to $D^T$, whose rows are terms
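
A minimal sketch of this idea, assuming cosine similarity and single-link clustering with a similarity threshold as the row clustering technique (any other row clustering method could be swapped in). The function names, the threshold value and the toy matrix are illustrative assumptions.

```python
import math

def cosine(u, v):
    """Cosine similarity of two vectors (0.0 if either one is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def cluster_rows(rows, threshold):
    """Single-link clustering: rows end up in the same cluster if a chain
    of pairwise similarities >= threshold connects them."""
    labels = list(range(len(rows)))  # every row starts in its own cluster
    for i in range(len(rows)):
        for j in range(i + 1, len(rows)):
            if labels[i] != labels[j] and cosine(rows[i], rows[j]) >= threshold:
                old, new = labels[j], labels[i]
                labels = [new if lab == old else lab for lab in labels]
    return labels

vocab = ["apple", "banana", "car", "cherry", "engine", "wheel"]
D = [  # rows: documents, columns: terms (toy data)
    [2, 1, 0, 0, 0, 0],
    [0, 1, 0, 1, 0, 0],
    [0, 0, 1, 0, 1, 1],
    [0, 0, 0, 0, 1, 2],
]
D_T = [list(col) for col in zip(*D)]  # rows: terms

labels = cluster_rows(D_T, threshold=0.5)
term_clusters = {}
for term, label in zip(vocab, labels):
    term_clusters.setdefault(label, []).append(term)
print(list(term_clusters.values()))
```

The same `cluster_rows` function clusters documents when it is given `D` instead of `D_T`.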


Frequent Termset

Apply Local Pattern Discovery and Frequent Patterns Mining techniques to terms:

  • we can treat a document as a transaction and its words as items
  • we want to find frequent itemsets of words in these documents
  • such itemsets are called Frequent Word Patterns
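
The transaction view can be sketched as follows. This is a brute-force count without Apriori-style pruning, so it is only suitable for tiny examples; `frequent_word_sets`, `min_support` and `max_size` are names introduced here for illustration.

```python
from collections import Counter
from itertools import combinations

def frequent_word_sets(docs, min_support, max_size=2):
    """Count word sets of size 1..max_size and keep those that occur
    in at least min_support documents."""
    counts = Counter()
    for doc in docs:
        # A document is a transaction: the set of words (items) it contains.
        words = sorted(set(doc.split()))
        for size in range(1, max_size + 1):
            counts.update(combinations(words, size))
    return {itemset: n for itemset, n in counts.items() if n >= min_support}

docs = [
    "apple banana apple",
    "banana cherry",
    "car engine wheel",
    "engine wheel wheel",
]

frequent = frequent_word_sets(docs, min_support=2)
for itemset, support in sorted(frequent.items()):
    print(itemset, support)
```

With a minimum support of two documents, the only frequent pair in this toy corpus is ("engine", "wheel").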


Two-Phase Document Clustering

Main idea:

  • first use Mutual Information to find the best term clustering
  • then use Mutual Information again to find the best document clustering, with documents represented by the term clusters
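
The sketch below illustrates only the criterion, not the two-phase algorithm itself: it measures how much mutual information between terms and documents survives when terms are merged into clusters. The count table and the two candidate clusterings are hypothetical, and `mutual_information` and `merge_rows` are helper names introduced here.

```python
import math

def mutual_information(joint):
    """I(X;Y) in bits for a joint count table (rows: X, columns: Y)."""
    total = sum(sum(row) for row in joint)
    px = [sum(row) / total for row in joint]
    py = [sum(col) / total for col in zip(*joint)]
    mi = 0.0
    for i, row in enumerate(joint):
        for j, n in enumerate(row):
            if n:  # empty cells contribute nothing
                pxy = n / total
                mi += pxy * math.log2(pxy / (px[i] * py[j]))
    return mi

def merge_rows(table, groups):
    """Sum the rows of table that belong to the same group."""
    return [[sum(table[i][j] for i in g) for j in range(len(table[0]))]
            for g in groups]

# Term-by-document counts (rows: terms, columns: documents); toy data.
term_doc = [
    [2, 0, 0, 0],  # apple
    [1, 1, 0, 0],  # banana
    [0, 0, 1, 0],  # car
    [0, 1, 0, 0],  # cherry
    [0, 0, 1, 1],  # engine
    [0, 0, 1, 2],  # wheel
]

# Two candidate term clusterings, given as groups of row indices.
good = merge_rows(term_doc, [[0, 1, 3], [2, 4, 5]])  # fruit vs. car terms
bad = merge_rows(term_doc, [[0, 2, 4], [1, 3, 5]])   # mixed topics

mi_full = mutual_information(term_doc)
mi_good = mutual_information(good)
mi_bad = mutual_information(bad)
print(round(mi_full, 3), round(mi_good, 3), round(mi_bad, 3))
```

Merging terms can never increase the mutual information, so a better term clustering is one that loses less of it; here the topical grouping keeps far more than the mixed one. The same measurement can then be repeated with documents clustered over the term clusters.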


Simultaneous Term/Document Clustering

Clustering the rows and columns of $D$ simultaneously is called Co-Clustering.


References

  • Slonim, Noam, and Naftali Tishby. "Document clustering using word clusters via the information bottleneck method." (2000). [1]


Sources

  • Li, Yong H., and Anil K. Jain. "Classification of text documents." (1998). [2]
  • Sebastiani, Fabrizio. "Machine learning in automated text categorization." (2002). [3]