Frequent Word Patterns

Frequent word patterns is a Local Pattern Discovery technique applied to documents

  • we can view a document as a transaction and its words as items
  • then we want to find frequent itemsets of words in these documents, as in Frequent Patterns Mining with Apriori or Eclat
  • frequent itemset $\equiv$ frequent wordset
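To make the document-as-transaction view concrete, here is a minimal sketch that counts wordsets by brute force. It is not Apriori or Eclat (there is no candidate pruning), and the function name `frequent_wordsets`, the whitespace tokenization, and the `max_size` cap are assumptions made for illustration only.

```python
from collections import Counter
from itertools import combinations


def frequent_wordsets(docs, min_support, max_size=2):
    """Return {wordset: support} for wordsets found in at least min_support documents."""
    # Each document becomes a transaction: the set of its distinct words.
    transactions = [frozenset(doc.lower().split()) for doc in docs]
    counts = Counter()
    for words in transactions:
        # Enumerate every wordset of up to max_size words in this document.
        for size in range(1, max_size + 1):
            for combo in combinations(sorted(words), size):
                counts[frozenset(combo)] += 1
    return {ws: c for ws, c in counts.items() if c >= min_support}


docs = [
    "apple banana cherry",
    "apple banana",
    "banana cherry",
    "apple cherry date",
]
result = frequent_wordsets(docs, min_support=2)
for wordset, support in sorted(result.items(), key=lambda kv: sorted(kv[0])):
    print(sorted(wordset), support)
```

Brute-force enumeration grows quickly with document length, which is why a real miner such as Apriori or Eclat would be used on anything beyond toy data.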


Document Clustering

We can use Frequent Patterns Mining (FPM) for term-based clustering of documents

  • cluster = all documents that contain a certain frequent term set
  • so frequent term sets describe clusters
  • note that this clustering is not strict (it's Fuzzy Clustering): clusters may overlap, because one document can contain several frequent term sets
  • such overlap is often natural for text documents
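The idea that a frequent term set describes a cluster can be sketched as follows. The helper name `cover` and the toy documents are assumptions for illustration; documents are represented as sets of words.

```python
def cover(term_set, docs):
    """Indices of the documents that contain every term of term_set."""
    return {i for i, words in enumerate(docs) if term_set <= words}


docs = [
    {"apple", "banana", "cherry"},
    {"apple", "banana"},
    {"banana", "cherry"},
    {"apple", "cherry", "date"},
]

cluster_ab = cover(frozenset({"apple", "banana"}), docs)
cluster_bc = cover(frozenset({"banana", "cherry"}), docs)

# Document 0 contains both term sets, so the two clusters overlap on it.
shared = cluster_ab & cluster_bc
print(cluster_ab, cluster_bc, shared)
```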


Problem formalization

  • let $R$ be the set of chosen frequent term sets (FTS)
  • let $f_i$ be the number of FTSs from $R$ contained in document $d_i$
  • we put a constraint on $f_i$: it must be at least 1 to ensure complete coverage (no document should be left without a cluster)
  • we want to minimize the overlap, i.e. the average value of $f_i - 1$ over all documents
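As a small worked example of this objective, the sketch below computes $f_i$ for each document and the average of $f_i - 1$ for a given choice of $R$. The function name `overlap_objective` and the toy data are assumptions, not part of the referenced paper.

```python
def overlap_objective(docs, chosen):
    """Return (f, average of f_i - 1) for the chosen frequent term sets."""
    # f[i] = number of chosen term sets fully contained in document i.
    f = [sum(1 for ts in chosen if ts <= words) for words in docs]
    # Coverage constraint: every document must belong to at least one cluster.
    if any(fi < 1 for fi in f):
        raise ValueError("some documents are not covered by any chosen term set")
    return f, sum(fi - 1 for fi in f) / len(f)


docs = [
    {"apple", "banana", "cherry"},
    {"apple", "banana"},
    {"banana", "cherry"},
    {"apple", "cherry", "date"},
]
chosen = [
    frozenset({"apple", "banana"}),
    frozenset({"banana", "cherry"}),
    frozenset({"apple", "cherry"}),
]
f, avg_overlap = overlap_objective(docs, chosen)
print(f, avg_overlap)  # [3, 1, 1, 1] 0.5
```

Only the first document lies in more than one cluster here, so it alone contributes to the overlap.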


Algorithm:

  • at each iteration, pick the FTS whose cluster has minimal overlap with the other clusters
  • repeat until all documents are covered
  • see the reference below for details
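The greedy selection can be sketched as below. This is a simplified illustration under stated assumptions, not the exact procedure from the reference: the overlap of a candidate is taken to be the average of $f_j - 1$ over the documents it still covers, where $f_j$ counts the remaining candidates containing document $j$; ties are broken by larger cover and then alphabetically; and documents already assigned are removed from later covers, so the reported clusters come out disjoint. The function name `greedy_term_set_clustering` is hypothetical.

```python
from collections import Counter


def greedy_term_set_clustering(docs, candidates):
    """Greedily pick term sets until every coverable document is assigned."""
    # Cover of each candidate: indices of documents containing all of its terms.
    covers = {ts: {i for i, words in enumerate(docs) if ts <= words}
              for ts in candidates}
    uncovered = set().union(*covers.values())
    selected = []
    while uncovered:
        # Restrict covers to unassigned documents and drop exhausted candidates.
        covers = {ts: c & uncovered for ts, c in covers.items() if c & uncovered}
        # f[j] = number of remaining candidates that contain document j.
        f = Counter(j for c in covers.values() for j in c)

        def overlap(ts):
            c = covers[ts]
            return sum(f[j] - 1 for j in c) / len(c)

        # Minimal overlap first; ties go to the larger cover, then alphabetical order.
        best = min(covers, key=lambda ts: (overlap(ts), -len(covers[ts]), sorted(ts)))
        selected.append((best, covers[best]))
        uncovered -= covers[best]
        del covers[best]
    return selected


docs = [
    {"apple", "banana", "cherry"},
    {"apple", "banana"},
    {"banana", "cherry"},
    {"apple", "cherry", "date"},
]
candidates = [frozenset(s) for s in (
    {"apple"}, {"banana"}, {"cherry"},
    {"apple", "banana"}, {"apple", "cherry"}, {"banana", "cherry"},
)]
selected = greedy_term_set_clustering(docs, candidates)
for term_set, members in selected:
    print(sorted(term_set), sorted(members))
```

Keeping the full covers of the selected term sets instead of the restricted ones would give overlapping clusters, at the cost of a nonzero overlap objective.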



References

  • Beil, Florian, Martin Ester, and Xiaowei Xu. "Frequent term-based text clustering." 2002. [1]

Sources

  • Aggarwal, Charu C., and ChengXiang Zhai. "A survey of text clustering algorithms." Mining Text Data. Springer US, 2012. [2]
