Frequent Word Patterns
Frequent word patterns is a Local Pattern Discovery technique applied to text documents
- we can see a document as a transaction and its words as items
- we then want to find frequent itemsets of words in these documents, as in Frequent Patterns Mining with Apriori or Eclat
- frequent itemset $\equiv$ frequent wordset
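The document-as-transaction view can be illustrated with a minimal sketch. It enumerates candidate word sets by brute force rather than with Apriori or Eclat pruning, so it is only suitable for toy inputs. The function name `frequent_term_sets`, the `max_size` cap, the whitespace tokenization, and the sample documents are all assumptions made for illustration.

```python
from itertools import combinations


def frequent_term_sets(docs, min_support, max_size=2):
    """Return {frozenset(terms): support} for term sets found in >= min_support docs.

    Brute-force enumeration (no Apriori/Eclat pruning); illustration only.
    """
    # Each document becomes a transaction: the set of its (lowercased) words.
    transactions = [set(d.lower().split()) for d in docs]
    vocabulary = sorted(set().union(*transactions))
    result = {}
    for size in range(1, max_size + 1):
        for candidate in combinations(vocabulary, size):
            term_set = frozenset(candidate)
            # Support = number of documents containing every term of the set.
            support = sum(1 for t in transactions if term_set <= t)
            if support >= min_support:
                result[term_set] = support
    return result


# Hypothetical toy corpus.
docs = [
    "apple banana cherry",
    "apple banana",
    "banana cherry",
    "apple durian",
]
fts = frequent_term_sets(docs, min_support=2)
for term_set, support in sorted(fts.items(), key=lambda kv: sorted(kv[0])):
    print(sorted(term_set), support)
```

With `min_support=2`, the word "durian" is dropped because it occurs in only one document, while pairs such as {apple, banana} survive.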
We can use Frequent Patterns Mining (FPM) for Term Clustering
- cluster = all documents that contain a certain frequent term set
- so frequent term sets describe clusters
- note that here clustering is not strict (it's Fuzzy Clustering): it allows some overlap between clusters
- which is sometimes natural in text documents
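As a small illustration of how a frequent term set describes a cluster, the sketch below computes the set of documents containing every term of a term set and shows that two such clusters can share a document. The helper name `cover`, the cluster labels, and the toy transactions are assumptions, not part of the referenced work.

```python
def cover(term_set, transactions):
    """Indices of the documents that contain every term of term_set."""
    return {i for i, t in enumerate(transactions) if set(term_set) <= t}


# Hypothetical documents, already converted to sets of words.
transactions = [
    {"apple", "banana", "cherry"},
    {"apple", "banana"},
    {"banana", "cherry"},
    {"apple", "durian"},
]

# Each cluster is described by the term set that generated it.
clusters = {
    "apple": cover({"apple"}, transactions),
    "banana+cherry": cover({"banana", "cherry"}, transactions),
}

# Documents that belong to both clusters (the overlap).
shared = clusters["apple"] & clusters["banana+cherry"]
print(clusters, shared)
```

Document 0 contains "apple" as well as both "banana" and "cherry", so it falls into both clusters.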
Problem formalization
- let $R$ be the set of chosen frequent term sets (FTS)
- let $f_i$ be the number of FTSs from $R$ contained in document $d_i$
- we put a constraint on $f_i$: it must be at least 1 to ensure complete coverage (no document should be left without a category)
- we want to minimize the average value of $f_i - 1$ over all documents, i.e. the average number of extra clusters a document belongs to
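The formalization above can be checked numerically with a short sketch. It computes $f_i$ for each document, enforces the coverage constraint $f_i \geq 1$, and returns the average of $f_i - 1$. The function name `overlap_objective` and the toy data are assumptions for illustration.

```python
def overlap_objective(transactions, chosen_term_sets):
    """Return (f, average of f_i - 1) for a chosen set R of frequent term sets."""
    # f[i] = number of chosen term sets fully contained in document i.
    f = [sum(1 for ts in chosen_term_sets if set(ts) <= t) for t in transactions]
    # Coverage constraint: every document must fall into at least one cluster.
    if any(fi < 1 for fi in f):
        raise ValueError("some document is not covered by any chosen term set")
    return f, sum(fi - 1 for fi in f) / len(f)


# Hypothetical documents as sets of words.
transactions = [
    {"apple", "banana", "cherry"},
    {"apple", "banana"},
    {"banana", "cherry"},
    {"apple", "durian"},
]
R = [{"apple"}, {"banana"}]
f, avg_overlap = overlap_objective(transactions, R)
print(f, avg_overlap)
```

Here the first two documents contain both chosen term sets, so $f = (2, 2, 1, 1)$ and the objective value is $0.5$. An objective of $0$ would mean every document belongs to exactly one cluster.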
Algorithm:
- at each iteration, pick the FTS with minimal overlap with the other clusters
- see the reference [1] for more details
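The greedy idea can be sketched as follows. This is a simplified illustration, not the exact procedure of the reference: it assumes that overlap is measured as the average of $f_i - 1$ over the documents of a candidate cluster, that documents covered by a picked cluster are removed before the next iteration, and that ties are broken alphabetically. The function name `greedy_select` and the toy data are hypothetical.

```python
def greedy_select(transactions, candidates):
    """Greedily pick term sets with minimal overlap until documents are covered.

    Returns (selected, uncovered): a list of (term_set, document indices)
    and the set of documents no candidate could cover.
    """
    remaining = set(range(len(transactions)))
    covers = {c: {i for i in remaining if c <= transactions[i]} for c in candidates}
    covers = {c: cov for c, cov in covers.items() if cov}
    selected = []
    while remaining and covers:
        # f[i] = number of remaining candidates whose cover contains document i.
        f = {i: sum(1 for cov in covers.values() if i in cov) for i in remaining}

        def overlap(c):
            # Assumed measure: average of (f_i - 1) over the candidate's cover.
            return sum(f[i] - 1 for i in covers[c]) / len(covers[c])

        # Minimal overlap first; alphabetical order breaks ties deterministically.
        best = min(covers, key=lambda c: (overlap(c), sorted(c)))
        picked = covers.pop(best)
        selected.append((best, picked))
        # Assumption: covered documents are removed from all remaining candidates.
        remaining -= picked
        covers = {c: cov - picked for c, cov in covers.items() if cov - picked}
    return selected, remaining


# Hypothetical documents and candidate frequent term sets.
transactions = [
    {"apple", "banana", "cherry"},
    {"apple", "banana"},
    {"banana", "cherry"},
    {"apple", "durian"},
]
candidates = [
    frozenset({"apple"}),
    frozenset({"banana"}),
    frozenset({"cherry"}),
    frozenset({"apple", "banana"}),
    frozenset({"banana", "cherry"}),
]
selected, uncovered = greedy_select(transactions, candidates)
for term_set, docs_in_cluster in selected:
    print(sorted(term_set), sorted(docs_in_cluster))
```

On this toy input the sketch first picks {apple}, which covers documents 0, 1 and 3, and then {banana} for the remaining document 2. Because covered documents are removed under the stated assumption, the clusters returned by this sketch do not overlap, even though the candidate covers do.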
References
- Beil, Florian, Martin Ester, and Xiaowei Xu. "Frequent term-based text clustering." 2002. [1]
Sources
- Aggarwal, Charu C., and ChengXiang Zhai. "A survey of text clustering algorithms." Mining Text Data. Springer US, 2012. [2]