Term Strength
Term Strength is a technique for Feature Selection in Text Mining
- it doesn’t need a pre-defined list of Stop Words - it discovers them automatically
- so it’s a technique for vocabulary reduction in text retrieval
- this method estimates term importance based on how often a term appears in “related” documents
'’Strength’’ of a term $t$
- measures how informative a word is for identifying two related documents
- $s(t) = P(t \in y \mid t \in x)$
- for two related documents $x, y$ what’s the probability that $t$ belongs to $y$ given it belongs to $x$?
- estimate $s(t)$ on training data using Maximum Likelihood Estimation
What does it mean “related”?
- if we know the labels of these documents, then related are those that belong to the same category
- what about Unsupervised Learning?
Document Clustering
Can we use this for unsupervised learning?
How to find such $x$ and $y$?
- manual or with user feedback - not practical
Can we automate it?
Yes (Wilbur1992):
- use Cosine Similarity to find most related documents
- set some threshold $t$ and let all pairs with cosine $> t$ be related
Then we can estimate $s(t)$ using Maximum Likelihood Estimation for Multinomial Distribution \(\hat s(t) = \cfrac{\text{# of pairs where $t$ occurs both in $x$ and $y$}}{\text{# of pairs where $t$ occurs in $x$}}\)
Pruning
Let expected strength be $z = \mathbb E_t [s(t)]$
- we estimate $z$ as $\hat z = \cfrac{1}- let $\sigma = \text{sd}\big( s(t) \big)$ - how much $s(t)$ varies
- we prune term $t$ if $s(t) \leqslant 2 \sigma \, z$
Sources
- Aggarwal, Charu C., and ChengXiang Zhai. “A survey of text clustering algorithms.” Mining Text Data. Springer US, 2012. [http://ir.nmu.org.ua/bitstream/handle/123456789/144935/d1784ebed3eab2708026b202b2b65309.pdf?sequence=1#page=90]
- Wilbur, W. John, and Karl Sirotkin. “The automatic identification of stop words.” 1992. [https://www.researchgate.net/publication/247786801_The_automatic_identification_of_stop_words]