# ML Wiki

## Collocation Extraction

There's no commonly accepted definition of "collocation", but it's usually a multiple-word token (i.e. an $n$-gram) where

• the words of the ngram co-occur together more often than by chance alone
• usually they mean something collectively, something different from what each word means alone

Other names for collocations:

• compound terms
• multi-word terms
• $n$-grams

### Types of Collocations

There are two types of collocations:

• Open Compounds
• Compositional Collocation

Open Compounds

• uninterrupted sequences of words that generally function as a single "word" in the text
• e.g. "stock market", "foreign exchange"

Compositional Collocation

• a sequence interrupted by preposition or a conjunction
• e.g. "cure for cancer", "guns and ammunitions"

### Applications

There are following applications for Collocation Extraction

• in IR these words may represent documents better than unigram tokens
• some tri-grams and higher order $n$-grams may be good for IR, even though they aren't compound term - in the sense of the definitions above.

### Identification of Collocations

How to find and extract collocations?

• by "co-occur more often than by chance" we can mean independence
• so we can test if the words in the collocation are independent or not
• if they aren't - then maybe it's a collocation
• we also can use some measures than quantify the dependence between words - and the higher the measure, the more likely the words form a collocation

## Two Word Collocations

Let us start by considering the bigram collocation case

• suppose the collocation candidate is $(w_1, w_2)$
• if both $w_1$ and $w_2$ are frequent, they may co-locate just by chance, even if they don't form a collocation
• we want to test if this collocation occurs significantly more often than by chance
• for that we can compare how often $w_1$ and $w_2$ occur independently, and how often they occur together
• can do Hypothesis Testing for that

General framework:

• under the Independence hypothesis ($H_0$) we assume that there is no association between $w_1$ and $w_2$, i.e. they are independent
• let $P(w_1)$ and $P(w_2)$ are probabilities that a random token in a text is $w_1$ and $w_2$ resp.
• and $P(w_1, w_2)$ is the probability that $(w_1, w_2)$ occur together in the text (i.e. one follows another)
• so under $H_0$, $P(w_1, w_2) = P(w_1)\, P(w_2)$
• we can compute the observed probability of $P(w_1, w_2)$ and compare it with the probability under $H_0$
• if these probabilities are significantly different from each other, then $(w_1, w_2)$ is a collocation

Ranking candidates

• if we can measure the degree of dependence, we can rank the candidate collocation
• usually tests have some test statistics which we can use for ranking candidates
• often it is more interesting to look at the top ranking candidates rather than at all of them

Ways to test/rank:

### $T$-tests

$t$-test can also be used for collocation discovery

• it looks at the mean and variance
• can use $t$-test for proportions: $t = \cfrac{\bar x - \mu}{\sqrt{s^2 / N}}$

The idea:

• The probability of generating a bigram $(w_1, w_2)$ is a Bernoulli Random Variable with $p= P(w_1, w_2)$
• for Bernoulli, $\mu = p$ and $\sigma^2 = p\, (1-p)$
• because $p$ is typically quite small, $\sigma^2 \approx p$ for most bigrams
• then we can calculate the $t$-value and the corresponding critical value ($N$ = number of unigrams)
• and if $t$ is large enough, reject $H_0$
• but in this case, it's more useful to rank values by their $t$ rather than just know if a bigram passes a test or not

### Other Tests

Other tests can also be used

it's applied to 2-by-2 table

• in essence, it compares observed frequencies to expected frequencies
• if the difference is large - reject $H_0$
• $X^2 = \sum_{ij} \cfrac{(O_{ij} - E_{ij})^2}{E_{ij}}$
• $X^2$ is distributed as $\chi^2$

### Odds Ratio Test

To compare $P(w_1, w_2)$ with $P(w_1) \, P(w_2)$ we can use the Odds Ratio:

• Odds Ratio is $\cfrac{P(w_1, w_2)}{P(w_1) \, P(w_2)}$
• collocations should have scores higher than 1
• the higher the ratio, the more likely a bigram is a collocation

### Point-Wise Mutual Information

• it's a measure on how much one word tells about the other
• In Information Theory, Mutual Information is defined between Random Variables, not words (values of RVs)
• which is why we use PMI

Point-Wise Mutual Information (PMI):

• instead of Odds Ratio, can use Log Odds:
• $\log \cfrac{P(w_1, w_2)}{P(w_1) \, P(w_2)}$
• then it becomes Point-Wise Mutual Information
• $\text{PMI}(w_1, w_2) = \log \cfrac{P(w_1, w_2)}{P(w_1)\, P(w_2)} = \log \cfrac{P(w_1) \, P(w_2 \mid w_1)}{P(w_1)\, P(w_2)} = \log \cfrac{P(w_2 \mid w_1)}{P(w_2)}$

PMI is the amount of information we get when

• a word $w_1$ occurs at position $i$
• and it's followed by $w_2$ at position $i+1$

PMI can generalize to any $n$-grams

• suppose $\mathbf x$ and $\mathbf y$ are vectors (not necessarily of the same dimensions)
• then $\text{PMI}(\mathbf x, \mathbf y) = \log \cfrac{P(\mathbf x, \mathbf y)}{P(\mathbf x)\, P(\mathbf y)}$
• also see the #$n$-Gram Collocations section

## Estimates

### Maximum Likelihood Estimator

Estimation of $P(w_1)$, $P(w_2)$ and $P(w_1, w_2)$:

MLE of $P(w_1)$ and $P(w_2)$:

• $\hat P(w) = \cfrac{c(w)}{N}$, where
• $c(w)$ is the number of times $w_1$ appeared in the corpus, and
• $N$ is the total number of tokens

MLE of $P(w_1, w_2)$

• $P(w_1, w_2) = P(w_1) \, P(w_2 \mid w_1)$, so
• $\hat P(w_1, w_2) = \hat P(w_1) \, \hat P(w_2 \mid w_1) = \cfrac{c(w_1)}{N} \cdot \cfrac{c(w_1, w_2)}{c(w_1)} = \cfrac{c(w_1, w_2)}{N}$

so MLE estimate of PMI is

• $\text{PMI}(w_1, w_2) = \log \cfrac{\hat P(w_2 \mid w_1)}{\hat P(w_2)} = \log \left( \cfrac{c(w_1, w_2)}{c(w_1)} \cdot \cfrac{N}{c(w_2)} \right)$

## $n$-Gram Collocations

For 3-grams:

• treat $(w_1, w_2)$ as a single var
• $\text{PMI}\Big((w_1, w_2), w_3 \Big) = \log \cfrac{P\big((w_1, w_2), w_3\big)}{P(w_1, w_2)\, P(w_3)} = \log \cfrac{P\big(w_1, w_2, w_3\big)}{P(w_1, w_2)\, P(w_3)}$
• $P(w_1, w_2, w_3) = P(w_1) \, P(w_2 \mid w_1) \, P(w_3 \mid w_1, w_2)$
• $\hat P(w_1, w_2, w_3) = \cfrac{c(w_1)}{N} \cdot \cfrac{c(w_1, w_2)}{c(w_1)} \cdot \cfrac{c(w_1, w_2, w_3)}{c(w_1, w_2)} = \cfrac{c(w_1, w_2, w_3)}{N}$
• So estimate of $\text{PMI}\big((w_1, w_2), w_3\big)$ is
• $\text{PMI}\big((w_1, w_2), w_3\big) = \log \left( \cfrac{c(w_1, w_2, w_3)}{N} \cdot \cfrac{N}{c(w_1, w_2)} \cdot \cfrac{N}{c(w_3)} \right) = \log \cfrac{N \cdot c(w_1, w_2, w_3)}{c(w_1, w_2) \, c(w_3)}$

For $n$-grams:

• $\text{PMI}\big((w_1, \, ... \, , w_n), w_{n+1} \big) = \log \cfrac{P(w_1, \ ... \ , w_{n+1})}{P(w_1, \ ... \ , w_n)\, P(w_{n+1})}$
• Estimate:
• $\text{PMI}\big((w_1, \, ... \, , w_n), w_{n+1} \big) = \log \cfrac{N \cdot c(w_1, \ ... \ , w_{n+1})}{c(w_1, \ ... \ , w_n) \, w(n_{n+1})}$

## References

• McInnes B. T. "Extending the log likelihood measure to improve collocation identification", 2004.
• Boulis C. "Clustering of Cepstrum Coefficients Using Pairwise Mutual Information", 2002.
• Tadić M., Šojat K. "Finding multiword term candidates in Croatian", 2003.
• Gerlof Bouma, "Normalized (Pointwise) Mutual Information in Collocation Extraction" [1]

## Sources

• De Kok D., Brouwer H. "Natural language processing for the working programmer". 2011. Chapter 3 [2] archived version version, google scholar
• Manning C. D., Schütze H. "Foundations of statistical natural language processing", 1999. Chapter 5 [3]
• Petrović S. et al. "Comparison of collocation extraction measures for document indexing", 2006. [4]