Collocation Extraction

There's no commonly accepted definition of "collocation", but it usually refers to a multi-word expression (i.e. an $n$-gram) where

  • the words of the $n$-gram co-occur more often than by chance alone
  • collectively the words usually mean something different from what each word means alone

Other names for collocations:

  • compound terms
  • multi-word terms
  • $n$-grams

Types of Collocations

There are two types of collocations:

  • Open Compounds
  • Compositional Collocation

Open Compounds

  • uninterrupted sequences of words that generally function as a single "word" in the text
  • e.g. "stock market", "foreign exchange"

Compositional Collocation

  • a sequence interrupted by a preposition or a conjunction
  • e.g. "cure for cancer", "guns and ammunition"

Applications

Collocation extraction has the following applications:

Information Retrieval

  • in IR, such terms may represent documents better than unigram tokens
  • some tri-grams and higher-order $n$-grams may also be useful for IR, even though they aren't compound terms in the sense of the definitions above


Identification of Collocations

How to find and extract collocations?

  • by "co-occur more often than by chance" we can mean independence
  • so we can test if the words in the collocation are independent or not
  • if they aren't - then maybe it's a collocation
  • we also can use some measures than quantify the dependence between words - and the higher the measure, the more likely the words form a collocation
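As a rough illustration of the candidate-generation step, here is a minimal Python sketch (the toy corpus and the frequency cutoff are assumptions of the example) that counts unigrams and adjacent bigrams; such counts feed the tests described below.

```python
from collections import Counter

def bigram_candidates(tokens, min_count=2):
    """Count unigrams and adjacent bigrams; keep bigrams seen at least min_count times."""
    unigram_counts = Counter(tokens)
    bigram_counts = Counter(zip(tokens, tokens[1:]))
    candidates = {bg: c for bg, c in bigram_counts.items() if c >= min_count}
    return unigram_counts, candidates

# toy usage with a made-up corpus
tokens = "the stock market fell as the stock market opened lower".split()
unigrams, candidates = bigram_candidates(tokens)
print(candidates)  # {('the', 'stock'): 2, ('stock', 'market'): 2}
```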


Two Word Collocations

Let us start by considering the bigram collocation case

  • suppose the collocation candidate is $(w_1, w_2)$
  • if both $w_1$ and $w_2$ are frequent, they may co-occur just by chance, even if they don't form a collocation
  • we want to test if this collocation occurs significantly more often than by chance
  • for that we can compare how often $w_1$ and $w_2$ occur independently, and how often they occur together
  • we can use Hypothesis Testing for that

General framework:

  • under the Independence hypothesis ($H_0$) we assume that there is no association between $w_1$ and $w_2$, i.e. they are independent
  • let $P(w_1)$ and $P(w_2)$ be the probabilities that a random token in the text is $w_1$ or $w_2$, respectively
  • and let $P(w_1, w_2)$ be the probability that $w_1$ and $w_2$ occur together in the text (i.e. one immediately follows the other)
  • so under $H_0$, $P(w_1, w_2) = P(w_1)\, P(w_2)$
  • we can compute the observed probability of $P(w_1, w_2)$ and compare it with the probability under $H_0$
  • if these probabilities are significantly different from each other, then $(w_1, w_2)$ is a collocation
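A small worked example with made-up counts: suppose the corpus has $N = 100{,}000$ tokens, $c(w_1) = 800$, $c(w_2) = 600$, and the bigram occurs $c(w_1, w_2) = 30$ times. Under $H_0$ we'd expect $P(w_1)\, P(w_2) = 0.008 \cdot 0.006 = 4.8 \times 10^{-5}$, i.e. about $4.8$ occurrences of the bigram, while the observed probability is $30 / 100{,}000 = 3 \times 10^{-4}$ (30 occurrences), so the pair co-occurs far more often than independence predicts.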


Ranking candidates

  • if we can measure the degree of dependence, we can rank the candidate collocations
  • tests usually produce a test statistic that we can use for ranking candidates
  • often it is more interesting to look at the top-ranking candidates rather than at all of them (a minimal ranking sketch follows)
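A minimal sketch of the ranking step, assuming the candidates and a scoring function (any of the statistics below) are supplied by the caller:

```python
def rank_candidates(candidates, score, top_k=20):
    """Sort candidate n-grams by a dependence score, highest first, and keep the top k."""
    return sorted(candidates, key=score, reverse=True)[:top_k]

# hypothetical usage: `candidates` is an iterable of n-grams and
# `pmi_score` maps an n-gram to its PMI (or t, X^2, ...) value
# top = rank_candidates(candidates, pmi_score)
```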


Ways to test/rank:


$T$-tests

The $t$-test can also be used for collocation discovery:

  • it looks at the mean and variance
  • can use $t$-test for proportions: $t = \cfrac{\bar x - \mu}{\sqrt{s^2 / N}}$

The idea:

  • whether a given position in the text generates the bigram $(w_1, w_2)$ is a Bernoulli Random Variable with $p = P(w_1, w_2)$
  • for Bernoulli, $\mu = p$ and $\sigma^2 = p\, (1-p)$
  • because $p$ is typically quite small, $\sigma^2 \approx p$ for most bigrams
  • then we can calculate the $t$-value and compare it with the corresponding critical value ($N$ = number of tokens in the corpus)
  • and if $t$ is large enough, reject $H_0$
  • but in this case, it's more useful to rank bigrams by their $t$-values than just to check whether they pass the test (see the sketch below)
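A minimal sketch of this $t$-score for a single bigram, under the approximation $s^2 \approx \bar x$ used above; the counts are made up for the example:

```python
import math

def t_score(c1, c2, c12, N):
    """t statistic for a bigram under H0: P(w1, w2) = P(w1) * P(w2).
    x_bar is the observed bigram probability (sample mean of the Bernoulli
    indicator), mu is its expectation under H0; s^2 is approximated by x_bar."""
    x_bar = c12 / N              # observed P(w1, w2)
    mu = (c1 / N) * (c2 / N)     # expected P(w1, w2) under independence
    return (x_bar - mu) / math.sqrt(x_bar / N)

# made-up counts: N tokens, c(w1) = 800, c(w2) = 600, c(w1, w2) = 30
print(t_score(c1=800, c2=600, c12=30, N=100_000))  # ~4.6, above the 2.576 critical value (alpha = 0.005), so H0 is rejected
```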


Other Tests

Other tests can also be used, for example Pearson's $\chi^2$ test.

It's applied to a 2-by-2 contingency table of bigram counts:

  • in essence, it compares observed frequencies to expected frequencies
  • if the difference is large - reject $H_0$
  • $X^2 = \sum_{ij} \cfrac{(O_{ij} - E_{ij})^2}{E_{ij}}$
  • $X^2$ is approximately $\chi^2$-distributed (with 1 degree of freedom for a 2-by-2 table)
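A sketch of this computation for one bigram; the 2-by-2 table layout (rows: first word is/isn't $w_1$; columns: second word is/isn't $w_2$) and the counts are assumptions of the example:

```python
def chi_squared(c1, c2, c12, N):
    """Pearson's X^2 statistic for the 2x2 table of bigram counts."""
    observed = [
        [c12,      c1 - c12],           # first word is w1
        [c2 - c12, N - c1 - c2 + c12],  # first word is not w1
    ]
    row_totals = [sum(row) for row in observed]
    col_totals = [sum(col) for col in zip(*observed)]
    x2 = 0.0
    for i in range(2):
        for j in range(2):
            expected = row_totals[i] * col_totals[j] / N
            x2 += (observed[i][j] - expected) ** 2 / expected
    return x2

# same made-up counts as the t-test example
print(chi_squared(c1=800, c2=600, c12=30, N=100_000))  # ~134, far above the 3.84 critical value (1 df, alpha = 0.05)
```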


Odds Ratio Test

To compare $P(w_1, w_2)$ with $P(w_1) \, P(w_2)$ we can use the Odds Ratio:

  • Odds Ratio is $\cfrac{P(w_1, w_2)}{P(w_1) \, P(w_2)}$
  • collocations should have scores higher than 1
  • the higher the ratio, the more likely a bigram is a collocation
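Continuing the made-up counts from the worked example above: the ratio is $\cfrac{30 / N}{(800 / N)(600 / N)} = \cfrac{30 \cdot 100{,}000}{800 \cdot 600} = 6.25$, well above 1.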


Point-Wise Mutual Information

Mutual Information:

  • it's a measure of how much one word tells us about the other
  • In Information Theory, Mutual Information is defined between Random Variables, not words (values of RVs)
  • which is why we use PMI

Point-Wise Mutual Information (PMI):

  • instead of the Odds Ratio, we can use its logarithm:
  • $\log \cfrac{P(w_1, w_2)}{P(w_1) \, P(w_2)}$
  • then it becomes Point-Wise Mutual Information
  • $\text{PMI}(w_1, w_2) = \log \cfrac{P(w_1, w_2)}{P(w_1)\, P(w_2)} = \log \cfrac{P(w_1) \, P(w_2 \mid w_1)}{P(w_1)\, P(w_2)} = \log \cfrac{P(w_2 \mid w_1)}{P(w_2)}$


PMI is the amount of information that observing $w_1$ at position $i$ gives us about $w_2$ following at position $i+1$.


PMI generalizes to arbitrary $n$-grams:

  • suppose $\mathbf x$ and $\mathbf y$ are vectors of words (not necessarily of the same dimension)
  • then $\text{PMI}(\mathbf x, \mathbf y) = \log \cfrac{P(\mathbf x, \mathbf y)}{P(\mathbf x)\, P(\mathbf y)}$
  • also see the #$n$-Gram Collocations section


Estimates

Maximum Likelihood Estimator

Estimation of $P(w_1)$, $P(w_2)$ and $P(w_1, w_2)$:

MLE of $P(w_1)$ and $P(w_2)$:

  • $\hat P(w) = \cfrac{c(w)}{N}$, where
  • $c(w)$ is the number of times $w$ appeared in the corpus, and
  • $N$ is the total number of tokens


MLE of $P(w_1, w_2)$

  • $P(w_1, w_2) = P(w_1) \, P(w_2 \mid w_1)$, so
  • $\hat P(w_1, w_2) = \hat P(w_1) \, \hat P(w_2 \mid w_1) = \cfrac{c(w_1)}{N} \cdot \cfrac{c(w_1, w_2)}{c(w_1)} = \cfrac{c(w_1, w_2)}{N}$


So the MLE estimate of PMI is

  • $\text{PMI}(w_1, w_2) = \log \cfrac{\hat P(w_2 \mid w_1)}{\hat P(w_2)} = \log \left( \cfrac{c(w_1, w_2)}{c(w_1)} \cdot \cfrac{N}{c(w_2)} \right)$
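A minimal sketch of this count-based estimate; the log base and the counts are arbitrary choices for the example:

```python
import math

def pmi(c1, c2, c12, N, base=2):
    """MLE estimate of PMI(w1, w2) from corpus counts:
    log( c(w1, w2) / c(w1) * N / c(w2) )."""
    return math.log(c12 * N / (c1 * c2), base)

# made-up counts again: log2(30 * 100000 / (800 * 600)) ~= 2.64 bits
print(pmi(c1=800, c2=600, c12=30, N=100_000))
```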


$n$-Gram Collocations

For 3-grams:

  • treat $(w_1, w_2)$ as a single variable
  • $\text{PMI}\Big((w_1, w_2), w_3 \Big) = \log \cfrac{P\big((w_1, w_2), w_3\big)}{P(w_1, w_2)\, P(w_3)} = \log \cfrac{P\big(w_1, w_2, w_3\big)}{P(w_1, w_2)\, P(w_3)}$
    • $P(w_1, w_2, w_3) = P(w_1) \, P(w_2 \mid w_1) \, P(w_3 \mid w_1, w_2)$
    • $\hat P(w_1, w_2, w_3) = \cfrac{c(w_1)}{N} \cdot \cfrac{c(w_1, w_2)}{c(w_1)} \cdot \cfrac{c(w_1, w_2, w_3)}{c(w_1, w_2)} = \cfrac{c(w_1, w_2, w_3)}{N}$
  • So estimate of $\text{PMI}\big((w_1, w_2), w_3\big)$ is
    • $\text{PMI}\big((w_1, w_2), w_3\big) = \log \left( \cfrac{c(w_1, w_2, w_3)}{N} \cdot \cfrac{N}{c(w_1, w_2)} \cdot \cfrac{N}{c(w_3)} \right) = \log \cfrac{N \cdot c(w_1, w_2, w_3)}{c(w_1, w_2) \, c(w_3)}$


For $n$-grams:

  • $\text{PMI}\big((w_1, \, ... \, , w_n), w_{n+1} \big) = \log \cfrac{P(w_1, \ ... \ , w_{n+1})}{P(w_1, \ ... \ , w_n)\, P(w_{n+1})}$
  • Estimate:
  • $\text{PMI}\big((w_1, \, ... \, , w_n), w_{n+1} \big) = \log \cfrac{N \cdot c(w_1, \ ... \ , w_{n+1})}{c(w_1, \ ... \ , w_n) \, c(w_{n+1})}$
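A sketch of this general estimate, with counts supplied as a dictionary keyed by tuples of words; the data structure and the trigram counts are assumptions of the example:

```python
import math

def ngram_pmi(ngram, counts, N):
    """PMI((w1 .. wn), w_{n+1}) estimated from corpus counts:
    log( N * c(w1 .. w_{n+1}) / (c(w1 .. wn) * c(w_{n+1})) ).
    `counts` maps tuples of words to their corpus frequency."""
    head, last = ngram[:-1], ngram[-1:]
    return math.log(N * counts[ngram] / (counts[head] * counts[last]))

# made-up counts for a hypothetical trigram
counts = {
    ("New", "York"): 450,
    ("Times",): 300,
    ("New", "York", "Times"): 120,
}
print(ngram_pmi(("New", "York", "Times"), counts, N=100_000))  # log(100000 * 120 / (450 * 300)) ~= 4.49
```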


