There's no commonly accepted definition of "collocation", but it's usually a multi-word expression (i.e. an $n$-gram) where
- the words of the $n$-gram co-occur more often than by chance alone
- collectively they usually mean something different from what each word means alone
Other names for collocations:
- compound terms
- multi-word terms
- $n$-grams
Types of Collocations
There are two types of collocations:
- Open Compounds
- Compositional Collocation
Open Compounds
- uninterrupted sequences of words that generally function as a single "word" in the text
- e.g. "stock market", "foreign exchange"
Compositional Collocation
- a sequence interrupted by a preposition or a conjunction
- e.g. "cure for cancer", "guns and ammunition"
Applications
Collocation extraction has the following applications:
Information Retrieval
- in IR, such terms may represent documents better than unigram tokens
- some tri-grams and higher-order $n$-grams may also be useful for IR, even though they aren't compound terms in the sense of the definitions above
Identification of Collocations
How to find and extract collocations?
- by "co-occur more often than by chance" we can mean independence
- so we can test if the words in the collocation are independent or not
- if they aren't - then maybe it's a collocation
- we also can use some measures than quantify the dependence between words - and the higher the measure, the more likely the words form a collocation
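As a minimal sketch of the first step, here is one way to collect adjacent bigram candidates from a tokenized corpus before scoring them. The function name and the `min_count` cutoff are assumptions (a common heuristic to discard very rare bigrams), not anything prescribed above:

```python
from collections import Counter

def candidate_bigrams(tokens, min_count=5):
    """Collect adjacent bigrams that are frequent enough to be worth scoring.

    tokens    -- the corpus as a list of word tokens
    min_count -- assumed frequency cutoff: very rare bigrams give unreliable scores
    """
    bigrams = Counter(zip(tokens, tokens[1:]))
    return {bigram: count for bigram, count in bigrams.items() if count >= min_count}
```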
Two Word Collocations
Let us start by considering the bigram collocation case
- suppose the collocation candidate is $(w_1, w_2)$
- if both $w_1$ and $w_2$ are frequent, they may co-locate just by chance, even if they don't form a collocation
- we want to test if this collocation occurs significantly more often than by chance
- for that we can compare how often $w_1$ and $w_2$ occur independently, and how often they occur together
- can do Hypothesis Testing for that
General framework:
- under the Independence hypothesis ($H_0$) we assume that there is no association between $w_1$ and $w_2$, i.e. they are independent
- let $P(w_1)$ and $P(w_2)$ be the probabilities that a random token in the text is $w_1$ or $w_2$, respectively
- and let $P(w_1, w_2)$ be the probability that $(w_1, w_2)$ occurs in the text (i.e. $w_2$ immediately follows $w_1$)
- so under $H_0$, $P(w_1, w_2) = P(w_1)\, P(w_2)$
- we can compute the observed probability of $P(w_1, w_2)$ and compare it with the probability under $H_0$
- if these probabilities are significantly different, then $(w_1, w_2)$ is likely a collocation (see the sketch below)
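A minimal sketch of this comparison, assuming the unigram counts `c1`, `c2`, the bigram count `c12`, and the corpus size `N` are already available; the tests below turn this difference into a proper statistic:

```python
def observed_vs_expected(c1, c2, c12, N):
    """Compare the observed bigram probability with the one expected under H0.

    c1, c2 -- counts of w1 and w2;  c12 -- count of the bigram (w1, w2)
    N      -- total number of tokens in the corpus
    """
    p_observed = c12 / N              # P(w1, w2) estimated from the corpus
    p_expected = (c1 / N) * (c2 / N)  # P(w1) * P(w2), i.e. independence
    return p_observed, p_expected
```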
Ranking candidates
- if we can measure the degree of dependence, we can rank the candidate collocations
- usually a test has a test statistic that we can use for ranking candidates
- often it is more interesting to look at the top ranking candidates rather than at all of them
Ways to test/rank:
The $t$-test can be used for collocation discovery
- it compares the sample mean with the expected mean, taking the sample variance into account
- can use $t$-test for proportions: $t = \cfrac{\bar x - \mu}{\sqrt{s^2 / N}}$
The idea:
- The probability of generating a bigram $(w_1, w_2)$ is a Bernoulli Random Variable with $p= P(w_1, w_2)$
- for Bernoulli, $\mu = p$ and $\sigma^2 = p\, (1-p)$
- because $p$ is typically quite small, $\sigma^2 \approx p$ for most bigrams
- then we can calculate the $t$-value and the corresponding critical value ($N$ = the total number of tokens)
- and if $t$ is large enough, reject $H_0$
- but in this case, it's more useful to rank bigrams by their $t$-values than to just check whether they pass the test (see the sketch below)
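A minimal sketch of this $t$-score under the Bernoulli approximation described above; the count arguments are the same hypothetical `c1`, `c2`, `c12`, `N` as in the earlier sketch:

```python
import math

def t_score(c1, c2, c12, N):
    """t statistic for the bigram (w1, w2) under the Bernoulli approximation."""
    x_bar = c12 / N               # observed bigram probability (the sample mean)
    mu = (c1 / N) * (c2 / N)      # expected probability under H0 (independence)
    s2 = x_bar * (1 - x_bar)      # Bernoulli variance, roughly x_bar for small p
    return (x_bar - mu) / math.sqrt(s2 / N)
```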
Other Tests
Other tests can also be used
Pearson's $\chi^2$ test: it's applied to a 2-by-2 contingency table of bigram counts
- in essence, it compares observed frequencies to expected frequencies
- if the difference is large, reject $H_0$
- $X^2 = \sum_{ij} \cfrac{(O_{ij} - E_{ij})^2}{E_{ij}}$
- $X^2$ is asymptotically distributed as $\chi^2$ with 1 degree of freedom for a 2-by-2 table (see the sketch below)
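A minimal sketch of the $X^2$ computation on the 2-by-2 table of bigram counts, using the same assumed count arguments as the earlier sketches:

```python
def chi_square(c1, c2, c12, N):
    """Pearson's X^2 statistic for the 2-by-2 contingency table of a bigram (w1, w2)."""
    # observed counts: rows = (w1, not w1) in first position, columns = (w2, not w2) in second
    observed = [[c12,      c1 - c12],
                [c2 - c12, N - c1 - c2 + c12]]
    row_totals = [sum(row) for row in observed]
    col_totals = [observed[0][j] + observed[1][j] for j in range(2)]
    x2 = 0.0
    for i in range(2):
        for j in range(2):
            expected = row_totals[i] * col_totals[j] / N  # expected count under H0
            x2 += (observed[i][j] - expected) ** 2 / expected
    return x2
```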
To compare $P(w_1, w_2)$ with $P(w_1) \, P(w_2)$ we can use the Odds Ratio:
- Odds Ratio is $\cfrac{P(w_1, w_2)}{P(w_1) \, P(w_2)}$
- collocations should have scores higher than 1
- the higher the ratio, the more likely a bigram is a collocation
Mutual Information:
- it's a measure of how much one word tells us about the other
- In Information Theory, Mutual Information is defined between Random Variables, not words (values of RVs)
- which is why we use PMI
Point-Wise Mutual Information (PMI):
- instead of Odds Ratio, can use Log Odds:
- $\log \cfrac{P(w_1, w_2)}{P(w_1) \, P(w_2)}$
- then it becomes Point-Wise Mutual Information
- $\text{PMI}(w_1, w_2) = \log \cfrac{P(w_1, w_2)}{P(w_1)\, P(w_2)} = \log \cfrac{P(w_1) \, P(w_2 \mid w_1)}{P(w_1)\, P(w_2)} = \log \cfrac{P(w_2 \mid w_1)}{P(w_2)}$
PMI is the amount of information we get when
- a word $w_1$ occurs at position $i$
- and it's followed by $w_2$ at position $i+1$
PMI generalizes to arbitrary $n$-grams
- suppose $\mathbf x$ and $\mathbf y$ are sequences of words (not necessarily of the same length)
- then $\text{PMI}(\mathbf x, \mathbf y) = \log \cfrac{P(\mathbf x, \mathbf y)}{P(\mathbf x)\, P(\mathbf y)}$
- also see the #$n$-Gram Collocations section
Estimates
Estimation of $P(w_1)$, $P(w_2)$ and $P(w_1, w_2)$:
MLE of $P(w_1)$ and $P(w_2)$:
- $\hat P(w) = \cfrac{c(w)}{N}$, where
- $c(w)$ is the number of times $w$ appeared in the corpus, and
- $N$ is the total number of tokens
MLE of $P(w_1, w_2)$
- $P(w_1, w_2) = P(w_1) \, P(w_2 \mid w_1)$, so
- $\hat P(w_1, w_2) = \hat P(w_1) \, \hat P(w_2 \mid w_1) = \cfrac{c(w_1)}{N} \cdot \cfrac{c(w_1, w_2)}{c(w_1)} = \cfrac{c(w_1, w_2)}{N}$
so the MLE estimate of PMI is
- $\text{PMI}(w_1, w_2) = \log \cfrac{\hat P(w_2 \mid w_1)}{\hat P(w_2)} = \log \left( \cfrac{c(w_1, w_2)}{c(w_1)} \cdot \cfrac{N}{c(w_2)} \right)$
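A minimal sketch of this estimate computed directly from a tokenized corpus, which also ties back to the ranking idea above. The function name and the `Counter`-based counting are implementation choices, not part of the formulas:

```python
import math
from collections import Counter

def pmi_ranking(tokens):
    """Rank adjacent bigrams by the MLE estimate of PMI: log(c(w1,w2)/c(w1) * N/c(w2))."""
    N = len(tokens)
    unigrams = Counter(tokens)
    bigrams = Counter(zip(tokens, tokens[1:]))
    scores = {
        (w1, w2): math.log(c12 / unigrams[w1] * N / unigrams[w2])
        for (w1, w2), c12 in bigrams.items()
    }
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)
```

Note that raw PMI tends to favor very rare bigrams, so in practice it's often combined with a frequency cutoff like the one sketched earlier.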
$n$-Gram Collocations
For 3-grams:
- treat $(w_1, w_2)$ as a single var
- $\text{PMI}\Big((w_1, w_2), w_3 \Big) = \log \cfrac{P\big((w_1, w_2), w_3\big)}{P(w_1, w_2)\, P(w_3)} = \log \cfrac{P\big(w_1, w_2, w_3\big)}{P(w_1, w_2)\, P(w_3)}$
- $P(w_1, w_2, w_3) = P(w_1) \, P(w_2 \mid w_1) \, P(w_3 \mid w_1, w_2)$
- $\hat P(w_1, w_2, w_3) = \cfrac{c(w_1)}{N} \cdot \cfrac{c(w_1, w_2)}{c(w_1)} \cdot \cfrac{c(w_1, w_2, w_3)}{c(w_1, w_2)} = \cfrac{c(w_1, w_2, w_3)}{N}$
- So estimate of $\text{PMI}\big((w_1, w_2), w_3\big)$ is
- $\text{PMI}\big((w_1, w_2), w_3\big) = \log \left( \cfrac{c(w_1, w_2, w_3)}{N} \cdot \cfrac{N}{c(w_1, w_2)} \cdot \cfrac{N}{c(w_3)} \right) = \log \cfrac{N \cdot c(w_1, w_2, w_3)}{c(w_1, w_2) \, c(w_3)}$
For $n$-grams:
- $\text{PMI}\big((w_1, \, ... \, , w_n), w_{n+1} \big) = \log \cfrac{P(w_1, \ ... \ , w_{n+1})}{P(w_1, \ ... \ , w_n)\, P(w_{n+1})}$
- Estimate:
- $\text{PMI}\big((w_1, \, ... \, , w_n), w_{n+1} \big) = \log \cfrac{N \cdot c(w_1, \ ... \ , w_{n+1})}{c(w_1, \ ... \ , w_n) \, c(w_{n+1})}$
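A minimal sketch of this count-based estimate for arbitrary lengths; here the hypothetical helper scores every adjacent $n$-gram of the corpus, with its first $n-1$ words as the prefix and its last word playing the role of $w_{n+1}$ in the formula above:

```python
import math
from collections import Counter

def ngram_pmi(tokens, n=3):
    """Estimate PMI((w1, ..., w_{n-1}), w_n) for every adjacent n-gram in the corpus."""
    N = len(tokens)
    unigrams = Counter(tokens)
    ngrams = Counter(zip(*[tokens[i:] for i in range(n)]))        # counts c(w1, ..., wn)
    prefixes = Counter(zip(*[tokens[i:] for i in range(n - 1)]))  # counts c(w1, ..., w_{n-1})
    return {
        gram: math.log(N * count / (prefixes[gram[:-1]] * unigrams[gram[-1]]))
        for gram, count in ngrams.items()
    }
```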