# ML Wiki

## Smoothing for Language Models

It's a form of Regularization for Statistical Language Models

## Parameter Estimation

Suppose $\theta$ is a Unigram Statistical Language Model

• so $\theta$ follows Multinomial Distribution
• $D$ is a document consisting of words: $D = \{w_1, \ ... \ , w_m \}$
• $V$ is the vocabulary of the model: $V = \{ w_1, \ ... \ , w_M \}$
• By the unigram model, each word is independent, so
• $P(D \mid \theta) = \prod_i P(w_i \mid \theta) = \prod_{w \in V} P(w \mid \theta)^{c(w, D)}$
• where $c(w, D)$ is the term frequency: how many times $w$ occurs in $D$ (see also TF-IDF)
• how do we estimate $P(w \mid \theta)$?

With MLE, we have:

$\hat p_\text{ML} (w \mid \theta) = \cfrac{c(w, D)}{\sum_{w \in V} c(w, D)} = \cfrac{c(w, D)}{| D |}$

No smoothing

• MLE may overfit the data: it will assign 0 probabilities to words it hasn't seen
• What to do with it?
• Bayesian Parameter Estimation can both maximize the data likelihood and incorporate the prior belief to "smooth" the estimate
• use MAP: Maximum A Posteriori Estimation:
• $\hat \theta = \operatorname{arg max}_{\theta} P(\theta \mid D) = \operatorname{arg max}_{\theta} P(D \mid \theta) \, P(\theta)$
• so we can define some prior $P(\theta)$, and depending on the choice of prior, we'd have different estimators
• if the prior prefers models that don't assign 0 probability to any $w$, then at the end we won't have 0 entries
• adjusting MLE to avoid 0 probability is called "smoothing" - it's a form of regularization

## Interpolation Smoothing

Discount some probability mass of seen words

• then discounted probability is shared between all words: seen and unseen
• so it's some sort of interpolation between LME probabilities and prior/collection model

• $\hat p_\lambda (w \mid \theta) = \cfrac{c(w, D) + \lambda}{\sum_{w \in V} c(w, D)} = \cfrac{c(w, D)}{| D | + \lambda \, | V | }$
• so it gives the same probability mass $\cfrac{\lambda}{|D| + \lambda \, |V|}$ to all unseen words

If $\lambda = 1$ then we have "+1 Smoothing"

### Collection Smoothing

• Additive smoothing gives the same probability mass $\cfrac{\lambda}{|D| + \lambda \, |V|}$ to all unseen words
• it may not be what we want: maybe we want to give more or less weight to certain words
• so the idea is to have some reference language model
• if we have a corpus, then we can use this corpus to learn the LM on the entire corpus
• such corpus LM is called "Collection LM" or "Background LM"

#### Collection LM

There are two ways of building the Collection LM:

• let $P(w \mid C)$ denote the collection LM

1) Each word contributes equally

• $P(w \mid C) = \cfrac{\sum_{D \in C} c(w, D)}{ \sum_{D \in C} |D|}$
• it's the same as if we concatenated all documents in $C$ into one
• "Macro-averaging"

2) Each document contribute equally

• $P(w \mid C) = \cfrac{1}{N} \sum_{D \in C} \cfrac{c(w, D)}{ |D|}$
• average contribution of each doc
• "Micro-averaging"

Approach (1) is more popular than (2)

#### Smoothing with Collection LM

Once we learned the Collection LM, we can use it to smooth the probabilities:

$$P(w \mid \hat \theta) = \begin{cases} P_{\lambda} (w \mid \theta) & \text{ if } w \in D\\ \alpha \cdot P(w \mid C) & \text{ else } \end{cases}$$

Where

• $P_{\lambda} (w \mid \theta)$ smoothed probabilities (with Laplace Smoothing)
• $\alpha$ coefficient that controls how much prob. mass is assigned to unseen words
• One way: $\alpha = \cfrac{1 - \sum_{w : \ c(w, D) > 0} P_{\lambda}(w \mid \theta) }{1 - \sum_{w : \ c(w, D) > 0} P(w \mid \theta)}$
• When we're doing Laplace smoothing, we take some probability mass from each seen words and re-distribute it evenly
• here we distribute it according to Collection LM

### Jelinek-Mercer Smoothing

Or "Fixed Coefficient Interpolation"

Interpolate MLE with the collection LM

• use some coefficient of interpolation $\beta$
• $P_\beta(w \mid \hat \theta) = (1 - \beta) \, \cfrac{c(w, D)}{|D|} + \beta \, P(w \mid C)$

### Dirichlet Prior Smoothing

It's a Bayesian Smoothing with special prior: Dirichlet Distribution

• $\text{Dir}(\theta \mid \boldsymbol \alpha) = \cfrac{\Gamma \left( \sum_{i} \alpha_i \right)}{\prod_i \Gamma(\alpha_1)} \cdot \prod_i \theta_{i}^{\alpha_i - 1}$
• params: $\boldsymbol \alpha = (\alpha_1, \ ... \ , \alpha_{|V|})$
• let $\alpha_i = \mu \cdot P(w_i \mid C)$, $\mu$ - param, $P(w_i \mid C)$

Dirichlet is a Conjugate Prior for Multinomial Distribution

• it means that the prior has the same functional form as the likelihood

Posterior:

• $P(\theta \mid D) \propto \prod_{w \in V} P(w \mid \theta)^{c(w, D) + \mu \, P(w \mid C) - 1}$
• posterior is also Dirichlet distribution with $\alpha_i = c(w_i, D) + \mu \, P(w \mid C)$

Dirichlet Smoothing:

• $P_\mu(w \mid \hat \theta) = \cfrac{c(w, D) + \mu \, P(w \mid C)}{|D| + \mu} = \cfrac{|D|}{|D| + \mu} \cdot \cfrac{c(w, D)}{|D|} + \cfrac{\mu}{\mu + |D|} \cdot P(w \mid C)$
• Compare with Jelinek-Mercer: same if $\beta = \cfrac{\mu}{\mu + |D|}$

"Eventually, data overrides the prior":

• for a fixed $\mu$ longer documents will get less smoothing
• as $|D| \to \infty$, smoothing $\to 0$

Notes:

• the smoothing adds a pseudo count $\mu P(w \mid C)$ to each word
• thus Additive Smoothing is a special case of Dirichlet smoothing with uniform Collection LM

### Absolute Discounting Smoothing

• $P_\delta (w \mid \hat \theta) = \cfrac{\max \big( c(w, D) - \delta, 0 \big) }{\sum_{w' \in V} c(w', D)} + \sigma \, P(w \mid C)$
• $\delta \in [0, 1]$ discounting factor
• $\sigma = \delta \cfrac{|D|_U}{|D|}$ where $|D|_U$ is number of unique terms in $D$ and $|D|$ is total word count

## Backoff

Interpolation:

• discount some probability mass from seen words, reassing it to both seen and unseen
• problem with this approach: some words may end up with counts even higher than the original
• for example, if a word is frequent in the collection LM

Alternative Strategy: Back Off

## Other Smoothing Methods

### Good-Turing Smoothing

Idea:

• # of unseen events = # of "singletons": words that occur only once
• let $\hat{c}(w, D)$ be the adjusted count of $w$
• then $P(w \mid \hat \theta) = \cfrac{\hat{c}(w, D)}{|D|}$

What is $\hat{c}(w, D)$?

• let $n_r$ denote # of words that occur $r$ times in $D$
• then the adjusted is done via:
• $\hat{c}(w, D) \, n_{c(w, D)} = \big(c(w, D) + 1\big) \, n_{c(w, D) + 1}$

Intuition

• let's pretend that none of the singletones were observed
• use this to estimate the total # of unseen words

Improvements:

• Gale, William, and Geoffrey Sampson. "Good-Turing smoothing without tears." 1995. [1]

## Smoothing vs TF-IDF

Smoothing and TF-IDF are connected

• also see probabilistic justification for TF-IDF in
• Hiemstra, Djoerd. "A probabilistic justification for using tf× idf term weighting in information retrieval." 2000. [2]

Let's derive a query retrieval function using the smoothed log likelihood:

• $Q$ is a query
• assuming the general smoothing scheme: (comparing $Q$ with each $D$)
• $\log P(Q \mid \theta) = \sum_{w \in V} c(w, Q) \, \log P(w \mid \theta) = \sum_{w \in D} c(w, Q) \, \log P_S(w \mid \theta) + \sum_{w \not \in D} c(w, Q) \, \alpha \log P(w \mid \theta)$
• $\sum_{w \not \in D} c(w, Q) \, \alpha \log P(w \mid \theta) = \sum_{w \in V} c(w, Q) \, \alpha \log P(w \mid \theta) - \sum_{w \in D} c(w, Q) \, \alpha \log P(w \mid \theta)$:
• words that are not in the document = all words - words that are in the document
• let's regroup it:
• $\log P(Q \mid \theta) = \sum_{w \in D} c(w, Q) \, \log \cfrac{P_S (w \mid \theta) }{\alpha \, P(w \mid C)} + |Q| \, \log \alpha + \sum_{w \in V} c(w, Q) \, \alpha \log P(w \mid \theta) = \ ...$
• can ignore the last term $\sum_{w \in V} c(w, Q) \, \alpha \log P(w \mid \theta)$ because it will not affect the ranking
• thus we're left with
• $\log P(Q \mid \theta) \mathop{=}\limits^{\text{rank}} \sum_{w \in D} c(w, Q) \, \log \cfrac{P_S (w \mid \theta) }{\alpha \, P(w \mid C)} + |Q| \, \log \alpha$

Observe:

• form of this smoothed retrieval function is similar to TF-IDF:
• first term: $\sum_{w \in D} c(w, Q) \, \log \cfrac{P_S (w \mid \theta) }{\alpha \, P(w \mid C)}$
• it sums over all matched terms - ones with $c(w, Q) > 0$
• $P_S (w \mid \theta)$ would be larger for words with high TF ($\approx$ TF heuristic)
• frequent items in collection would have high $P(w \mid C)$ and thus smaller overall weight ($\approx$ IDF heuristic)

## Other Smoothing Ideas

### Clustering / KNN Smoothing

Smoothing all documents from $C$ with the same collection LM may be not the most optimal approach

• maybe need to try more "individual" approaches

Can try:

• cluster all documents prior indexing, build a cluster LM for each cluster, and then smooth documents using their associated cluster LM
• find KNN docs, and then smooth using them

## References

• Chen, Stanley F., and Joshua Goodman. "An empirical study of smoothing techniques for language modeling." 1996 [3] and 1999 [4].

## Sources

• Zhai, ChengXiang. "Statistical language models for information retrieval." 2008.