Smoothing for Language Models

It's a form of Regularization for Statistical Language Models


Parameter Estimation

Suppose $\theta$ is a Unigram Statistical Language Model

  • so $\theta$ follows Multinomial Distribution
  • $D$ is a document consisting of words: $D = \{w_1, \ ... \ , w_m \}$
  • $V$ is the vocabulary of the model: $V = \{ w_1, \ ... \ , w_M \}$
  • By the unigram model, each word is independent, so
  • $P(D \mid \theta) = \prod_i P(w_i \mid \theta) = \prod_{w \in V} P(w \mid \theta)^{c(w, D)}$
  • where $c(w, D)$ is the term frequency: how many times $w$ occurs in $D$ (see also TF-IDF)
  • how do we estimate $P(w \mid \theta)$?

With MLE, we have:

$\hat p_\text{ML} (w \mid \theta) = \cfrac{c(w, D)}{\sum_{w \in V} c(w, D)} = \cfrac{c(w, D)}{| D |}$

No smoothing


Smoothing

  • MLE may overfit the data: it will assign 0 probabilities to words it hasn't seen
  • What to do with it?
  • Bayesian Parameter Estimation can both maximize the data likelihood and incorporate the prior belief to "smooth" the estimate
  • use MAP: Maximum A Posteriori Estimation:
  • $\hat \theta = \operatorname{arg max}_{\theta} P(\theta \mid D) = \operatorname{arg max}_{\theta} P(D \mid \theta) \, P(\theta)$
  • so we can define some prior $P(\theta)$, and depending on the choice of prior, we'd have different estimators
  • if the prior prefers models that don't assign 0 probability to any $w$, then at the end we won't have 0 entries
  • adjusting MLE to avoid 0 probability is called "smoothing" - it's a form of regularization


Interpolation Smoothing

Discount some probability mass of seen words

  • then discounted probability is shared between all words: seen and unseen
  • so it's some sort of interpolation between LME probabilities and prior/collection model


Additive Smoothing

Laplace Smoothing (or Additive Smoothing):

  • $\hat p_\lambda (w \mid \theta) = \cfrac{c(w, D) + \lambda}{\sum_{w \in V} c(w, D)} = \cfrac{c(w, D)}{| D | + \lambda \, | V | }$
  • so it gives the same probability mass $\cfrac{\lambda}{|D| + \lambda \, |V|}$ to all unseen words

If $\lambda = 1$ then we have "+1 Smoothing"


Collection Smoothing

  • Additive smoothing gives the same probability mass $\cfrac{\lambda}{|D| + \lambda \, |V|}$ to all unseen words
  • it may not be what we want: maybe we want to give more or less weight to certain words
  • so the idea is to have some reference language model
  • if we have a corpus, then we can use this corpus to learn the LM on the entire corpus
  • such corpus LM is called "Collection LM" or "Background LM"


Collection LM

There are two ways of building the Collection LM:

  • let $P(w \mid C)$ denote the collection LM


1) Each word contributes equally

  • $P(w \mid C) = \cfrac{\sum_{D \in C} c(w, D)}{ \sum_{D \in C} |D|}$
  • it's the same as if we concatenated all documents in $C$ into one
  • "Macro-averaging"


2) Each document contribute equally

  • $P(w \mid C) = \cfrac{1}{N} \sum_{D \in C} \cfrac{c(w, D)}{ |D|}$
  • average contribution of each doc
  • "Micro-averaging"

Approach (1) is more popular than (2)


Smoothing with Collection LM

Once we learned the Collection LM, we can use it to smooth the probabilities:

$$P(w \mid \hat \theta) = \begin{cases} P_{\lambda} (w \mid \theta) & \text{ if } w \in D\\ \alpha \cdot P(w \mid C) & \text{ else } \end{cases}$$


Where

  • $P_{\lambda} (w \mid \theta)$ smoothed probabilities (with Laplace Smoothing)
  • $\alpha$ coefficient that controls how much prob. mass is assigned to unseen words
  • One way: $\alpha = \cfrac{1 - \sum_{w : \ c(w, D) > 0} P_{\lambda}(w \mid \theta) }{1 - \sum_{w : \ c(w, D) > 0} P(w \mid \theta)}$
  • When we're doing Laplace smoothing, we take some probability mass from each seen words and re-distribute it evenly
  • here we distribute it according to Collection LM


Jelinek-Mercer Smoothing

Or "Fixed Coefficient Interpolation"

Interpolate MLE with the collection LM

  • use some coefficient of interpolation $\beta$
  • $P_\beta(w \mid \hat \theta) = (1 - \beta) \, \cfrac{c(w, D)}{|D|} + \beta \, P(w \mid C)$


Dirichlet Prior Smoothing

It's a Bayesian Smoothing with special prior: Dirichlet Distribution

  • $\text{Dir}(\theta \mid \boldsymbol \alpha) = \cfrac{\Gamma \left( \sum_{i} \alpha_i \right)}{\prod_i \Gamma(\alpha_1)} \cdot \prod_i \theta_{i}^{\alpha_i - 1}$
  • params: $\boldsymbol \alpha = (\alpha_1, \ ... \ , \alpha_{|V|})$
  • let $\alpha_i = \mu \cdot P(w_i \mid C)$, $\mu$ - param, $P(w_i \mid C)$


Dirichlet is a Conjugate Prior for Multinomial Distribution

  • it means that the prior has the same functional form as the likelihood


Posterior:

  • $P(\theta \mid D) \propto \prod_{w \in V} P(w \mid \theta)^{c(w, D) + \mu \, P(w \mid C) - 1}$
  • posterior is also Dirichlet distribution with $\alpha_i = c(w_i, D) + \mu \, P(w \mid C)$


Dirichlet Smoothing:

  • $P_\mu(w \mid \hat \theta) = \cfrac{c(w, D) + \mu \, P(w \mid C)}{|D| + \mu} = \cfrac{|D|}{|D| + \mu} \cdot \cfrac{c(w, D)}{|D|} + \cfrac{\mu}{\mu + |D|} \cdot P(w \mid C)$
  • Compare with Jelinek-Mercer: same if $\beta = \cfrac{\mu}{\mu + |D|}$


"Eventually, data overrides the prior":

  • for a fixed $\mu$ longer documents will get less smoothing
  • as $|D| \to \infty$, smoothing $\to 0$


Notes:

  • the smoothing adds a pseudo count $\mu P(w \mid C)$ to each word
  • thus Additive Smoothing is a special case of Dirichlet smoothing with uniform Collection LM


Absolute Discounting Smoothing

  • $P_\delta (w \mid \hat \theta) = \cfrac{\max \big( c(w, D) - \delta, 0 \big) }{\sum_{w' \in V} c(w', D)} + \sigma \, P(w \mid C)$
  • $\delta \in [0, 1]$ discounting factor
  • $\sigma = \delta \cfrac{|D|_U}{|D|}$ where $|D|_U$ is number of unique terms in $D$ and $|D|$ is total word count


Backoff

Interpolation:

  • discount some probability mass from seen words, reassing it to both seen and unseen
  • problem with this approach: some words may end up with counts even higher than the original
  • for example, if a word is frequent in the collection LM


Alternative Strategy: Back Off


Other Smoothing Methods

Good-Turing Smoothing

Idea:

  • # of unseen events = # of "singletons": words that occur only once
  • let $\hat{c}(w, D)$ be the adjusted count of $w$
  • then $P(w \mid \hat \theta) = \cfrac{\hat{c}(w, D)}{|D|}$


What is $\hat{c}(w, D)$?

  • let $n_r$ denote # of words that occur $r$ times in $D$
  • then the adjusted is done via:
  • $\hat{c}(w, D) \, n_{c(w, D)} = \big(c(w, D) + 1\big) \, n_{c(w, D) + 1}$


Intuition

  • let's pretend that none of the singletones were observed
  • use this to estimate the total # of unseen words


Improvements:

  • Gale, William, and Geoffrey Sampson. "Good-Turing smoothing without tears." 1995. [1]


Smoothing vs TF-IDF

Smoothing and TF-IDF are connected

  • also see probabilistic justification for TF-IDF in
    • Hiemstra, Djoerd. "A probabilistic justification for using tf× idf term weighting in information retrieval." 2000. [2]


Let's derive a query retrieval function using the smoothed log likelihood:

  • $Q$ is a query
  • assuming the general smoothing scheme: (comparing $Q$ with each $D$)
  • $\log P(Q \mid \theta) = \sum_{w \in V} c(w, Q) \, \log P(w \mid \theta) = \sum_{w \in D} c(w, Q) \, \log P_S(w \mid \theta) + \sum_{w \not \in D} c(w, Q) \, \alpha \log P(w \mid \theta)$
  • $\sum_{w \not \in D} c(w, Q) \, \alpha \log P(w \mid \theta) = \sum_{w \in V} c(w, Q) \, \alpha \log P(w \mid \theta) - \sum_{w \in D} c(w, Q) \, \alpha \log P(w \mid \theta)$:
    • words that are not in the document = all words - words that are in the document
  • let's regroup it:
  • $\log P(Q \mid \theta) = \sum_{w \in D} c(w, Q) \, \log \cfrac{P_S (w \mid \theta) }{\alpha \, P(w \mid C)} + |Q| \, \log \alpha + \sum_{w \in V} c(w, Q) \, \alpha \log P(w \mid \theta) = \ ...$
  • can ignore the last term $\sum_{w \in V} c(w, Q) \, \alpha \log P(w \mid \theta)$ because it will not affect the ranking
  • thus we're left with
  • $\log P(Q \mid \theta) \mathop{=}\limits^{\text{rank}} \sum_{w \in D} c(w, Q) \, \log \cfrac{P_S (w \mid \theta) }{\alpha \, P(w \mid C)} + |Q| \, \log \alpha$


Observe:

  • form of this smoothed retrieval function is similar to TF-IDF:
  • first term: $\sum_{w \in D} c(w, Q) \, \log \cfrac{P_S (w \mid \theta) }{\alpha \, P(w \mid C)}$
  • it sums over all matched terms - ones with $c(w, Q) > 0$
  • $P_S (w \mid \theta)$ would be larger for words with high TF ($\approx$ TF heuristic)
  • frequent items in collection would have high $P(w \mid C)$ and thus smaller overall weight ($\approx$ IDF heuristic)


Other Smoothing Ideas

Clustering / KNN Smoothing

Smoothing all documents from $C$ with the same collection LM may be not the most optimal approach

  • maybe need to try more "individual" approaches


Can try:

  • cluster all documents prior indexing, build a cluster LM for each cluster, and then smooth documents using their associated cluster LM
  • find KNN docs, and then smooth using them



References

  • Chen, Stanley F., and Joshua Goodman. "An empirical study of smoothing techniques for language modeling." 1996 [3] and 1999 [4].


Sources

  • Zhai, ChengXiang. "Statistical language models for information retrieval." 2008.