Statistical Language Models

A statistical language model (Language Model for short) is a probability distribution over sequences of words (i.e. over sentences)

It assigns any sentence a probability it's a probability of seeing a sentence according to the LM

Language Models can be context dependent

For example:

  • $p_1 = P(\text{"Today is Wednesday"}) = 0.001$
  • $p_2 = P(\text{"The equation has a solution"}) = 0.00001$

Although both sentences are valid, sentence 1 is more likely: for example, because the LM can model some general conversations rather than conversations on a math conference

Generative Models

Language models are generative models for text:

  • given a LM we can sample sentences according to the distribution and obtain a text sample
  • so we can "generate" text



  • provide a way to quantify the uncertainly associated with natural languages
  • allow to answer Information Retrieval related questions e.g.
    • how likely would a user use a query with word "baseball" if he wants to find about sport

Building Language Models

Can't enumerate all possible sentences, so need to make some assumptions to simplify the model


  • $V = \{w_1, \ ... \ , w_N\}$ is the vocabulary, $|V| = N$


This is the simplest LM

  • a word sequence is generated by sampling each word independently
  • Assumption: for a sequence $[w_1, \ ... \ , w_n]$, $P(w_1, \ ... \ , w_n) = \prod_{i} P(w_i)$

Let $\theta$ be a Unigram LM

  • there are $|V|$ params in $\theta$
  • condition: $\sum p(w_i \mid \theta) = 1$
  • so unigram LM specifies a Multinomial Distribution over words
  • according to unigram models, sentences with high-probability words have higher probability

High probability words also can indicate topics

  • e.g. $\theta_1$ about Text Mining
  • $\theta_2$ about food
  • if a document $D$ is a text mining paper, than we expect $P(D \mid \theta_1) > P(D \mid \theta_2)$


Suppose we have $D$ generated by some unigram model $\theta$

  • and we want to learn that are the parameters of $\theta$: estimate $P(w \mid \theta)$ for each $w \in D$
  • this is a Parameter Estimation Problem
  • Maximum Likelihood Estimation (MLE) is a possible solution: $\hat \theta = \operatorname{arg max}_{\theta} P(D \mid \theta)$
  • since unigrams is a multinomial probability distribution, MLE is:
  • $P(w \mid \hat \theta) = \cfrac{c(w, D)}{|D|}$ where $c(w, D)$ is the count of word $w$ in document $D$ and $|D|$ is the total number of words in the document

How this formula is derived?

  • consider log likelihood function: $\log P(D \mid \theta) = \sum_{w \in V} c(w, D) \, \log P(w \mid \theta)$
  • we want to maximize it s.t. $P(w \mid \theta)$ is a Probability Distribution i.e. $\sum_{w \in V} P(w \mid \theta) = 1$
  • use Lagrange Multipliers to convert this constrained optimization problem into an unconstrained one
  • so let $L(\theta, \lambda) = \log P(D \mid \theta) + \lambda \left(1 - \sum P(w \mid \theta) \right) = \sum_{w \in V} c(w, D) \, \log P(w \mid \theta) + \lambda \left(1 - \sum P(w \mid \theta) \right)$
  • by solving it, we get $P(w \mid \hat \theta) = \cfrac{c(w, D)}{|D|}$

Smoothing for Language Models

  • MLE may overfit the data: it will assign 0 probabilities to words it hasn't seen
  • What to do with it?
  • Bayesian Parameter Estimation can both maximize the data likelihood and incorporate the prior belief to "smooth" the estimate
  • use MAP: Maximum A Posteriori Estimation:
  • $\hat \theta = \operatorname{arg max}_{\theta} P(\theta \mid D) = \operatorname{arg max}_{\theta} P(D \mid \theta) \, P(\theta)$
  • so we can define some prior $P(\theta)$, and depending on the choice of prior, we'd have different estimators
  • if the prior prefers models that don't assign 0 probability to any $w$, then at the end we won't have 0 entries
  • adjusting MLE to avoid 0 probability is called "smoothing" - it's a form of regularization



Unigrams make unrealistic assumptions about word occurrence:

  • generally word order matters

Bigram models:

  • Have the Markov Property:
  • $P(w_1, \ ... \ , w_n) = P(w_1) \cdot \prod_i P(w_i \mid w_{i-1})$
  • can capture local dependency between two adjacent words


  • better than unigrams in complex NLP tasks such as Machine Translation, Speech Recognition, etc
  • not much used in IR: the IR task is relatively easy and these models usually don't give much performance gain over unigrams

Evaluation of Language Models

Indirect Way

Evaluate not the LM itself, but the application where LM is used

  • if the LM is used in an IR system, evaluate the performance of this IR system using different LMs (e.g. with Precision and Recall)

Direct Way

Standalone evaluation of LM

Structured Models

There are also models that take into account the structure of natural languages

  • for example Probabilistic Context-Free Grammars allow to capture remote dependencies
  • these models are complex and require a lot of data to estimate the parameters
  • usually simpler models work just as good