ML Wiki

Statistical Language Models

A statistical language model (Language Model for short) is a probability distribution over sequences of words (i.e. over sentences)

It assigns any sentence a probability it's a probability of seeing a sentence according to the LM

Language Models can be context dependent

For example:

$p_1 = P(\text{"Today is Wednesday"}) = 0.001$
$p_2 = P(\text{"The equation has a solution"}) = 0.00001$

Although both sentences are valid, sentence 1 is more likely: for example, because the LM can model some general conversations rather than conversations on a math conference

Generative Models

Language models are generative models for text:

given a LM we can sample sentences according to the distribution and obtain a text sample
so we can "generate" text

Usage

LMs:

provide a way to quantify the uncertainly associated with natural languages
allow to answer Information Retrieval related questions e.g.
- how likely would a user use a query with word "baseball" if he wants to find about sport

Building Language Models

Can't enumerate all possible sentences, so need to make some assumptions to simplify the model

Notation:

$V = \{w_1, \ ... \ , w_N\}$ is the vocabulary, $|V| = N$

Unigrams

This is the simplest LM

a word sequence is generated by sampling each word independently
Assumption: for a sequence $[w_1, \ ... \ , w_n]$, $P(w_1, \ ... \ , w_n) = \prod_{i} P(w_i)$

Let $\theta$ be a Unigram LM

there are $|V|$ params in $\theta$
condition: $\sum p(w_i \mid \theta) = 1$
so unigram LM specifies a Multinomial Distribution over words
according to unigram models, sentences with high-probability words have higher probability

High probability words also can indicate topics

e.g. $\theta_1$ about Text Mining
$\theta_2$ about food
if a document $D$ is a text mining paper, than we expect $P(D \mid \theta_1) > P(D \mid \theta_2)$

Inference

Suppose we have $D$ generated by some unigram model $\theta$

and we want to learn that are the parameters of $\theta$: estimate $P(w \mid \theta)$ for each $w \in D$
this is a Parameter Estimation Problem
Maximum Likelihood Estimation (MLE) is a possible solution: $\hat \theta = \operatorname{arg max}_{\theta} P(D \mid \theta)$
since unigrams is a multinomial probability distribution, MLE is:
$P(w \mid \hat \theta) = \cfrac{c(w, D)}{|D|}$ where $c(w, D)$ is the count of word $w$ in document $D$ and $|D|$ is the total number of words in the document

How this formula is derived?

consider log likelihood function: $\log P(D \mid \theta) = \sum_{w \in V} c(w, D) \, \log P(w \mid \theta)$
we want to maximize it s.t. $P(w \mid \theta)$ is a Probability Distribution i.e. $\sum_{w \in V} P(w \mid \theta) = 1$
use Lagrange Multipliers to convert this constrained optimization problem into an unconstrained one
so let $L(\theta, \lambda) = \log P(D \mid \theta) + \lambda \left(1 - \sum P(w \mid \theta) \right) = \sum_{w \in V} c(w, D) \, \log P(w \mid \theta) + \lambda \left(1 - \sum P(w \mid \theta) \right)$
by solving it, we get $P(w \mid \hat \theta) = \cfrac{c(w, D)}{|D|}$

Smoothing for Language Models

MLE may overfit the data: it will assign 0 probabilities to words it hasn't seen
What to do with it?
Bayesian Parameter Estimation can both maximize the data likelihood and incorporate the prior belief to "smooth" the estimate
use MAP: Maximum A Posteriori Estimation:
$\hat \theta = \operatorname{arg max}_{\theta} P(\theta \mid D) = \operatorname{arg max}_{\theta} P(D \mid \theta) \, P(\theta)$
so we can define some prior $P(\theta)$, and depending on the choice of prior, we'd have different estimators
if the prior prefers models that don't assign 0 probability to any $w$, then at the end we won't have 0 entries
adjusting MLE to avoid 0 probability is called "smoothing" - it's a form of regularization

Usage

Popular choice in Information Retrieval

N-grams

Unigrams make unrealistic assumptions about word occurrence:

generally word order matters

Bigram models:

Have the Markov Property:
$P(w_1, \ ... \ , w_n) = P(w_1) \cdot \prod_i P(w_i \mid w_{i-1})$
can capture local dependency between two adjacent words

Usage:

better than unigrams in complex NLP tasks such as Machine Translation, Speech Recognition, etc
not much used in IR: the IR task is relatively easy and these models usually don't give much performance gain over unigrams

Evaluation of Language Models

Indirect Way

Evaluate not the LM itself, but the application where LM is used

if the LM is used in an IR system, evaluate the performance of this IR system using different LMs (e.g. with Precision and Recall)

Direct Way

Standalone evaluation of LM

Structured Models

There are also models that take into account the structure of natural languages

for example Probabilistic Context-Free Grammars allow to capture remote dependencies
these models are complex and require a lot of data to estimate the parameters
usually simpler models work just as good

Sources

Information Retrieval (UFRT)
Zhai, ChengXiang. "Statistical language models for information retrieval." 2008.

Statistical Language Models

Contents

Statistical Language Models

Generative Models

Usage

Building Language Models

Unigrams

Inference

Usage

N-grams

Evaluation of Language Models

Indirect Way

Direct Way

Structured Models

Sources