Statistical Language Models
A statistical language model ("Language Model", or LM, for short) is a probability distribution over sequences of words (i.e. over sentences)
It assigns a probability to any sentence
i.e. the probability of seeing that sentence according to the LM
Language Models can be context dependent
For example:
- $p_1 = P(\text{"Today is Wednesday"}) = 0.001$
- $p_2 = P(\text{"The equation has a solution"}) = 0.00001$
Although both sentences are valid, the first one is more likely: for example, because the LM models general conversations rather than conversations at a math conference
Language models are generative models for text:
- given an LM, we can sample sentences according to the distribution and obtain a text sample
- so we can "generate" text
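For example, a minimal sketch of this idea in Python (the sentences and probabilities below are made up, echoing $p_1$ and $p_2$ above):

```python
# Minimal sketch: treat the LM as an explicit (toy) distribution over a few
# sentences and sample from it. The probabilities below are made up.
import random

lm = {
    "Today is Wednesday": 0.001,
    "The equation has a solution": 0.00001,
    "Today is sunny": 0.002,
}

sentences = list(lm)
weights = [lm[s] for s in sentences]

# random.choices normalizes the weights, so this samples a small "text"
# of 5 sentences according to the (truncated) distribution
print(random.choices(sentences, weights=weights, k=5))
```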
Usage
LMs:
- provide a way to quantify the uncertainty associated with natural language
- allow us to answer Information Retrieval-related questions, e.g.
- how likely is a user to use a query with the word "baseball" if they want to find out about sports?
Building Language Models
We can't enumerate all possible sentences, so we need to make some assumptions to simplify the model
Notation:
- $V = \{w_1, \ ... \ , w_N\}$ is the vocabulary, $|V| = N$
Unigrams
This is the simplest LM
- a word sequence is generated by sampling each word independently
- Assumption: for a sequence $[w_1, \ ... \ , w_n]$, $P(w_1, \ ... \ , w_n) = \prod_{i} P(w_i)$
Let $\theta$ be a Unigram LM
- there are $|V|$ params in $\theta$
- condition: $\sum_{i=1}^{N} P(w_i \mid \theta) = 1$
- so a unigram LM specifies a Multinomial Distribution over words
- according to unigram models, sentences with high-probability words have higher probability
High-probability words can also indicate topics
- e.g. $\theta_1$ is about Text Mining
- $\theta_2$ is about food
- if a document $D$ is a text mining paper, then we expect $P(D \mid \theta_1) > P(D \mid \theta_2)$
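For illustration, a small sketch with two hypothetical unigram models (the word probabilities are invented; unseen words get a tiny floor probability just to keep the example short, since proper smoothing is discussed below):

```python
# Two hypothetical unigram models: theta1 leans toward text-mining vocabulary,
# theta2 toward food vocabulary.
import math

theta1 = {"the": 0.2, "text": 0.1, "mining": 0.1, "clustering": 0.05}
theta2 = {"the": 0.2, "food": 0.1, "recipe": 0.1, "delicious": 0.05}

def log_prob(doc_words, theta, floor=1e-8):
    # log P(D | theta) = sum_w c(w, D) * log P(w | theta)
    return sum(math.log(theta.get(w, floor)) for w in doc_words)

D = "the text mining clustering".split()
print(log_prob(D, theta1) > log_prob(D, theta2))  # expect True for a text mining doc
```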
Inference
Suppose we have $D$ generated by some unigram model $\theta$
- and we want to learn the parameters of $\theta$: estimate $P(w \mid \theta)$ for each $w \in V$
- this is a Parameter Estimation Problem
- Maximum Likelihood Estimation (MLE) is a possible solution: $\hat \theta = \operatorname{arg max}_{\theta} P(D \mid \theta)$
- since a unigram LM is a multinomial distribution, the MLE is:
- $P(w \mid \hat \theta) = \cfrac{c(w, D)}{|D|}$ where $c(w, D)$ is the count of word $w$ in document $D$ and $|D|$ is the total number of words in the document
How is this formula derived?
- consider log likelihood function: $\log P(D \mid \theta) = \sum_{w \in V} c(w, D) \, \log P(w \mid \theta)$
- we want to maximize it s.t. $P(w \mid \theta)$ is a Probability Distribution i.e. $\sum_{w \in V} P(w \mid \theta) = 1$
- use Lagrange Multipliers to convert this constrained optimization problem into an unconstrained one
- so let $L(\theta, \lambda) = \log P(D \mid \theta) + \lambda \left(1 - \sum_{w \in V} P(w \mid \theta) \right) = \sum_{w \in V} c(w, D) \, \log P(w \mid \theta) + \lambda \left(1 - \sum_{w \in V} P(w \mid \theta) \right)$
- setting $\partial L / \partial P(w \mid \theta) = c(w, D) / P(w \mid \theta) - \lambda = 0$ gives $P(w \mid \theta) = c(w, D) / \lambda$; the constraint then forces $\lambda = \sum_{w \in V} c(w, D) = |D|$
- so we get $P(w \mid \hat \theta) = \cfrac{c(w, D)}{|D|}$
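A quick numeric check of this formula (plain whitespace tokenization, toy document):

```python
# MLE for a unigram model: P(w | theta_hat) = c(w, D) / |D|
from collections import Counter

D = "text mining is fun and text mining is also useful".split()
counts = Counter(D)                 # c(w, D) for each word in D
total = len(D)                      # |D| = 10

mle = {w: c / total for w, c in counts.items()}
print(mle["text"])                  # 2 / 10 = 0.2
print(sum(mle.values()))            # sums to 1, as required
```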
Smoothing for Language Models
- MLE may overfit the data: it will assign 0 probabilities to words it hasn't seen
- What can we do about it?
- Bayesian Parameter Estimation can both maximize the data likelihood and incorporate the prior belief to "smooth" the estimate
- use MAP: Maximum A Posteriori Estimation:
- $\hat \theta = \operatorname{arg max}_{\theta} P(\theta \mid D) = \operatorname{arg max}_{\theta} P(D \mid \theta) \, P(\theta)$
- so we can define some prior $P(\theta)$, and depending on the choice of prior, we'd have different estimators
- if the prior prefers models that don't assign 0 probability to any $w$, then the resulting estimate will have no zero entries
- adjusting MLE to avoid 0 probability is called "smoothing" - it's a form of regularization
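For illustration, a sketch of add-$\lambda$ (additive) smoothing, one simple smoothing scheme of this kind; it can be derived as a Bayesian estimate with a symmetric Dirichlet prior (the document and vocabulary below are made up):

```python
# Add-lambda smoothing: P(w | theta_hat) = (c(w, D) + lambda) / (|D| + lambda * |V|)
# With lambda > 0, no word in the vocabulary gets zero probability.
from collections import Counter

def smoothed_unigram(doc_words, vocab, lam=1.0):
    counts = Counter(doc_words)
    total = len(doc_words)
    return {w: (counts[w] + lam) / (total + lam * len(vocab)) for w in vocab}

D = "text mining is fun".split()
V = {"text", "mining", "is", "fun", "food", "recipe"}

theta = smoothed_unigram(D, V)
print(theta["food"])          # > 0 even though "food" never occurs in D
print(sum(theta.values()))    # still a valid distribution over V (sums to 1)
```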
N-grams
Unigrams make unrealistic assumptions about word occurrence:
- generally word order matters
Bigram models:
- Have the Markov Property:
- $P(w_1, \ ... \ , w_n) = P(w_1) \cdot \prod_{i=2}^{n} P(w_i \mid w_{i-1})$
- can capture local dependency between two adjacent words
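A sketch of such a bigram model with MLE-estimated conditional probabilities (toy corpus, no smoothing):

```python
# Bigram model: P(w1, ..., wn) = P(w1) * prod_{i=2..n} P(wi | w_{i-1}),
# with P(wi | w_{i-1}) estimated as c(w_{i-1}, wi) / c(w_{i-1}).
from collections import Counter

corpus = "today is wednesday today is sunny".split()
unigram_counts = Counter(corpus)
bigram_counts = Counter(zip(corpus, corpus[1:]))

def bigram_prob(sentence):
    words = sentence.split()
    p = unigram_counts[words[0]] / len(corpus)     # P(w1)
    for prev, cur in zip(words, words[1:]):
        if unigram_counts[prev] == 0:              # unseen history -> probability 0
            return 0.0
        p *= bigram_counts[(prev, cur)] / unigram_counts[prev]
    return p

print(bigram_prob("today is wednesday"))  # (2/6) * (2/2) * (1/2) = 1/6
```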
Usage:
- better than unigrams in complex NLP tasks such as Machine Translation, Speech Recognition, etc
- not much used in IR: the IR task is relatively easy and these models usually don't give much performance gain over unigrams
Evaluation of Language Models
Indirect Way
Evaluate not the LM itself, but the application where the LM is used
- if the LM is used in an IR system, evaluate the performance of this IR system using different LMs (e.g. with Precision and Recall)
Direct Way
Standalone evaluation of the LM itself
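One standard direct metric, not named in these notes, is perplexity on held-out text: the exponentiated average negative log-likelihood per word (lower is better). A sketch for a unigram LM:

```python
# Perplexity of a unigram LM on held-out text:
# PP = exp( -(1/N) * sum_i log P(w_i | theta) ); lower is better.
import math

def perplexity(heldout_words, unigram_lm, floor=1e-8):
    # floor stands in for smoothing so that unseen words don't give log(0)
    log_likelihood = sum(math.log(unigram_lm.get(w, floor)) for w in heldout_words)
    return math.exp(-log_likelihood / len(heldout_words))

lm = {"today": 0.3, "is": 0.3, "wednesday": 0.2, "sunny": 0.2}
print(perplexity("today is sunny".split(), lm))
```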
Structured Models
There are also models that take into account the structure of natural languages
- for example, Probabilistic Context-Free Grammars allow us to capture remote (long-range) dependencies
- these models are complex and require a lot of data to estimate the parameters
- usually simpler models work just as well