# ML Wiki

## Statistical Language Models

A statistical language model (Language Model, or LM, for short) is a probability distribution over sequences of words (i.e. over sentences)

It assigns a probability to any sentence: the probability of seeing that sentence according to the LM

Language Models can be context dependent

For example:

• $p_1 = P(\text{"Today is Wednesday"}) = 0.001$
• $p_2 = P(\text{"The equation has a solution"}) = 0.00001$

Although both sentences are valid, sentence 1 is more likely under this LM: for example, because the LM models general conversation rather than conversation at a math conference

### Generative Models

Language models are generative models for text:

• given an LM we can sample sentences according to the distribution and obtain a text sample
• so we can "generate" text (see the sketch below)
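
A minimal sketch of this idea, assuming the simplest case where each word is drawn independently from a made-up distribution (the vocabulary and probabilities are invented for illustration):

```python
import random

# toy word distribution (made-up probabilities, for illustration only)
vocab = ["today", "is", "wednesday", "the", "equation", "has", "a", "solution"]
probs = [0.20, 0.20, 0.10, 0.15, 0.05, 0.10, 0.15, 0.05]

# "generate" a text sample by drawing words according to the distribution
sample = random.choices(vocab, weights=probs, k=5)
print(" ".join(sample))
```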

### Usage

LMs:

• provide a way to quantify the uncertainty associated with natural languages
• allow us to answer Information Retrieval related questions, e.g.
• how likely a user is to use a query with the word "baseball" if they want to find information about sports

## Building Language Models

We can't enumerate all possible sentences, so we need to make some assumptions to simplify the model

Notation:

• $V = \{w_1, \ ... \ , w_N\}$ is the vocabulary, $|V| = N$

## Unigrams

This is the simplest LM

• a word sequence is generated by sampling each word independently
• Assumption: for a sequence $[w_1, \ ... \ , w_n]$, $P(w_1, \ ... \ , w_n) = \prod_{i} P(w_i)$

Let $\theta$ be a Unigram LM

• there are $|V|$ parameters in $\theta$
• condition: $\sum_{i} P(w_i \mid \theta) = 1$
• so a unigram LM specifies a Multinomial Distribution over words
• according to unigram models, sentences with high-probability words have higher probability (see the sketch below)
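
A minimal sketch of scoring a sentence under a unigram LM, assuming a made-up parameter vector $\theta$ (the probabilities and function name are invented for illustration):

```python
import math

# made-up unigram parameters P(w | theta); they sum to 1 over the vocabulary
theta = {"today": 0.3, "is": 0.3, "wednesday": 0.2, "the": 0.1, "equation": 0.1}

def sentence_log_prob(words, theta):
    # log P(w_1, ..., w_n) = sum_i log P(w_i | theta) under the independence assumption
    return sum(math.log(theta[w]) for w in words)

# sentences made of high-probability words get higher (log) probability
print(sentence_log_prob(["today", "is", "wednesday"], theta))
```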

High probability words also can indicate topics

• e.g. $\theta_1$ about Text Mining
• $\theta_2$ about food
• if a document $D$ is a text mining paper, then we expect $P(D \mid \theta_1) > P(D \mid \theta_2)$

### Inference

Suppose we have $D$ generated by some unigram model $\theta$

• and we want to learn what the parameters of $\theta$ are: estimate $P(w \mid \theta)$ for each $w \in D$
• this is a Parameter Estimation Problem
• Maximum Likelihood Estimation (MLE) is a possible solution: $\hat \theta = \operatorname{arg max}_{\theta} P(D \mid \theta)$
• since the unigram LM is a multinomial distribution, the MLE is (see also the counting sketch below):
• $P(w \mid \hat \theta) = \cfrac{c(w, D)}{|D|}$ where $c(w, D)$ is the count of word $w$ in document $D$ and $|D|$ is the total number of words in the document
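
A minimal counting sketch of this estimate (the document and function name are invented for illustration):

```python
from collections import Counter

def mle_unigram(document):
    # P(w | theta_hat) = c(w, D) / |D|: relative frequency of w in D
    counts = Counter(document)
    return {w: c / len(document) for w, c in counts.items()}

D = "today is wednesday and today is sunny".split()
print(mle_unigram(D))  # e.g. P("today" | theta_hat) = 2/7
```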

How is this formula derived?

• consider log likelihood function: $\log P(D \mid \theta) = \sum_{w \in V} c(w, D) \, \log P(w \mid \theta)$
• we want to maximize it s.t. $P(w \mid \theta)$ is a Probability Distribution i.e. $\sum_{w \in V} P(w \mid \theta) = 1$
• use Lagrange Multipliers to convert this constrained optimization problem into an unconstrained one
• so let $L(\theta, \lambda) = \log P(D \mid \theta) + \lambda \left(1 - \sum P(w \mid \theta) \right) = \sum_{w \in V} c(w, D) \, \log P(w \mid \theta) + \lambda \left(1 - \sum P(w \mid \theta) \right)$
• setting $\partial L / \partial P(w \mid \theta) = \cfrac{c(w, D)}{P(w \mid \theta)} - \lambda = 0$ gives $P(w \mid \theta) = \cfrac{c(w, D)}{\lambda}$
• the constraint then forces $\lambda = \sum_{w \in V} c(w, D) = |D|$, so $P(w \mid \hat \theta) = \cfrac{c(w, D)}{|D|}$
• MLE may overfit the data: it will assign 0 probabilities to words it hasn't seen
• What can we do about it?
• Bayesian Parameter Estimation can both maximize the data likelihood and incorporate the prior belief to "smooth" the estimate
• use MAP: Maximum A Posteriori Estimation:
• $\hat \theta = \operatorname{arg max}_{\theta} P(\theta \mid D) = \operatorname{arg max}_{\theta} P(D \mid \theta) \, P(\theta)$
• so we can define some prior $P(\theta)$, and depending on the choice of prior, we'd have different estimators
• if the prior prefers models that don't assign 0 probability to any $w$, then at the end we won't have 0 entries
• adjusting MLE to avoid 0 probabilities is called "smoothing" - it's a form of regularization (a simple variant is sketched below)
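
A minimal sketch of one common smoothing choice, add-one (Laplace) smoothing, which can be read as a Bayesian estimate under a uniform Dirichlet prior (the vocabulary, document, and function name are invented for illustration):

```python
from collections import Counter

def laplace_unigram(document, vocabulary):
    # P(w | theta_hat) = (c(w, D) + 1) / (|D| + |V|): unseen words no longer get 0
    counts = Counter(document)
    return {w: (counts[w] + 1) / (len(document) + len(vocabulary)) for w in vocabulary}

V = ["today", "is", "wednesday", "baseball"]
D = "today is wednesday".split()
print(laplace_unigram(D, V))  # "baseball" is unseen yet gets probability 1/7
```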

## N-grams

Unigrams make unrealistic assumptions about word occurrence:

• generally word order matters

Bigram models:

• have the Markov Property: each word depends only on the previous one
• $P(w_1, \ ... \ , w_n) = P(w_1) \cdot \prod_i P(w_i \mid w_{i-1})$
• can capture local dependencies between adjacent words (see the sketch below)
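
A minimal sketch of a bigram LM estimated by counting from a toy corpus (the corpus and function name are invented; a real model would also need smoothing for unseen bigrams):

```python
import math
from collections import Counter

corpus = "today is wednesday today is sunny".split()

# c(w) for P(w_1) and c(w_{i-1}, w_i) for P(w_i | w_{i-1})
unigram_counts = Counter(corpus)
bigram_counts = Counter(zip(corpus, corpus[1:]))

def bigram_log_prob(words):
    # log P(w_1, ..., w_n) = log P(w_1) + sum_i log P(w_i | w_{i-1})
    logp = math.log(unigram_counts[words[0]] / len(corpus))
    for prev, cur in zip(words, words[1:]):
        logp += math.log(bigram_counts[(prev, cur)] / unigram_counts[prev])
    return logp

print(bigram_log_prob(["today", "is", "wednesday"]))
```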

Usage:

• better than unigrams in complex NLP tasks such as Machine Translation, Speech Recognition, etc
• not much used in IR: the IR task is relatively easy and these models usually don't give much performance gain over unigrams

## Evaluation of Language Models

### Indirect Way

Evaluate not the LM itself, but the application where the LM is used

• if the LM is used in an IR system, evaluate the performance of this IR system using different LMs (e.g. with Precision and Recall)

### Direct Way

Standalone evaluation of LM

## Structured Models

There are also models that take into account the structure of natural languages

• for example, Probabilistic Context-Free Grammars allow us to capture long-range dependencies
• these models are complex and require a lot of data to estimate the parameters
• usually simpler models work just as well