## Multinomial Distribution

It's an extension of Binomial Distribution

- consider log likelihood function: $\log P(D \mid \theta) = \sum_{w \in V} c(w, D) \, \log P(w \mid \theta)$
- we want to maximize it s.t. $P(w \mid \theta)$ is a Probability Distribution i.e. $\sum_{w \in V} P(w \mid \theta) = 1$
- use Lagrange Multipliers to convert this constrained optimization problem into an unconstrained one
- so let $L(\theta, \lambda) = \log P(D \mid \theta) + \lambda \left(1 - \sum P(w \mid \theta) \right) = \sum_{w \in V} c(w, D) \, \log P(w \mid \theta) + \lambda \left(1 - \sum P(w \mid \theta) \right)$
- by solving it, we get $P(w \mid \hat \theta) = \cfrac{c(w, D)}{|D|}$

## Sources

- Zhai, ChengXiang. "Statistical language models for information retrieval." 2008.