# ML Wiki

## Probabilistic Retrieval Model

### Probabilistic Ranking Principle

Due to Robertson 1977

• relevance = "what is the probability that document $D$ is relevant to the query $Q$?"

### Assumptions

• ranking assumption: usefulness of a relevant document depends on the number of relevant documents the user has already seen
• the more documents we see - the less useful they are
• independence assumption: relevance of $D_i$ to $Q$ is independent to other documents $D_j$ from the collection
• therefore we can apply it for each document separately

## Relevance Function

• $R = \{ r, \lnot r \}$ a binary random variable that indicates relevance
• let $r$ represent the event that $D$ is relevant
• $\lnot r$ represent the event that $D$ is not relevant

We need estimate the probability of relevance of a document $D$ w.r.t a query $Q$

• Need to find:
• $P(R = r \mid D, Q)$ - prob that $D$ is relevant to $Q$
• $P(R = \lnot r \mid D, Q)$ - prob that $D$ is not relevant to $Q$
• So we need to estimate (learn) $P(R \mid D, Q)$

### Descriptive Model Approach

Sometimes referred as a "Machine Learning" approach

It then becomes a Classification Problem:

• can learn $P(R \mid D, Q)$ with a Descriptive Model
• $R$ depends on features that characterize the relationships between $D$ and $Q$
• for example, # of matched terms

Suppose

• will have to spend a lot of time engineering good features
• also model has to learn to rank rather than just classify
• but sometimes it will work even better than Vector Space Models
• such models can use scores from other IR techniques and combine them + use some extra ones

Literature:

• Joachims, Thorsten, et al. "Learning to rank for information retrieval." 2007. [1]
• Liu, Tie-Yan. "Learning to rank for information retrieval." 2009. [2]

### Generative Models Approach

Can use the Bayes Rule to infer the probabilities

• $P(R = r \mid D, Q) = \cfrac{P(D, Q \mid R = r) \, P(R = r)}{P(D, Q)}$
• $P(R = \lnot r \mid D, Q) = \cfrac{P(D, Q \mid R = \lnot r) \, P(R = \lnot r)}{P(D, Q)}$
• and then compare $P(R = r \mid D, Q)$ and $P(R = \lnot r \mid D, Q)$

Interpretation:

• $P(D, Q \mid R = r)$ probability that if we know that a relevant document is retrieved, it's $D$
• $P(R = r)$ probability of retrieving a relevant document
• $P(D, Q)$ probability of retrieving $D$ and issuing $Q$

Log Odds Ratio:

• it's the same as using Odds ratio:
• if $\cfrac{P(R = r \mid D, Q)}{P(R = \lnot \mid D, Q)} > 1$ or $\log \cfrac{P(R = r \mid D, Q)}{P(R = \lnot \mid D, Q)} > 0$ then $D$ is relevant w.r.t $Q$
• so again the formulated the problem as two-category Document Classification
• but we're more interested in the scores rather than class outcome

## Binary Independence Retrieval Model

BIR Model:

Documents are represented by binary vectors

• if a term is present in a document, it's 1, otherwise it's 0
• $T = \{T_i \}$ with $T_i = 1$ if $w_i$ is present in the document $D$

### Ranking

Can rank using log odds:

• use $s(Q, D) = \log \cfrac{P(R = r \mid D, Q)}{P(R = \lnot r \mid D, Q)} = \log \cfrac{P(D, Q \mid R = r) \, P(R = r)}{P(D, Q \mid R = \lnot r) \, P(R = \lnot r)}$
• $P(R = r)$ and $P(R = \lnot r)$ are just constants and will not change relative positions of documents in the rating, so let's remove them:
• $s(Q, D) = \log \cfrac{P(D, Q \mid R = r)}{P(D, Q \mid R = \lnot r)}$
• by independence have:
• $s(Q, D) = \log \cfrac{\prod P(T_i \mid Q, R = r)}{\prod P(T_i \mid Q, R = \lnot r)} = \log \prod \cfrac{P(T_i \mid Q, R = r)}{P(T_i \mid Q, R = \lnot r)} = \sum \log \cfrac{P(T_i \mid Q, R = r)}{P(T_i \mid Q, R = \lnot r)}$
• can sum only over words present in the query:
• $s(Q, D) = \sum\limits_{i \in Q} \log \cfrac{P(T_i \mid Q, R = r)}{P(T_i \mid Q, R = \lnot r)}$
• now let $p_i = P(T_1 = 1 \mid Q, R = r)$ and $q_i = P(T_1 = 1 \mid Q, R = \lnot r)$. then
• $s(Q, D) = \sum\limits_{i \in Q \cap D} \log \cfrac{p_i \, (1 - q_i)}{(1 - p_i) \, q_i}$
• let $c_i = \log \cfrac{p_i \, (1 - q_i)}{(1 - p_i) \, q_i}$, so we have
• $s(Q, D) = \sum\limits_{i \in Q \cap D} c_i$

• here we take into account only presence/absence of words
• we don't care about frequency
• we assume a Multiple Bernoulli Event Model: $P(T_i = 1 \mid Q, R) + P(T_i = 0 \mid Q, R) = 0$

### Estimation of Probabilities

How to estimate $p_i$ and $q_i$?

Robertson / Sparck Jones Formula

• based on the training IR test corpus
• and also can be account relevance feedback

Notation

• given a query $Q$ and a corpus $C$
• let $N$ be the number of documents in the corpus
• let $R$ be the number of documents relevant to $Q$
• $n_i$ # of docs that have term $w_i$
• $r_i$ # of relevant docs that have term $w_i$

Then:

• $p_i = \cfrac{r_i + \lambda}{R + 2 \lambda}$
• $q_i = \cfrac{n_i - r_i + \lambda}{N - R + 2 \lambda}$
• here we add some distortion value $\lambda$ (can be e.g. $\lambda = 0.5$) to avoid getting logs with 0
• It's Laplace Smoothing

## BM25 Retrieval Function

Use 2-Poisson Mixture Model with a Term Frequency formula

BM25:

• $\text{tf}(t, D)$ - how many times $t$ occurs in document $D$
• $\text{df}(t \mid C)$ - how many documents in the corpus $C$ contain $t$
• $$\text{bm25}(Q, D \mid C) = \sum_{t \in Q, D} \ln \cfrac{|C| - \text{df}(t \mid C) + 0.5}{\text{df}(t \mid C) + 0.5} \cdot \cfrac{(k_1 + 1) \, \text{tf}(t, D)}{k_1 \Big( (1 - b) + b \, | D| / | \bar D | \Big) + \text{tf}(t, D)} \cdot \cfrac{(k_3 + 1) \, \text{tf}(t, Q)}{k_3 + \text{tf}(t, Q)}$$
• where $| D |$ the length of $D$ and $| \bar D |$ is the average length of a document in $C$
• params: $t_1 \in [1, 2]$, $b = 0.75$ and $k_3 \in [0, 1000]$
• Note that BM25 formula is very similar to TF-IDFs

See

• Robertson, Stephen E., and Steve Walker. "Some simple effective approximations to the 2-poisson model for probabilistic weighted retrieval." 1994. [3]