Probabilistic Retrieval Model
Probabilistic Ranking Principle
Due to Robertson 1977
- relevance = "what is the probability that document $D$ is relevant to the query $Q$?"
Assumptions
- ranking assumption: usefulness of a relevant document depends on the number of relevant documents the user has already seen
- the more documents we see - the less useful they are
- independence assumption: relevance of $D_i$ to $Q$ is independent to other documents $D_j$ from the collection
- therefore we can apply it for each document separately
Relevance Function
- $R = \{ r, \lnot r \}$ a binary random variable that indicates relevance
- let $r$ represent the event that $D$ is relevant
- $\lnot r$ represent the event that $D$ is not relevant
We need estimate the probability of relevance of a document $D$ w.r.t a query $Q$
- Need to find:
- $P(R = r \mid D, Q)$ - prob that $D$ is relevant to $Q$
- $P(R = \lnot r \mid D, Q)$ - prob that $D$ is not relevant to $Q$
- So we need to estimate (learn) $P(R \mid D, Q)$
Sometimes referred as a "Machine Learning" approach
It then becomes a Classification Problem:
- can learn $P(R \mid D, Q)$ with a Descriptive Model
- $R$ depends on features that characterize the relationships between $D$ and $Q$
- for example, # of matched terms
Suppose
Comments:
- will have to spend a lot of time engineering good features
- also model has to learn to rank rather than just classify
- but sometimes it will work even better than Vector Space Models
- such models can use scores from other IR techniques and combine them + use some extra ones
Literature:
- Joachims, Thorsten, et al. "Learning to rank for information retrieval." 2007. [1]
- Liu, Tie-Yan. "Learning to rank for information retrieval." 2009. [2]
Can use the Bayes Rule to infer the probabilities
- $P(R = r \mid D, Q) = \cfrac{P(D, Q \mid R = r) \, P(R = r)}{P(D, Q)}$
- $P(R = \lnot r \mid D, Q) = \cfrac{P(D, Q \mid R = \lnot r) \, P(R = \lnot r)}{P(D, Q)}$
- and then compare $P(R = r \mid D, Q)$ and $P(R = \lnot r \mid D, Q)$
Interpretation:
- $P(D, Q \mid R = r)$ probability that if we know that a relevant document is retrieved, it's $D$
- $P(R = r)$ probability of retrieving a relevant document
- $P(D, Q)$ probability of retrieving $D$ and issuing $Q$
Log Odds Ratio:
- it's the same as using Odds ratio:
- if $\cfrac{P(R = r \mid D, Q)}{P(R = \lnot \mid D, Q)} > 1$ or $\log \cfrac{P(R = r \mid D, Q)}{P(R = \lnot \mid D, Q)} > 0$ then $D$ is relevant w.r.t $Q$
- so again the formulated the problem as two-category Document Classification
- but we're more interested in the scores rather than class outcome
Binary Independence Retrieval Model
BIR Model:
Documents are represented by binary vectors
- if a term is present in a document, it's 1, otherwise it's 0
- $T = \{T_i \}$ with $T_i = 1$ if $w_i$ is present in the document $D$
Ranking
Can rank using log odds:
- use $s(Q, D) = \log \cfrac{P(R = r \mid D, Q)}{P(R = \lnot r \mid D, Q)} = \log \cfrac{P(D, Q \mid R = r) \, P(R = r)}{P(D, Q \mid R = \lnot r) \, P(R = \lnot r)}$
- $P(R = r)$ and $P(R = \lnot r)$ are just constants and will not change relative positions of documents in the rating, so let's remove them:
- $s(Q, D) = \log \cfrac{P(D, Q \mid R = r)}{P(D, Q \mid R = \lnot r)}$
- by independence have:
- $s(Q, D) = \log \cfrac{\prod P(T_i \mid Q, R = r)}{\prod P(T_i \mid Q, R = \lnot r)} = \log \prod \cfrac{P(T_i \mid Q, R = r)}{P(T_i \mid Q, R = \lnot r)} = \sum \log \cfrac{P(T_i \mid Q, R = r)}{P(T_i \mid Q, R = \lnot r)}$
- can sum only over words present in the query:
- $s(Q, D) = \sum\limits_{i \in Q} \log \cfrac{P(T_i \mid Q, R = r)}{P(T_i \mid Q, R = \lnot r)}$
- now let $p_i = P(T_1 = 1 \mid Q, R = r)$ and $q_i = P(T_1 = 1 \mid Q, R = \lnot r)$. then
- $s(Q, D) = \sum\limits_{i \in Q \cap D} \log \cfrac{p_i \, (1 - q_i)}{(1 - p_i) \, q_i}$
- let $c_i = \log \cfrac{p_i \, (1 - q_i)}{(1 - p_i) \, q_i}$, so we have
- $s(Q, D) = \sum\limits_{i \in Q \cap D} c_i$
Comments:
- here we take into account only presence/absence of words
- we don't care about frequency
- we assume a Multiple Bernoulli Event Model: $P(T_i = 1 \mid Q, R) + P(T_i = 0 \mid Q, R) = 0$
Estimation of Probabilities
How to estimate $p_i$ and $q_i$?
Robertson / Sparck Jones Formula
- based on the training IR test corpus
- and also can be account relevance feedback
Notation
- given a query $Q$ and a corpus $C$
- let $N$ be the number of documents in the corpus
- let $R$ be the number of documents relevant to $Q$
- $n_i$ # of docs that have term $w_i$
- $r_i$ # of relevant docs that have term $w_i$
Then:
- $p_i = \cfrac{r_i + \lambda}{R + 2 \lambda}$
- $q_i = \cfrac{n_i - r_i + \lambda}{N - R + 2 \lambda}$
- here we add some distortion value $\lambda$ (can be e.g. $\lambda = 0.5$) to avoid getting logs with 0
- It's Laplace Smoothing
BM25 Retrieval Function
Use 2-Poisson Mixture Model with a Term Frequency formula
BM25:
- $\text{tf}(t, D)$ - how many times $t$ occurs in document $D$
- $\text{df}(t \mid C)$ - how many documents in the corpus $C$ contain $t$
- $$\text{bm25}(Q, D \mid C) = \sum_{t \in Q, D} \ln \cfrac{|C| - \text{df}(t \mid C) + 0.5}{\text{df}(t \mid C) + 0.5} \cdot \cfrac{(k_1 + 1) \, \text{tf}(t, D)}{k_1 \Big( (1 - b) + b \, | D| / | \bar D | \Big) + \text{tf}(t, D)} \cdot \cfrac{(k_3 + 1) \, \text{tf}(t, Q)}{k_3 + \text{tf}(t, Q)}$$
- where $| D |$ the length of $D$ and $| \bar D |$ is the average length of a document in $C$
- params: $t_1 \in [1, 2]$, $b = 0.75$ and $k_3 \in [0, 1000]$
- Note that BM25 formula is very similar to TF-IDFs
See
- Robertson, Stephen E., and Steve Walker. "Some simple effective approximations to the 2-poisson model for probabilistic weighted retrieval." 1994. [3]
Other Probabilistic Models
Sources