Probabilistic Retrieval Model
Probabilistic Ranking Principle
Due to Robertson 1977
- relevance = “what is the probability that document $D$ is relevant to the query $Q$?”
Assumptions
- ’'’ranking assumption’’’: usefulness of a relevant document depends on the number of relevant documents the user has already seen
- the more documents we see - the less useful they are
- ’'’independence assumption’’’: relevance of $D_i$ to $Q$ is independent to other documents $D_j$ from the collection
- therefore we can apply it for each document separately
Relevance Function
- $R = { r, \lnot r }$ a binary random variable that indicates relevance
- let $r$ represent the event that $D$ is relevant
- $\lnot r$ represent the event that $D$ is not relevant
We need estimate the probability of relevance of a document $D$ w.r.t a query $Q$
- Need to find:
- $P(R = r \mid D, Q)$ - prob that $D$ is relevant to $Q$
- $P(R = \lnot r \mid D, Q)$ - prob that $D$ is not relevant to $Q$
- So we need to estimate (learn) $P(R \mid D, Q)$
Descriptive Model Approach
Sometimes referred as a “Machine Learning” approach
It then becomes a Classification Problem:
- can learn $P(R \mid D, Q)$ with a Descriptive Model
- $R$ depends on features that characterize the relationships between $D$ and $Q$
- for example, # of matched terms
Suppose
- we have $k$ features $F_i(Q, D)$
- $f$ is a function with parameter $\Lambda$ s.t. $f(F_1, \ … \ , F_k, \Lambda) = P(R \mid D, Q)$
- then can fit $f$ with some ML algorithm e.g. OLS Regression, Logistic Regression or Kernel Ridge Regression
Comments:
- will have to spend a lot of time engineering good features
- also model has to learn to rank rather than just classify
- but sometimes it will work even better than Vector Space Models
- such models can use scores from other IR techniques and combine them + use some extra ones
Literature:
- Joachims, Thorsten, et al. “Learning to rank for information retrieval.” 2007. [http://www.sigir.org/files/forum/2007D/2007d_sigirforum_joachims.pdf]
- Liu, Tie-Yan. “Learning to rank for information retrieval.” 2009. [http://didawikinf.di.unipi.it/lib/exe/fetch.php/magistraleinformatica/ir/ir13/1_-_learning_to_rank.pdf]
Generative Models Approach
Can use the Bayes Rule to infer the probabilities
- $P(R = r \mid D, Q) = \cfrac{P(D, Q \mid R = r) \, P(R = r)}{P(D, Q)}$
- $P(R = \lnot r \mid D, Q) = \cfrac{P(D, Q \mid R = \lnot r) \, P(R = \lnot r)}{P(D, Q)}$
- and then compare $P(R = r \mid D, Q)$ and $P(R = \lnot r \mid D, Q)$
Interpretation:
- $P(D, Q \mid R = r)$ probability that if we know that a relevant document is retrieved, it’s $D$
- $P(R = r)$ probability of retrieving a relevant document
- $P(D, Q)$ probability of retrieving $D$ and issuing $Q$
Log Odds Ratio:
- it’s the same as using Odds ratio:
- if $\cfrac{P(R = r \mid D, Q)}{P(R = \lnot \mid D, Q)} > 1$ or $\log \cfrac{P(R = r \mid D, Q)}{P(R = \lnot \mid D, Q)} > 0$ then $D$ is relevant w.r.t $Q$
- so again the formulated the problem as two-category Document Classification
- but we’re more interested in the scores rather than class outcome
Binary Independence Retrieval Model
BIR Model:
- classical probabilistic IR Model
- assumes that terms are independent: it’s a “Naive Bayes Classifier” for IR
Documents are represented by binary vectors
- if a term is present in a document, it’s 1, otherwise it’s 0
- $T = {T_i }$ with $T_i = 1$ if $w_i$ is present in the document $D$
Ranking
Can rank using log odds:
- use $s(Q, D) = \log \cfrac{P(R = r \mid D, Q)}{P(R = \lnot r \mid D, Q)} = \log \cfrac{P(D, Q \mid R = r) \, P(R = r)}{P(D, Q \mid R = \lnot r) \, P(R = \lnot r)}$
- $P(R = r)$ and $P(R = \lnot r)$ are just constants and will not change relative positions of documents in the rating, so let’s remove them:
- $s(Q, D) = \log \cfrac{P(D, Q \mid R = r)}{P(D, Q \mid R = \lnot r)}$
- by independence have:
- $s(Q, D) = \log \cfrac{\prod P(T_i \mid Q, R = r)}{\prod P(T_i \mid Q, R = \lnot r)} = \log \prod \cfrac{P(T_i \mid Q, R = r)}{P(T_i \mid Q, R = \lnot r)} = \sum \log \cfrac{P(T_i \mid Q, R = r)}{P(T_i \mid Q, R = \lnot r)}$
- can sum only over words present in the query:
- $s(Q, D) = \sum\limits_{i \in Q} \log \cfrac{P(T_i \mid Q, R = r)}{P(T_i \mid Q, R = \lnot r)}$
- now let $p_i = P(T_1 = 1 \mid Q, R = r)$ and $q_i = P(T_1 = 1 \mid Q, R = \lnot r)$. then
- $s(Q, D) = \sum\limits_{i \in Q \cap D} \log \cfrac{p_i \, (1 - q_i)}{(1 - p_i) \, q_i}$
- let $c_i = \log \cfrac{p_i \, (1 - q_i)}{(1 - p_i) \, q_i}$, so we have
- $s(Q, D) = \sum\limits_{i \in Q \cap D} c_i$
Comments:
- here we take into account only presence/absence of words
- we don’t care about frequency
- we assume a Multiple Bernoulli Event Model: $P(T_i = 1 \mid Q, R) + P(T_i = 0 \mid Q, R) = 0$
Estimation of Probabilities
How to estimate $p_i$ and $q_i$?
Robertson / Sparck Jones Formula
- based on the training IR test corpus
- and also can be account relevance feedback
Notation
- given a query $Q$ and a corpus $C$
- let $N$ be the number of documents in the corpus
- let $R$ be the number of documents relevant to $Q$
- $n_i$ # of docs that have term $w_i$
- $r_i$ # of relevant docs that have term $w_i$
Then:
- $p_i = \cfrac{r_i + \lambda}{R + 2 \lambda}$
- $q_i = \cfrac{n_i - r_i + \lambda}{N - R + 2 \lambda}$
- here we add some distortion value $\lambda$ (can be e.g. $\lambda = 0.5$) to avoid getting logs with 0
- It’s Laplace Smoothing
BM25 Retrieval Function
Use 2-Poisson Mixture Model with a Term Frequency formula
BM25:
- $\text{tf}(t, D)$ - how many times $t$ occurs in document $D$
- $\text{df}(t \mid C)$ - how many documents in the corpus $C$ contain $t$
-
$$\text{bm25}(Q, D \mid C) = \sum_{t \in Q, D} \ln \cfrac- where $ D $ the length of $D$ and $ \bar D $ is the average length of a document in $C$ - params: $t_1 \in [1, 2]$, $b = 0.75$ and $k_3 \in [0, 1000]$ - Note that BM25 formula is very similar to TF-IDFs
See
- Robertson, Stephen E., and Steve Walker. “Some simple effective approximations to the 2-poisson model for probabilistic weighted retrieval.” 1994. [http://nclt.computing.dcu.ie/~gjones/Teaching/CA437/p232.pdf]
Other Probabilistic Models
Sources
- Information Retrieval (UFRT)
- Zhai, ChengXiang. “Statistical language models for information retrieval.” 2008.