Probabilistic Retrieval Model

Probabilistic Ranking Principle

Due to Robertson 1977

  • relevance = “what is the probability that document $D$ is relevant to the query $Q$?”

Assumptions

  • “ranking assumption”: usefulness of a relevant document depends on the number of relevant documents the user has already seen
    • the more documents we see - the less useful they are
  • “independence assumption”: relevance of $D_i$ to $Q$ is independent of other documents $D_j$ in the collection
    • therefore we can apply it for each document separately

Relevance Function

  • $R = \{ r, \lnot r \}$ a binary random variable that indicates relevance
  • let $r$ represent the event that $D$ is relevant
  • $\lnot r$ represent the event that $D$ is not relevant

We need to estimate the probability of relevance of a document $D$ w.r.t a query $Q$

  • Need to find:
  • $P(R = r \mid D, Q)$ - prob that $D$ is relevant to $Q$
  • $P(R = \lnot r \mid D, Q)$ - prob that $D$ is not relevant to $Q$
  • So we need to estimate (learn) $P(R \mid D, Q)$

Descriptive Model Approach

Sometimes referred to as a “Machine Learning” approach

It then becomes a Classification Problem:

  • can learn $P(R \mid D, Q)$ with a Descriptive Model
  • $R$ depends on features that characterize the relationships between $D$ and $Q$
  • for example, # of matched terms
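A minimal sketch of this approach, assuming a toy feature set and toy training pairs (the features, the data and the choice of logistic regression are illustrative, not part of the original model):

    # Learn P(R | D, Q) as a binary classifier over features of the (Q, D) pair
    from sklearn.linear_model import LogisticRegression

    def features(query, doc):
        """Toy feature vector describing how the document matches the query."""
        q, d = set(query.split()), set(doc.split())
        matched = len(q & d)                          # number of matched terms
        return [matched, matched / max(len(q), 1), len(d)]

    # Invented training pairs: (query, document, relevance label)
    train = [
        ("probabilistic retrieval", "probabilistic retrieval models for IR", 1),
        ("probabilistic retrieval", "cooking recipes and kitchen tips", 0),
        ("learning to rank", "learning to rank for information retrieval", 1),
        ("learning to rank", "history of ancient rome", 0),
    ]
    X = [features(q, d) for q, d, _ in train]
    y = [r for _, _, r in train]

    clf = LogisticRegression().fit(X, y)

    # Estimated P(R = r | D, Q) for a new pair; documents are ranked by this score
    print(clf.predict_proba([features("probabilistic retrieval", "retrieval models")])[0][1])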

Comments:

  • we will have to spend a lot of time engineering good features
  • also, the model has to learn to rank rather than just classify
  • but sometimes it will work even better than Vector Space Models
  • such models can combine scores from other IR techniques and add extra features on top

Literature:

  • Joachims, Thorsten, et al. “Learning to rank for information retrieval.” 2007. [http://www.sigir.org/files/forum/2007D/2007d_sigirforum_joachims.pdf]
  • Liu, Tie-Yan. “Learning to rank for information retrieval.” 2009. [http://didawikinf.di.unipi.it/lib/exe/fetch.php/magistraleinformatica/ir/ir13/1_-_learning_to_rank.pdf]

Generative Models Approach

Can use the Bayes Rule to infer the probabilities

  • $P(R = r \mid D, Q) = \cfrac{P(D, Q \mid R = r) \, P(R = r)}{P(D, Q)}$
  • $P(R = \lnot r \mid D, Q) = \cfrac{P(D, Q \mid R = \lnot r) \, P(R = \lnot r)}{P(D, Q)}$
  • and then compare $P(R = r \mid D, Q)$ and $P(R = \lnot r \mid D, Q)$

Interpretation:

  • $P(D, Q \mid R = r)$ - probability that, given that a relevant document was retrieved, the pair is $(D, Q)$
  • $P(R = r)$ - probability of retrieving a relevant document
  • $P(D, Q)$ - probability of retrieving $D$ and issuing $Q$

Log Odds Ratio:

  • it’s the same as using Odds ratio:
  • if $\cfrac{P(R = r \mid D, Q)}{P(R = \lnot r \mid D, Q)} > 1$ or $\log \cfrac{P(R = r \mid D, Q)}{P(R = \lnot r \mid D, Q)} > 0$ then $D$ is relevant w.r.t $Q$
  • so again we have formulated the problem as two-category Document Classification
  • but we’re more interested in the scores rather than class outcome
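A tiny numeric illustration (all numbers are invented) showing that the decision only needs the odds ratio, so $P(D, Q)$ never has to be computed:

    import math

    # Invented quantities for a single (D, Q) pair
    p_dq_given_r    = 0.05   # P(D, Q | R = r)
    p_dq_given_notr = 0.005  # P(D, Q | R = not r)
    p_r             = 0.1    # P(R = r)
    p_notr          = 0.9    # P(R = not r)

    # P(D, Q) cancels when taking the ratio of the two posteriors
    log_odds = math.log((p_dq_given_r * p_r) / (p_dq_given_notr * p_notr))

    print(log_odds)                                   # ~0.105 > 0
    print("relevant" if log_odds > 0 else "not relevant")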

Binary Independence Retrieval Model

BIR Model:

  • classical probabilistic IR Model
  • assumes that terms are independent: it’s a “Naive Bayes Classifier” for IR

Documents are represented by binary vectors

  • if a term is present in a document, it’s 1, otherwise it’s 0
  • $T = \{ T_i \}$ with $T_i = 1$ if $w_i$ is present in the document $D$
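For concreteness, a small sketch (vocabulary and document are invented) of turning a document into such a binary vector:

    # Binary term-presence vector over a fixed vocabulary; frequencies are ignored
    vocabulary = ["retrieval", "probabilistic", "ranking", "recipe"]

    def binary_vector(doc, vocab):
        words = set(doc.lower().split())
        return [1 if w in words else 0 for w in vocab]

    print(binary_vector("Probabilistic retrieval ranks documents by relevance", vocabulary))
    # -> [1, 1, 0, 0]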

Ranking

Can rank using log odds:

  • use $s(Q, D) = \log \cfrac{P(R = r \mid D, Q)}{P(R = \lnot r \mid D, Q)} = \log \cfrac{P(D, Q \mid R = r) \, P(R = r)}{P(D, Q \mid R = \lnot r) \, P(R = \lnot r)}$
  • $P(R = r)$ and $P(R = \lnot r)$ are just constants and will not change relative positions of documents in the ranking, so let’s remove them:
  • $s(Q, D) = \log \cfrac{P(D, Q \mid R = r)}{P(D, Q \mid R = \lnot r)}$
  • by independence have:
  • $s(Q, D) = \log \cfrac{\prod P(T_i \mid Q, R = r)}{\prod P(T_i \mid Q, R = \lnot r)} = \log \prod \cfrac{P(T_i \mid Q, R = r)}{P(T_i \mid Q, R = \lnot r)} = \sum \log \cfrac{P(T_i \mid Q, R = r)}{P(T_i \mid Q, R = \lnot r)}$
  • assuming that terms not in the query are equally likely in relevant and non-relevant documents, their contributions cancel, so we can sum only over words present in the query:
  • $s(Q, D) = \sum\limits_{i \in Q} \log \cfrac{P(T_i \mid Q, R = r)}{P(T_i \mid Q, R = \lnot r)}$
  • now let $p_i = P(T_i = 1 \mid Q, R = r)$ and $q_i = P(T_i = 1 \mid Q, R = \lnot r)$, then
  • $s(Q, D) = \sum\limits_{i \in Q \cap D} \log \cfrac{p_i}{q_i} + \sum\limits_{i \in Q \setminus D} \log \cfrac{1 - p_i}{1 - q_i}$
  • adding and subtracting $\sum\limits_{i \in Q \cap D} \log \cfrac{1 - p_i}{1 - q_i}$ turns the second sum into $\sum\limits_{i \in Q} \log \cfrac{1 - p_i}{1 - q_i}$, which does not depend on $D$ and can be dropped, so
  • $s(Q, D) = \sum\limits_{i \in Q \cap D} \log \cfrac{p_i \, (1 - q_i)}{(1 - p_i) \, q_i}$
  • let $c_i = \log \cfrac{p_i \, (1 - q_i)}{(1 - p_i) \, q_i}$, so we have
  • $s(Q, D) = \sum\limits_{i \in Q \cap D} c_i$
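A minimal sketch of this scoring rule, assuming the per-term probabilities $p_i$ and $q_i$ are already known (the values below are invented; their estimation is discussed next):

    import math

    def bir_score(query_terms, doc_terms, p, q):
        """s(Q, D) = sum of c_i over terms present in both the query and the document."""
        score = 0.0
        for t in set(query_terms) & set(doc_terms):
            p_i, q_i = p[t], q[t]
            c_i = math.log(p_i * (1 - q_i) / ((1 - p_i) * q_i))
            score += c_i
        return score

    # Invented probabilities: p_i = P(T_i = 1 | Q, R = r), q_i = P(T_i = 1 | Q, R = not r)
    p = {"probabilistic": 0.8, "retrieval": 0.7}
    q = {"probabilistic": 0.2, "retrieval": 0.3}

    print(bir_score(["probabilistic", "retrieval"],
                    ["probabilistic", "retrieval", "models"], p, q))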

Comments:

  • here we take into account only presence/absence of words
  • we don’t care about frequency
  • we assume a Multiple Bernoulli Event Model: $P(T_i = 1 \mid Q, R) + P(T_i = 0 \mid Q, R) = 1$

Estimation of Probabilities

How to estimate $p_i$ and $q_i$?

Robertson / Sparck Jones Formula

  • based on a training IR test corpus with relevance judgments
  • can also take relevance feedback into account

Notation

  • given a query $Q$ and a corpus $C$
  • let $N$ be the number of documents in the corpus
  • let $R$ be the number of documents relevant to $Q$
  • $n_i$ # of docs that have term $w_i$
  • $r_i$ # of relevant docs that have term $w_i$

Then:

  • $p_i = \cfrac{r_i + \lambda}{R + 2 \lambda}$
  • $q_i = \cfrac{n_i - r_i + \lambda}{N - R + 2 \lambda}$
  • here we add a smoothing value $\lambda$ (e.g. $\lambda = 0.5$) to avoid taking the log of 0
  • It’s Laplace Smoothing
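A short sketch of these estimates with $\lambda = 0.5$ (the counts are invented):

    import math

    def rsj_estimates(n_i, r_i, N, R, lam=0.5):
        """Smoothed estimates of p_i and q_i from corpus and relevance counts."""
        p_i = (r_i + lam) / (R + 2 * lam)
        q_i = (n_i - r_i + lam) / (N - R + 2 * lam)
        return p_i, q_i

    # Invented counts: N=1000 documents, R=20 relevant to Q,
    # term w_i occurs in n_i=100 documents overall and in r_i=15 of the relevant ones
    p_i, q_i = rsj_estimates(n_i=100, r_i=15, N=1000, R=20)

    # term weight c_i used in the BIR score
    c_i = math.log(p_i * (1 - q_i) / ((1 - p_i) * q_i))
    print(p_i, q_i, c_i)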

BM25 Retrieval Function

Uses a 2-Poisson Mixture Model together with a Term Frequency formula

BM25:

  • $\text{tf}(t, D)$ - how many times $t$ occurs in document $D$
  • $\text{df}(t \mid C)$ - how many documents in the corpus $C$ contain $t$
  • $$\text{bm25}(Q, D \mid C) = \sum\limits_{t \in Q, D} \ln \cfrac{N - \text{df}(t \mid C) + 0.5}{\text{df}(t \mid C) + 0.5} \cdot \cfrac{(k_1 + 1) \, \text{tf}(t, D)}{k_1 \left( (1 - b) + b \, \cfrac{|D|}{\bar{D}} \right) + \text{tf}(t, D)} \cdot \cfrac{(k_3 + 1) \, \text{tf}(t, Q)}{k_3 + \text{tf}(t, Q)}$$
    • where $N$ is the number of documents in $C$, $|D|$ is the length of $D$ and $\bar{D}$ is the average length of a document in $C$
    • params: $k_1 \in [1, 2]$, $b = 0.75$ and $k_3 \in [0, 1000]$
  • Note that the BM25 formula is very similar to TF-IDF
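A sketch of this scoring function in code (the toy corpus is invented, and $k_1 = 1.2$, $b = 0.75$, $k_3 = 1000$ are just example settings):

    import math
    from collections import Counter

    def bm25(query, doc, corpus, k1=1.2, b=0.75, k3=1000):
        """BM25 score of a tokenized doc for a tokenized query over a list of tokenized docs."""
        N = len(corpus)
        avg_len = sum(len(d) for d in corpus) / N
        tf_d, tf_q = Counter(doc), Counter(query)
        score = 0.0
        for t in set(query) & set(doc):
            df = sum(1 for d in corpus if t in d)                   # df(t | C)
            idf = math.log((N - df + 0.5) / (df + 0.5))             # RSJ-style idf
            tf_part = (k1 + 1) * tf_d[t] / (k1 * ((1 - b) + b * len(doc) / avg_len) + tf_d[t])
            qtf_part = (k3 + 1) * tf_q[t] / (k3 + tf_q[t])
            score += idf * tf_part * qtf_part
        return score

    corpus = [
        "probabilistic retrieval models rank documents".split(),
        "binary independence retrieval model".split(),
        "cooking recipes and kitchen tips".split(),
        "history of ancient rome".split(),
    ]
    print(bm25("probabilistic retrieval".split(), corpus[0], corpus))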

See

  • Robertson, Stephen E., and Steve Walker. “Some simple effective approximations to the 2-poisson model for probabilistic weighted retrieval.” 1994. [http://nclt.computing.dcu.ie/~gjones/Teaching/CA437/p232.pdf]
