Information Retrieval Models

General goal of an Information Retrieval system: rank relevant items much higher than non-relevant ones

  • to do it, the items must be scored

Retrieval function is a scoring function that's used to rank documents

  • retrieval function is based on a retrieval model
  • Retrieval Model defines the notion of relevance and makes it possible to rank the documents

There are 5 categories of IR models

  • they define the retrieval function in different ways
  • they also differ in how they define/measure relevance

Similarity-Based Models

  • main assumption: relevance of a document $D$ to the query $Q$ is correlated with $\text{similarity}(Q, D)$
  • i.e. the more similar a document $D$ is to the query $Q$, the more relevant $D$ is to $Q$
  • potentially can use any similarity function
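
To illustrate that the similarity function is pluggable, here is a minimal sketch that ranks documents with Jaccard similarity over term sets. The names `jaccard` and `rank_by_similarity`, the whitespace tokenization and the toy documents are assumptions made for this example only; any other similarity function with the same signature could be passed instead.

```python
def jaccard(query_terms, doc_terms):
    """Jaccard similarity: |intersection| / |union| of the two term sets."""
    q, d = set(query_terms), set(doc_terms)
    if not q and not d:
        return 0.0
    return len(q & d) / len(q | d)


def rank_by_similarity(query, docs, similarity):
    """Score every document against the query and sort by decreasing score."""
    q_terms = query.lower().split()
    scored = [(similarity(q_terms, doc.lower().split()), doc) for doc in docs]
    # sorted() is stable, so ties keep their original order
    return [doc for _, doc in sorted(scored, key=lambda p: p[0], reverse=True)]


# Toy collection (hypothetical data)
docs = [
    "vector space model",
    "probabilistic relevance model",
    "cooking pasta",
]
ranked = rank_by_similarity("vector model", docs, jaccard)
```

Here the retrieval function is simply `similarity(Q, D)`, so changing the notion of relevance means swapping the function that is passed in.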

Algebraic Model

Vector Space Models are most well-known

  • use Bag-of-Words to build a vector space
  • both documents and the query are represented as vectors in this space
  • each term is assigned some weight that reflects the importance of this term
  • and then we use Cosine Similarity or Inner Product to rank documents
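
The steps above can be sketched roughly as follows. This is a minimal illustration, not a reference implementation: it assumes whitespace tokenization and uses raw term counts as weights (a real system would more likely use something like TF-IDF). The helper names `bow_vector`, `cosine` and `rank` and the toy documents are made up for the example.

```python
import math
from collections import Counter


def bow_vector(text):
    """Bag-of-words vector as a sparse dict: term -> raw count (assumed weight)."""
    return Counter(text.lower().split())


def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * v[t] for t, w in u.items() if t in v)
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def rank(query, docs):
    """Return (doc_index, score) pairs sorted by decreasing cosine similarity."""
    q = bow_vector(query)
    scores = [(i, cosine(q, bow_vector(d))) for i, d in enumerate(docs)]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)


# Toy collection (hypothetical data)
docs = [
    "information retrieval ranks documents",
    "cooking pasta at home",
    "retrieval models score documents for a query",
]
ranking = rank("retrieval of documents", docs)
```

Both the query and the documents live in the same term space, so the same `bow_vector` function is applied to each; replacing `cosine` with a plain dot product gives the Inner Product variant.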

It's a framework that defines:

  • Term VSM: how documents and queries are represented (by terms they have)
  • Similarity measure defined on this vector space
  • also it has Document VSM: how terms are represented (terms are represented by documents where they are used) - but it's not very relevant for IR


Probabilistic Relevance Models

Probabilistic Retrieval Model

  • relevance = "what is the probability that document $D$ is relevant to the query $Q$?"
  • Binary Independence Retrieval - classical probabilistic IR model, assumes term independence
  • it's sort of "Naive Bayes Classifier" for IR
  • the BM25 Ranking Function performs comparably to TF-IDF weighting
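
For reference, a small sketch of BM25 scoring in its commonly cited Okapi form. Several things here are assumptions rather than part of these notes: the smoothed IDF variant $\ln\left(\frac{N - n + 0.5}{n + 0.5} + 1\right)$, the parameter values `k1=1.5` and `b=0.75` (typical defaults), whitespace tokenization, a non-empty document collection, and the function name `bm25_scores`.

```python
import math
from collections import Counter


def bm25_scores(query, docs, k1=1.5, b=0.75):
    """Return one BM25 score per document (k1 and b are assumed defaults)."""
    tokenized = [d.lower().split() for d in docs]
    n_docs = len(tokenized)
    avg_len = sum(len(d) for d in tokenized) / n_docs

    # Document frequency: in how many documents each term occurs
    df = Counter()
    for d in tokenized:
        df.update(set(d))

    scores = []
    for d in tokenized:
        tf = Counter(d)
        score = 0.0
        for term in query.lower().split():
            if term not in tf:
                continue
            # Smoothed IDF variant that never goes negative
            idf = math.log((n_docs - df[term] + 0.5) / (df[term] + 0.5) + 1)
            # Term-frequency saturation with document-length normalization
            denom = tf[term] + k1 * (1 - b + b * len(d) / avg_len)
            score += idf * tf[term] * (k1 + 1) / denom
        scores.append(score)
    return scores


# Toy collection (hypothetical data)
docs = [
    "information retrieval ranks documents",
    "cooking pasta at home",
    "retrieval models score documents for a query",
]
scores = bm25_scores("retrieval documents", docs)
```

In this toy collection the first and third documents contain both query terms once, but the first is shorter than average, so length normalization gives it the higher score.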

Probabilistic Inference Models

Decision-Theoretic Retrieval Framework

Query Likelihood Retrieval Model

Query Likelihood scoring method