# ML Wiki

## Information Retrieval Models

General goal of an Information Retrieval system: rank relevant items much higher than non-relevant ones

• to do it, the items must be scored

Retrieval function is a scoring function that's used to rank documents

• retrieval function is based on a retrieval model
• Retrieval Model defines the notion of relevance and makes it possible to rank the documents

There are 5 categories of IR models

• they define the retrieval function in different ways
• they also differ in how they define/measure relevance

## Similarity-Based Models

• main assumption: relevance of a document $D$ to the query $Q$ is correlated with $\text{similarity}(Q, D)$
• i.e. the more similar a document $D$ is to the query $Q$, the more relevant $D$ is to $Q$
• potentially can use any similarity function

### Algebraic Model

Vector Space Models are the best-known algebraic models

• use Bag-of-Words to build a vector space
• both documents and the query are represented as vectors in this space
• each term is assigned some weight that reflects the importance of this term
• and then we use Cosine Similarity or Inner Product between the query vector and each document vector to rank the documents
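
The steps above can be sketched in a few lines of Python. This is a minimal illustration, not a reference implementation: the whitespace tokenizer, the smoothed IDF formula and all function names (`tokenize`, `build_idf`, `to_vector`, `cosine`, `rank`) are assumptions made for the example.

```python
import math
from collections import Counter


def tokenize(text):
    # Assumption: lowercase + whitespace split stands in for a real tokenizer.
    return text.lower().split()


def build_idf(docs_tokens):
    # Document frequency: in how many documents each term occurs.
    n_docs = len(docs_tokens)
    df = Counter()
    for tokens in docs_tokens:
        df.update(set(tokens))
    # Assumption: a smoothed IDF variant, so every weight stays positive.
    return {t: math.log((1 + n_docs) / (1 + d)) + 1.0 for t, d in df.items()}


def to_vector(tokens, idf):
    # Bag-of-Words vector stored sparsely as {term: tf * idf}.
    # Terms that never occur in the collection are dropped.
    tf = Counter(tokens)
    return {t: c * idf[t] for t, c in tf.items() if t in idf}


def cosine(u, v):
    dot = sum(w * v[t] for t, w in u.items() if t in v)
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def rank(query, docs):
    # Returns (score, document index) pairs, best match first.
    docs_tokens = [tokenize(d) for d in docs]
    idf = build_idf(docs_tokens)
    q_vec = to_vector(tokenize(query), idf)
    scored = [(cosine(q_vec, to_vector(t, idf)), i)
              for i, t in enumerate(docs_tokens)]
    return sorted(scored, key=lambda s: (-s[0], s[1]))
```

Calling `rank("information retrieval", docs)` on a list of strings gives the document indices ordered by cosine similarity to the query. Replacing `cosine` with a plain dot product would give the Inner Product variant.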

It's a framework that defines:

• Term VSM: how documents and queries are represented (by terms they have)
• Similarity measure defined on this vector space
• also it has Document VSM: how terms are represented (terms are represented by documents where they are used) - but it's not very relevant for IR

## Probabilistic Relevance Models

• relevance = "what is the probability that document $D$ is relevant to the query $Q$?"
• Binary Independence Retrieval - classical probabilistic IR model, assumes term independence
• it's sort of "Naive Bayes Classifier" for IR
• BM25 Ranking Function performs comparably to TF-IDF weighting
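
To make the BM25 bullet concrete, here is a sketch of one common formulation of the Okapi BM25 score. The parameter defaults `k1=1.5` and `b=0.75`, the IDF variant with `+ 1` inside the logarithm (which keeps it non-negative) and the function name `bm25_scores` are assumptions for this example; other variants exist.

```python
import math
from collections import Counter


def bm25_scores(query_tokens, docs_tokens, k1=1.5, b=0.75):
    """Return one BM25 score per document (higher = more relevant)."""
    if not docs_tokens:
        return []
    n_docs = len(docs_tokens)
    avg_len = sum(len(d) for d in docs_tokens) / n_docs

    # Document frequency of every term in the collection.
    df = Counter()
    for tokens in docs_tokens:
        df.update(set(tokens))

    scores = []
    for tokens in docs_tokens:
        tf = Counter(tokens)
        score = 0.0
        for term in query_tokens:
            if tf[term] == 0:
                continue  # terms absent from the document contribute nothing
            idf = math.log((n_docs - df[term] + 0.5) / (df[term] + 0.5) + 1.0)
            # Length normalization: long documents are penalized via b.
            norm = tf[term] + k1 * (1 - b + b * len(tokens) / avg_len)
            score += idf * tf[term] * (k1 + 1) / norm
        scores.append(score)
    return scores
```

Unlike raw TF-IDF, the term-frequency component saturates as `tf` grows (controlled by `k1`), and `b` controls how strongly document length is normalized.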

## Probabilistic Inference Models

### Query Likelihood Retrieval Model

Query Likelihood scoring method
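
A minimal sketch of query likelihood scoring: each document is treated as a unigram language model, and documents are ranked by the log-probability of generating the query, $\log P(Q \mid D) = \sum_{q \in Q} \log P(q \mid D)$. The smoothing choice here (Jelinek-Mercer interpolation with the collection model, `lam=0.5`) and the function names are assumptions for illustration; the page itself does not prescribe a smoothing method.

```python
import math
from collections import Counter


def query_log_likelihood(query_tokens, doc_tokens, coll_tf, coll_len, lam=0.5):
    """log P(Q | D) under a smoothed unigram language model of the document."""
    tf = Counter(doc_tokens)
    doc_len = len(doc_tokens)
    score = 0.0
    for term in query_tokens:
        p_doc = tf[term] / doc_len if doc_len else 0.0
        p_coll = coll_tf.get(term, 0) / coll_len
        # Assumption: Jelinek-Mercer smoothing (linear interpolation).
        p = (1 - lam) * p_doc + lam * p_coll
        if p == 0.0:
            continue  # term unseen in the whole collection: skipped in this sketch
        score += math.log(p)
    return score


def rank_by_query_likelihood(query_tokens, docs_tokens, lam=0.5):
    # Collection statistics are shared by all documents.
    coll_tf = Counter(t for tokens in docs_tokens for t in tokens)
    coll_len = sum(coll_tf.values())
    scored = [(query_log_likelihood(query_tokens, tokens, coll_tf, coll_len, lam), i)
              for i, tokens in enumerate(docs_tokens)]
    # Log-likelihoods are <= 0; the one closest to zero ranks first.
    return sorted(scored, key=lambda s: (-s[0], s[1]))
```

Smoothing matters here: without it, a single query term missing from a document would drive $P(Q \mid D)$ to zero.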