Information Retrieval Models


The general goal of an Information Retrieval system: rank relevant items much higher than non-relevant ones

A retrieval function is a scoring function that’s used to rank documents

a retrieval function is based on a retrieval model
a retrieval model defines the notion of relevance and makes it possible to rank the documents

There are 5 categories of IR models

Similarity-Based Models

main assumption: the relevance of a document $D$ to a query $Q$ is correlated with $\text{similarity}(Q, D)$
i.e. the more similar a document $D$ is to the query $Q$, the more relevant $D$ is to $Q$
potentially, any similarity function can be used
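
To illustrate that any similarity function can be plugged in, here is a minimal sketch that ranks documents by Jaccard similarity over token sets. The function names `jaccard` and `rank`, the whitespace tokenization and the toy documents are assumptions made for this example only.

```python
def jaccard(a, b):
    """Jaccard similarity between two collections of tokens."""
    a, b = set(a), set(b)
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


def rank(query, docs, similarity=jaccard):
    """Return document indices sorted by decreasing similarity to the query."""
    q = query.lower().split()
    scores = [similarity(q, d.lower().split()) for d in docs]
    return sorted(range(len(docs)), key=lambda i: scores[i], reverse=True)


# Toy collection (hypothetical data).
docs = [
    "the cat sat on the mat",
    "dogs chase cats",
    "information retrieval ranks documents",
]
ranking = rank("ranking documents in information retrieval", docs)
```

Swapping in a different measure only requires passing another function as `similarity`; the ranking logic stays the same.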

Algebraic Model

Vector Space Models are the best known

It’s a framework that defines:

Term VSM: how documents and queries are represented (by the terms they contain)
a similarity measure defined on this vector space
there is also the Document VSM, where terms are represented by the documents in which they are used, but it’s not very relevant for IR
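
A minimal sketch of the Term VSM idea: documents and the query become sparse term-frequency vectors, and cosine similarity serves as the similarity measure on that space. Raw term counts, whitespace tokenization, the helper names `to_vector` and `cosine`, and the toy documents are all assumptions of this sketch; a real system would usually apply a weighting scheme such as TF-IDF.

```python
import math
from collections import Counter


def to_vector(text):
    """Represent a text as a sparse term-frequency vector (term -> count)."""
    return Counter(text.lower().split())


def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * v[t] for t, w in u.items() if t in v)
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


# Toy collection (hypothetical data).
docs = ["apple banana apple", "banana cherry", "cherry cherry date"]
query = to_vector("apple banana")
scores = [cosine(query, to_vector(d)) for d in docs]
best = max(range(len(docs)), key=lambda i: scores[i])
```

Here the first document shares both query terms and gets the highest score, while the third shares none and scores zero.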

Probabilistic Models

relevance = “what is the probability that document $D$ is relevant to the query $Q$?”
Binary Independence Retrieval - the classical probabilistic IR model, assumes term independence
it’s a sort of “Naive Bayes classifier” for IR
the BM25 ranking function performs comparably to TF-IDF weighting
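
As a rough illustration of BM25, the sketch below scores each document against a query using the common Okapi BM25 formula. The parameter defaults `k1=1.5` and `b=0.75`, the smoothed non-negative IDF variant, the whitespace tokenization and the toy documents are assumptions of this sketch, not something fixed by the notes above.

```python
import math
from collections import Counter


def bm25_scores(query, docs, k1=1.5, b=0.75):
    """Score every document in `docs` against `query` with Okapi BM25."""
    tokenized = [d.lower().split() for d in docs]
    n_docs = len(tokenized)
    avg_len = sum(len(d) for d in tokenized) / n_docs
    # Document frequency: number of documents containing each term.
    df = Counter(t for d in tokenized for t in set(d))

    scores = []
    for doc in tokenized:
        tf = Counter(doc)
        score = 0.0
        for term in query.lower().split():
            if term not in tf:
                continue
            # Smoothed IDF variant that stays non-negative (an assumption).
            idf = math.log((n_docs - df[term] + 0.5) / (df[term] + 0.5) + 1.0)
            # Term-frequency saturation with document-length normalization.
            norm = tf[term] + k1 * (1 - b + b * len(doc) / avg_len)
            score += idf * tf[term] * (k1 + 1) / norm
        scores.append(score)
    return scores


# Toy collection (hypothetical data).
docs = [
    "the quick brown fox",
    "the lazy dog",
    "quick quick fox jumps over the lazy dog",
]
scores = bm25_scores("quick fox", docs)
```

The short first document outranks the longer third one even though the latter repeats "quick", because BM25 saturates term frequency and penalizes documents longer than average.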

Query Likelihood scoring method

uses Statistical Language Models from NLP
Ponte, Jay M., and W. Bruce Croft. “A language modeling approach to information retrieval.” 1998. [http://www.cs.unibo.it/~montesi/CBD/Articoli/LanguageModelApproachIR.pdf]
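
A minimal sketch of query likelihood scoring: each document defines a unigram language model, and documents are ranked by $\log P(Q \mid D)$. The sketch uses Jelinek-Mercer smoothing with the collection model, which is a standard choice but not necessarily the estimator used in the paper cited above; the value `lam=0.5`, the function name and the toy documents are likewise assumptions.

```python
import math
from collections import Counter


def query_log_likelihood(query, docs, lam=0.5):
    """Return log P(query | document model) for every document."""
    tokenized = [d.lower().split() for d in docs]
    collection = Counter(t for d in tokenized for t in d)
    collection_len = sum(collection.values())

    scores = []
    for doc in tokenized:
        tf = Counter(doc)
        log_p = 0.0
        for term in query.lower().split():
            # Maximum-likelihood estimate from the document itself.
            p_doc = tf[term] / len(doc) if doc else 0.0
            # Background estimate from the whole collection.
            p_coll = collection[term] / collection_len
            p = (1 - lam) * p_doc + lam * p_coll
            if p == 0.0:
                # Term unseen in the whole collection: skip it (an assumption).
                continue
            log_p += math.log(p)
        scores.append(log_p)
    return scores


# Toy collection (hypothetical data).
docs = ["apple banana apple", "banana cherry", "cherry cherry date"]
scores = query_log_likelihood("apple banana", docs)
```

Smoothing keeps a document from getting zero probability just because it lacks one query term, so the second document still scores above the third.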

Information Retrieval (UFRT)
Zhai, ChengXiang. “Statistical language models for information retrieval.” 2008.
