ML Wiki

Information Retrieval

Information Retrieval is about quickly finding materials in a large collection of unstructured data

IR is theories, principles and algorithms to find relevant information for a collection of unstructured data - usually text data

Information Retrieval System

IR system:

A user of an Information Retrieval system has some information need which he wants to satisfy by sending it a query
The system returns a list of results ranked by relevance

So,

general goal of IR systems: rank relevant items much higher than non-relevant
to do it, the items must be scored, and it's done with a retrieval function

IR Problem

Problem:

we have a large data collection of unstructured data
user interacts with the collection by sending a query
we find some results, but in what order should they be presented?

The central problem of IR is ranking elements of the collection according to relevance for a user query

Text IR:

Most common usecase for IR systems is finding something in textual data
the goal of textual IR system is to retrieve most relevant documents for the user

Basic formulation:

Given a collection $C = \{d_1, d_2, \ ... \ , d_N \}$ and a query $q \in Q$
IR system given $q$ returns a ranked list of relevant documents from $C$
documents are relevant when they contain information that the user looks for

e.g. they contain the answer to the query

Relevancy

Retrieval function is a scoring function that's used to rank documents

retrieval function is based on a retrieval model
Retrieval Model defines the notion of relevance and makes it possible to rank the documents

Information Retrieval Models:

the model should be able to represent both objects in the collection and the queries
for documents the most popular one is Vector Space Model
but there are other Information Retrieval Models: e.g. Probabilistic Retrieval Model

Given the model we define a retrieval function $s$

$s: Q \times C \to \mathbb R$
it takes a query and and a document form $C$ and returns a rank value: some real number
we can apply $s$ to all documents in $C$ to rank them

Which documents are relevant?

in the Vector Space Model
assumption: relevance of a document $d$ w.r.t. to a query $q$ correlated with similarity between query and document:
So can use some similarity function $\text{similarity}(d, q)$ to find if a document is relevant
so the retrieval function can be a similarity function, and you'll just need to sort all documents in $C$ by their similarity to $q$

Retrieval Process

source: Information Retrieval (UFRT) L01-introduction.pdf

The retrieval consists of several phases:

offline:
- collection preparation
- parse documents
- build Inverted Index: index the documents so they can be retrieved faster
online:
- query preparation (parse the query)
- find relevant documents and rand them according to the relevance score

Plus, the ranking may account user's preferences:

To make the results better
To personalize the output
Also we may want to account for user feedback

Evaluation

How to evaluate an IR system?

use an IR test collection which consists of:
documents
queries
and know relevance of each query for each document

Given a test collection, the quality of an IR system is evaluated with:

Precision and Recall
Precision: % of relevant documents in the result
Recall: % of retrieved relevant documents

Definitions:

If $X \subseteq C$ is the output of the IR system and $Y \subseteq C$ is the list of all relevant documents
then $P = \cfrac{|X \cup Y|}{| X|}$ and $R = \cfrac{|X \cup Y|}{| Y |}$

Document Clustering for IR

Ranked list is not the only way of presenting the retrieval results to the user

the results can also be clustered
so we want to present internal relationships between documents to IR user when outputting the result
Scatter/Gather, a variant of K-Means for documents - first popular IR clustering technique, used for clustering web search results

There are two ways to cluster the results:

search-result clustering (post-retrieval document clustering)
pre-retrieval document clustering

Pre-retrieval

Lingo: use Matrix Decomposition techniques to produce low-dimensional basis for the document space
these base vectors can be interpreted as semantic vectors of the space
find clustering using SVD or Non-Negative Matrix Factorization

Sources

Information Retrieval (UFRT)
Zhai, ChengXiang. "Statistical language models for information retrieval." 2008.
http://datascience.stackexchange.com/questions/1106/how-to-build-a-textual-search-engine

Information Retrieval

Contents

Information Retrieval

Information Retrieval System

IR Problem

Relevancy

Retrieval Process

Evaluation

Document Clustering for IR

Sources