Information Retrieval

Information Retrieval is about quickly finding material in a large collection of unstructured data

  • IR covers the theories, principles, and algorithms for finding relevant information in a collection of unstructured data - usually text


Information Retrieval System

IR system:

  • A user of an Information Retrieval system has some information need, which they want to satisfy by sending the system a query
  • The system returns a list of results ranked by relevance


So,

  • the general goal of IR systems: rank relevant items much higher than non-relevant ones
  • to do that, the items must be scored, which is done with a retrieval function


IR Problem

Problem:

  • we have a large data collection of unstructured data
  • user interacts with the collection by sending a query
  • we find some results, but in what order should they be presented?

The central problem of IR is ranking elements of the collection according to relevance for a user query


Text IR:

  • The most common use case for IR systems is finding something in textual data
  • the goal of a textual IR system is to retrieve the most relevant documents for the user


Basic formulation:

  • Given a collection $C = \{d_1, d_2, \dots, d_N\}$ and a query $q \in Q$
  • given $q$, the IR system returns a ranked list of relevant documents from $C$
  • documents are relevant when they contain the information the user is looking for

e.g. they contain the answer to the query


Relevancy

The retrieval function is a scoring function used to rank documents

  • the retrieval function is based on a retrieval model
  • a Retrieval Model defines the notion of relevance and makes it possible to rank the documents


Information Retrieval Models:


Given the model, we define a retrieval function $s$

  • $s: Q \times C \to \mathbb R$
  • it takes a query and a document from $C$ and returns a rank value: some real number
  • we can apply $s$ to all documents in $C$ to rank them (see the sketch below)
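
A minimal Python sketch of such a retrieval function. The term-overlap scoring rule and the `score`/`rank` names are illustrative assumptions, not a standard; real systems use models like TF-IDF or BM25:

```python
# A toy retrieval function s: Q x C -> R. The term-overlap scoring
# rule is illustrative; real systems use models like TF-IDF or BM25.

def score(query: str, document: str) -> float:
    """Score a document by the number of query terms it contains."""
    return float(len(set(query.lower().split()) & set(document.lower().split())))


def rank(query: str, collection: list[str]) -> list[tuple[float, str]]:
    """Apply s to every document in the collection, highest score first."""
    return sorted(((score(query, d), d) for d in collection),
                  key=lambda pair: pair[0], reverse=True)


collection = [
    "information retrieval ranks documents",
    "clustering groups similar documents",
    "retrieval models define the notion of relevance",
]
print(rank("retrieval models", collection))
```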


Which documents are relevant?

  • in the Vector Space Model:
  • assumption: the relevance of a document $d$ w.r.t. a query $q$ is correlated with the similarity between the query and the document
  • so we can use some similarity function $\text{similarity}(d, q)$ to decide whether a document is relevant
  • thus the retrieval function can be a similarity function: we just sort all documents in $C$ by their similarity to $q$ (see the sketch below)
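
A sketch of such a similarity function, assuming raw term-frequency vectors compared with cosine similarity (real systems usually weight terms with TF-IDF instead of raw counts):

```python
import math
from collections import Counter

# Vector Space Model sketch: represent document and query as
# term-frequency vectors and compare them with cosine similarity.
# Real systems usually weight terms with TF-IDF instead of raw counts.

def similarity(d: str, q: str) -> float:
    dv, qv = Counter(d.lower().split()), Counter(q.lower().split())
    dot = sum(dv[t] * qv[t] for t in dv.keys() & qv.keys())
    norm_d = math.sqrt(sum(v * v for v in dv.values()))
    norm_q = math.sqrt(sum(v * v for v in qv.values()))
    return dot / (norm_d * norm_q) if norm_d and norm_q else 0.0

print(similarity("retrieval models define relevance", "retrieval models"))
```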



Retrieval Process



The retrieval process consists of several phases:

  • offline:
    • collection preparation
    • parse documents
    • build an Inverted Index: index the documents so they can be retrieved faster (a minimal sketch follows this list)
  • online:
    • query preparation (parse the query)
    • find relevant documents and rank them according to their relevance scores
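
A minimal sketch of the offline indexing step and its use online, assuming a simple whitespace tokenizer (real systems add normalization, stemming, and frequency or positional information to the postings):

```python
from collections import defaultdict

# Offline step sketch: build an inverted index mapping each term to the
# ids of the documents containing it, so the online step can fetch
# candidate documents without scanning the whole collection.

def build_inverted_index(collection: list[str]) -> dict[str, set[int]]:
    index: dict[str, set[int]] = defaultdict(set)
    for doc_id, document in enumerate(collection):
        for term in document.lower().split():
            index[term].add(doc_id)
    return index

def candidates(index: dict[str, set[int]], query: str) -> set[int]:
    """Online step: union of the postings for the query terms."""
    found: set[int] = set()
    for term in query.lower().split():
        found |= index.get(term, set())
    return found

index = build_inverted_index(["retrieval models", "document clustering"])
print(candidates(index, "retrieval"))  # {0}
```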


Additionally, the ranking may take the user's preferences into account:

  • To make the results better
  • To personalize the output
  • We may also want to account for user feedback


Evaluation

How to evaluate an IR system?

  • use an IR test collection (a toy example follows this list), which consists of:
    • documents
    • queries
    • the known relevance of each document to each query
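
A toy test collection sketched as Python data; the layout is an assumption for illustration (TREC-style collections store similar judgments in "qrels" files):

```python
# A toy test collection. The structure is illustrative only.

documents = {
    "d1": "information retrieval ranks documents by relevance",
    "d2": "clustering groups similar documents together",
    "d3": "retrieval models define the notion of relevance",
}

queries = {"q1": "retrieval relevance"}

# Known relevance judgments: query id -> ids of the relevant documents.
qrels = {"q1": {"d1", "d3"}}
```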


Given a test collection, the quality of an IR system is evaluated with:

  • Precision and Recall
  • Precision: the percentage of retrieved documents that are relevant
  • Recall: the percentage of relevant documents that were retrieved

Definitions:

  • If $X \subseteq C$ is the output of the IR system and $Y \subseteq C$ is the set of all relevant documents,
  • then $P = \cfrac{|X \cap Y|}{|X|}$ and $R = \cfrac{|X \cap Y|}{|Y|}$ (computed in the sketch below)
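
A small sketch computing both measures directly from the set definitions above (the document ids are made up):

```python
# Precision and recall from the set definitions:
# X = retrieved documents, Y = relevant documents.

def precision_recall(X: set[str], Y: set[str]) -> tuple[float, float]:
    hits = len(X & Y)  # |X intersect Y|
    precision = hits / len(X) if X else 0.0
    recall = hits / len(Y) if Y else 0.0
    return precision, recall

retrieved = {"d1", "d2"}  # system output X (made-up ids)
relevant = {"d1", "d3"}   # known relevant set Y
print(precision_recall(retrieved, relevant))  # (0.5, 0.5)
```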


Document Clustering for IR

A ranked list is not the only way of presenting retrieval results to the user

  • the results can also be clustered
  • clustering lets us show the internal relationships between documents when presenting the results
  • Scatter/Gather, a variant of K-Means for documents, was the first popular IR clustering technique, used for clustering web search results


There are two ways to cluster the results (a post-retrieval sketch follows this list):

  • search-result clustering (post-retrieval document clustering)
  • pre-retrieval document clustering
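
A sketch of search-result (post-retrieval) clustering using scikit-learn's TF-IDF vectorizer and K-Means; this illustrates the idea but is not the original Scatter/Gather algorithm:

```python
# Post-retrieval clustering sketch: vectorize the retrieved documents
# with TF-IDF and group them with K-Means.

from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer

results = [
    "information retrieval ranks documents",
    "retrieval models define relevance",
    "clustering groups similar documents",
    "k-means clusters search results",
]

vectors = TfidfVectorizer().fit_transform(results)
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(vectors)

for label, doc in zip(labels, results):
    print(label, doc)
```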


Pre-retrieval

