Information Retrieval
Information Retrieval is about quickly finding materials in a large collection of unstructured data
- IR is theories, principles and algorithms to find relevant information for a collection of unstructured data - usually text data
Information Retrieval System
IR system:
- A user of an Information Retrieval system has some information need which he wants to satisfy by sending it a query
- The system returns a list of results ranked by relevance
-

So,
- general goal of IR systems: rank relevant items much higher than non-relevant
- to do it, the items must be scored, and it's done with a retrieval function
IR Problem
Problem:
- we have a large data collection of unstructured data
- user interacts with the collection by sending a query
- we find some results, but in what order should they be presented?
The central problem of IR is ranking elements of the collection according to relevance for a user query
Text IR:
- Most common usecase for IR systems is finding something in textual data
- the goal of textual IR system is to retrieve most relevant documents for the user
Basic formulation:
- Given a collection C={d1,d2, ... ,dN} and a query q∈Q
- IR system given q returns a ranked list of relevant documents from C
- documents are relevant when they contain information that the user looks for
e.g. they contain the answer to the query
Relevancy
Retrieval function is a scoring function that's used to rank documents
- retrieval function is based on a retrieval model
- Retrieval Model defines the notion of relevance and makes it possible to rank the documents
Information Retrieval Models:
Given the model we define a retrieval function s
- s:Q×C→R
- it takes a query and and a document form C and returns a rank value: some real number
- we can apply s to all documents in C to rank them
Which documents are relevant?
- in the Vector Space Model
- assumption: relevance of a document d w.r.t. to a query q correlated with similarity between query and document:
- So can use some similarity function similarity(d,q) to find if a document is relevant
- so the retrieval function can be a similarity function, and you'll just need to sort all documents in C by their similarity to q
Retrieval Process
The retrieval consists of several phases:
- offline:
- collection preparation
- parse documents
- build Inverted Index: index the documents so they can be retrieved faster
- online:
- query preparation (parse the query)
- find relevant documents and rand them according to the relevance score
Plus, the ranking may account user's preferences:
- To make the results better
- To personalize the output
- Also we may want to account for user feedback
Evaluation
How to evaluate an IR system?
- use an IR test collection which consists of:
- documents
- queries
- and know relevance of each query for each document
Given a test collection, the quality of an IR system is evaluated with:
- Precision and Recall
- Precision: % of relevant documents in the result
- Recall: % of retrieved relevant documents
Definitions:
- If X⊆C is the output of the IR system and Y⊆C is the list of all relevant documents
- then P=|X∪Y||X| and R=|X∪Y||Y|
Ranked list is not the only way of presenting the retrieval results to the user
- the results can also be clustered
- so we want to present internal relationships between documents to IR user when outputting the result
- Scatter/Gather, a variant of K-Means for documents - first popular IR clustering technique, used for clustering web search results
There are two ways to cluster the results:
- search-result clustering (post-retrieval document clustering)
- pre-retrieval document clustering
Pre-retrieval
Sources