# ML Wiki

## Inverted Index

### Indexing for Information Retrieval

In Databases, indexing is needed to speed up queries

• want to avoid full table scan
• same is true for Information Retrieval and other Text Mining/NLP tasks
• Inverted index is a way of achieving this, and it can be generalized to other forms of input, not just text

For IR

• index is a partial representation of a document that contains the most important information about the document
• usually want to find terms to index automatically

## Inverted Index for Similarity Search

### General Idea

Idea:

• usually a document contains only a small portion of terms
• so document vectors are very sparse
• typical distance is cosine similarity - it ignores zeros. for cosine to be non-zero, two docs need to share at least one term
• $D^T$ is the inverted index of the term-document matrix $D$

This, to find docs similar to $d$:

• for each $w_i \in d$
• let $D_i = \text{index}[w_i] - d$ be a set of documents that contain $w_i$ (except for $d$ itself)
• then take the union of all $D_i$
• calculate similarity only with documents from this union

Can be used in Document Clustering to speed up similarity computation

## Implementation

### Posting List

Build a dictionary: a "posting" list

• for each word we store ids of documents that have this word
• document are sorted by ids
• source of picture: [1]
• sorting - because it's easier to take union: just merge the posting list

## Pairwise Similarity in Document Collection

### Relation Database

Suppose we have a table Documents(doc_id, word, weight)

• we index it on word: an have an inverted index
• then document-document similarity would be a self-join of Document with itself

### MapReduce

How do we use it to efficiently compute pair-wise similarity between each document

• given two documents $\mathbf d_1$, $\mathbf d_2$:
• $\text{sim}(\mathbf d_1, \mathbf d_2) = \sum_{t \in V} = w_{t, \mathbf d_i} \cdot w_{t, \mathbf d_j} = \sum_{t \in \mathbf d_i \cup \mathbf d_j} = w_{t, \mathbf d_i} \cdot w_{t, \mathbf d_j}$
• so we need to take into account only terms that both documents share

If we compute similarity for all documents:

• if a term $t$ appears only in documents $\mathbf x, \mathbf y, \mathbf z$, then it contributes only to the similarity scores between $(\mathbf x, \mathbf y), (\mathbf x, \mathbf z)$ and $(\mathbf y, \mathbf z)$

Algorithm:

• let $\text{posting}(t)$ be a function that returns all documents that contain $t$
• set $\text{sim}[i, j] = 0$ be the similarity matrix
• for $t \in V$ do:
• $p_t = \text{posting}(t)$
• for all pairs $(\mathbf d_i, \mathbf d_j) \in (p_t \times p_t)$ (s.t. $i > j$)
• $\text{sim}[i, j] = \text{sim}[i, j] + w_{t, \mathbf d_i} \cdot w_{t, \mathbf d_j}$

It can easily be implemented in MapReduce

• first MR job: build the index
• second MR job: compute pair-wise similarity

## Sources

• Information Retrieval (UFRT)
• Ertöz, Levent, Michael Steinbach, and Vipin Kumar. "Finding clusters of different sizes, shapes, and densities in noisy, high dimensional data." 2003. [2]
• Elsayed, Tamer, Jimmy Lin, and Douglas W. Oard. "Pairwise document similarity in large collections with MapReduce." 2008. [[3]]