Inverted Index
In Databases, indexing is needed to speed up queries
- want to avoid full table scan
- same is true for Information Retrieval and other Text Mining/NLP tasks
- Inverted index is a way of achieving this, and it can be generalized to other forms of input, not just text
For IR
- index is a partial representation of a document that contains the most important information about the document
- usually want to find terms to index automatically
Inverted Index for Similarity Search
General Idea
Idea:
- usually a document contains only a small portion of terms
- so document vectors are very sparse
- typical distance is cosine similarity - it ignores zeros. for cosine to be non-zero, two docs need to share at least one term
- $D^T$ is the inverted index of the term-document matrix $D$
This, to find docs similar to $d$:
- for each $w_i \in d$
- let $D_i = \text{index}[w_i] - d$ be a set of documents that contain $w_i$ (except for $d$ itself)
- then take the union of all $D_i$
- calculate similarity only with documents from this union
Can be used in Document Clustering to speed up similarity computation
Implementation
Posting List
Build a dictionary: a "posting" list
- for each word we store ids of documents that have this word
- document are sorted by ids
-
- source of picture: [1]
- sorting - because it's easier to take union: just merge the posting list
Sources
- Information Retrieval (UFRT)
- Ertöz, Levent, Michael Steinbach, and Vipin Kumar. "Finding clusters of different sizes, shapes, and densities in noisy, high dimensional data." 2003. [2]