Latest revision as of 23:59, 26 April 2017

Inverted Index

Indexing for Information Retrieval

In Databases, indexing is needed to speed up queries

  • want to avoid full table scan
  • same is true for Information Retrieval and other Text Mining/NLP tasks
  • Inverted index is a way of achieving this, and it can be generalized to other forms of input, not just text


For IR

  • index is a partial representation of a document that contains the most important information about the document
  • usually want to find terms to index automatically



Inverted Index for Similarity Search

General Idea

Idea:

  • usually a document contains only a small portion of terms
  • so document vectors are very sparse
  • the typical measure is cosine similarity, which ignores zeros: for the cosine to be non-zero, two documents need to share at least one term
  • $D^T$ is the inverted index of the term-document matrix $D$


Thus, to find documents similar to $d$:

  • for each $w_i \in d$
    • let $D_i = \text{index}[w_i] - d$ be a set of documents that contain $w_i$ (except for $d$ itself)
  • then take the union of all $D_i$
  • calculate similarity only with documents from this union
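A minimal sketch of this candidate-generation step in Python (the toy documents and weights are made up for illustration):

```python
from collections import defaultdict
import math

# toy collection of sparse document vectors: doc_id -> {term: weight}
docs = {
    "d1": {"apple": 1.0, "fruit": 0.5},
    "d2": {"apple": 0.8, "pie": 1.2},
    "d3": {"car": 1.0, "road": 0.7},
}

# build the inverted index: term -> set of doc_ids containing it
index = defaultdict(set)
for doc_id, vec in docs.items():
    for term in vec:
        index[term].add(doc_id)

def cosine(u, v):
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm = math.sqrt(sum(w * w for w in u.values())) \
         * math.sqrt(sum(w * w for w in v.values()))
    return dot / norm if norm else 0.0

def similar_docs(doc_id):
    """Compare only against the union of postings of the query's terms."""
    candidates = set()
    for term in docs[doc_id]:
        candidates |= index[term] - {doc_id}
    return {c: cosine(docs[doc_id], docs[c]) for c in candidates}

print(similar_docs("d1"))  # only d2 shares a term with d1; d3 is never compared
```

Note that d3 is never touched: it shares no term with d1, so it does not appear in any of the posting lists we scan.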


Can be used in Document Clustering to speed up similarity computation


Implementation

Posting List

Build a dictionary: a "posting" list

  • for each word we store ids of documents that have this word
  • documents are sorted by id
  • source of picture: http://slidewiki.org/print/deck/339
  • sorted because it makes taking the union easy: we just merge the posting lists
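Because postings are sorted, the union of two lists is a single linear merge. A small sketch (the list contents are illustrative):

```python
def merge_postings(p1, p2):
    """Union of two sorted posting lists in O(len(p1) + len(p2))."""
    i, j, out = 0, 0, []
    while i < len(p1) and j < len(p2):
        if p1[i] == p2[j]:        # doc in both lists: emit once
            out.append(p1[i]); i += 1; j += 1
        elif p1[i] < p2[j]:
            out.append(p1[i]); i += 1
        else:
            out.append(p2[j]); j += 1
    out.extend(p1[i:])            # one list is exhausted;
    out.extend(p2[j:])            # append the tail of the other
    return out

print(merge_postings([1, 3, 5, 9], [2, 3, 9, 10]))  # [1, 2, 3, 5, 9, 10]
```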


Pairwise Similarity in Document Collection

Relation Database

Suppose we have a table Documents(doc_id, word, weight)

  • we index it on word and thus have an inverted index
  • then document-document similarity would be a self-join of Document with itself
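A sketch of this self-join using SQLite from Python (the table contents are made up; the schema follows Documents(doc_id, word, weight)):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Documents (doc_id TEXT, word TEXT, weight REAL)")
conn.execute("CREATE INDEX idx_word ON Documents (word)")  # the inverted index
conn.executemany(
    "INSERT INTO Documents VALUES (?, ?, ?)",
    [("d1", "apple", 1.0), ("d1", "fruit", 0.5),
     ("d2", "apple", 0.8), ("d2", "pie", 1.2),
     ("d3", "car", 1.0)],
)

# document-document similarity as a self-join on word:
# each shared word contributes the product of its weights
rows = conn.execute("""
    SELECT a.doc_id, b.doc_id, SUM(a.weight * b.weight) AS sim
    FROM Documents a JOIN Documents b
      ON a.word = b.word AND a.doc_id < b.doc_id
    GROUP BY a.doc_id, b.doc_id
""").fetchall()
print(rows)  # [('d1', 'd2', 0.8)] -- only d1 and d2 share a word
```

The condition `a.doc_id < b.doc_id` keeps one row per unordered pair and skips self-pairs.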


MapReduce

How do we use it to efficiently compute pair-wise similarity between each pair of documents?

  • given two documents $\mathbf d_i$, $\mathbf d_j$:
  • $\text{sim}(\mathbf d_i, \mathbf d_j) = \sum_{t \in V} w_{t, \mathbf d_i} \cdot w_{t, \mathbf d_j} = \sum_{t \in \mathbf d_i \cap \mathbf d_j} w_{t, \mathbf d_i} \cdot w_{t, \mathbf d_j}$
  • so we only need to take into account the terms that both documents share
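A quick numeric check that the sum over the full vocabulary equals the sum over shared terms only (weight vectors and vocabulary are made up):

```python
d_i = {"apple": 1.0, "fruit": 0.5}            # sparse weights, zero elsewhere
d_j = {"apple": 0.8, "pie": 1.2}
V = {"apple", "fruit", "pie", "car", "road"}  # full vocabulary

sum_over_V = sum(d_i.get(t, 0.0) * d_j.get(t, 0.0) for t in V)
shared = set(d_i) & set(d_j)                  # terms both documents contain
sum_over_shared = sum(d_i[t] * d_j[t] for t in shared)

assert sum_over_V == sum_over_shared  # terms missing from either doc add zero
```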


If we compute similarity for all documents:

  • if a term $t$ appears only in documents $\mathbf x, \mathbf y, \mathbf z$, then it contributes only to the similarity scores between $(\mathbf x, \mathbf y), (\mathbf x, \mathbf z)$ and $(\mathbf y, \mathbf z)$


Algorithm:

  • let $\text{posting}(t)$ be a function that returns all documents that contain $t$
  • set $\text{sim}[i, j] = 0$ be the similarity matrix
  • for $t \in V$ do:
    • $p_t = \text{posting}(t)$
    • for all pairs $(\mathbf d_i, \mathbf d_j) \in (p_t \times p_t)$ s.t. $i > j$:
      • $\text{sim}[i, j] = \text{sim}[i, j] + w_{t, \mathbf d_i} \cdot w_{t, \mathbf d_j}$
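The loop above can be sketched in Python; the similarity matrix is kept as a sparse dict since most pairs never share a term (the postings and weights below are illustrative):

```python
from collections import defaultdict

# weights[t][d] = w_{t,d}; posting(t) is just the keys of weights[t]
weights = {
    "apple": {"d1": 1.0, "d2": 0.8},
    "fruit": {"d1": 0.5},
    "pie":   {"d2": 1.2, "d3": 0.3},
}

sim = defaultdict(float)            # sparse similarity "matrix"
for t, w_t in weights.items():      # for t in V
    p_t = sorted(w_t)               # posting(t)
    for a in range(len(p_t)):       # all pairs (d_i, d_j) from p_t x p_t
        for b in range(a + 1, len(p_t)):
            d_i, d_j = p_t[a], p_t[b]
            sim[d_i, d_j] += w_t[d_i] * w_t[d_j]

print(dict(sim))  # each unordered pair sharing at least one term gets a score
```

Terms appearing in a single document (like "fruit" here) contribute nothing, matching the observation that a term only affects pairs within its posting list.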


It can easily be implemented in MapReduce

  • first MR job: build the index
  • second MR job: compute pair-wise similarity
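The two jobs can be sketched in plain Python; the tiny in-memory `run_mr` driver and the toy documents are assumptions for illustration, not a real MapReduce framework:

```python
from collections import defaultdict
from itertools import combinations

def run_mr(records, mapper, reducer):
    """Tiny in-memory stand-in for one MapReduce job."""
    groups = defaultdict(list)
    for key, value in records:          # map phase
        for k, v in mapper(key, value):
            groups[k].append(v)
    out = []
    for k in groups:                    # shuffle + reduce phase
        out.extend(reducer(k, groups[k]))
    return out

docs = [("d1", {"apple": 1.0, "fruit": 0.5}),
        ("d2", {"apple": 0.8, "pie": 1.2})]

# Job 1: build the inverted index: term -> [(doc_id, weight), ...]
def map1(doc_id, vec):
    for term, w in vec.items():
        yield term, (doc_id, w)

def reduce1(term, postings):
    yield term, postings

index = run_mr(docs, map1, reduce1)

# Job 2: emit a partial product per document pair, sum per pair
def map2(term, postings):
    for (di, wi), (dj, wj) in combinations(sorted(postings), 2):
        yield (di, dj), wi * wj

def reduce2(pair, products):
    yield pair, sum(products)

sims = dict(run_mr(index, map2, reduce2))
print(sims)
```

The key point is that job 2 groups partial products by document pair, so the reducer's sum is exactly the inner loop of the algorithm above.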


Sources

  • Information Retrieval (UFRT)
  • Ertöz, Levent, Michael Steinbach, and Vipin Kumar. "Finding clusters of different sizes, shapes, and densities in noisy, high dimensional data." 2003. [http://static.msi.umn.edu/rreports/2003/73.pdf]
  • Elsayed, Tamer, Jimmy Lin, and Douglas W. Oard. "Pairwise document similarity in large collections with MapReduce." 2008. [http://www.ece.umd.edu/~oard/pdf/acl08elsayed2.pdf]