* source of picture: [http://slidewiki.org/print/deck/339]
* sorting - because it's easier to take the union: just merge the posting lists


== Pairwise Similarity in Document Collection ==

=== [[Relation Database]] ===
Suppose we have a table Documents(doc_id, <u>word</u>, weight)
* we index it on word and get an inverted index
* then document-document similarity is a self-join of Documents with itself
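The self-join idea can be sketched with SQLite; the table follows the Documents(doc_id, word, weight) schema above, and the document contents are toy values assumed for illustration:

```python
import sqlite3

# In-memory database with the Documents(doc_id, word, weight) schema
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE Documents (doc_id INTEGER, word TEXT, weight REAL)")
con.execute("CREATE INDEX idx_word ON Documents (word)")  # the inverted index on word

# Toy data (assumed): three small documents
rows = [
    (1, "apple", 0.5), (1, "banana", 0.3),
    (2, "apple", 0.4), (2, "cherry", 0.7),
    (3, "banana", 0.6),
]
con.executemany("INSERT INTO Documents VALUES (?, ?, ?)", rows)

# Document-document similarity as a self-join on the shared word:
# each shared term contributes the product of its weights in both documents
query = """
SELECT a.doc_id, b.doc_id, SUM(a.weight * b.weight) AS sim
FROM Documents a
JOIN Documents b ON a.word = b.word AND a.doc_id < b.doc_id
GROUP BY a.doc_id, b.doc_id
"""
for d1, d2, sim in con.execute(query):
    print(d1, d2, round(sim, 2))
```

Only pairs that share at least one word appear in the join output; documents 2 and 3 share no terms, so no row is produced for them.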

=== [[MapReduce]] ===
How do we use it to efficiently compute the pair-wise similarity between each pair of documents?
* given two documents $\mathbf d_i$, $\mathbf d_j$:
* $\text{sim}(\mathbf d_i, \mathbf d_j) = \sum_{t \in V} w_{t, \mathbf d_i} \cdot w_{t, \mathbf d_j} = \sum_{t \in \mathbf d_i \cap \mathbf d_j} w_{t, \mathbf d_i} \cdot w_{t, \mathbf d_j}$
* so we need to take into account only the terms that both documents share

If we compute similarity for all documents:
* if a term $t$ appears only in documents $\mathbf x, \mathbf y, \mathbf z$, then it contributes only to the similarity scores between $(\mathbf x, \mathbf y), (\mathbf x, \mathbf z)$ and $(\mathbf y, \mathbf z)$

Algorithm:
* let $\text{posting}(t)$ be a function that returns all documents that contain $t$
* initialize the similarity matrix: $\text{sim}[i, j] = 0$ for all $i, j$
* for $t \in V$ do:
** $p_t = \text{posting}(t)$
** for all pairs $(\mathbf d_i, \mathbf d_j) \in (p_t \times p_t)$ s.t. $i > j$:
*** $\text{sim}[i, j] = \text{sim}[i, j] + w_{t, \mathbf d_i} \cdot w_{t, \mathbf d_j}$
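The loop above can be sketched directly in Python; the posting lists and weights here are toy values assumed for illustration:

```python
from collections import defaultdict
from itertools import combinations

# Toy inverted index (assumed): term -> posting list of (doc_id, weight),
# posting lists sorted by doc_id
postings = {
    "apple":  [(1, 0.5), (2, 0.4)],
    "banana": [(1, 0.3), (3, 0.6)],
    "cherry": [(2, 0.7)],
}

sim = defaultdict(float)  # sim[(i, j)] = 0 initially

for t, p_t in postings.items():
    # every pair of documents in posting(t) shares term t,
    # so t contributes w_{t,d_i} * w_{t,d_j} to each such pair
    for (i, w_i), (j, w_j) in combinations(p_t, 2):
        sim[(i, j)] += w_i * w_j

print(dict(sim))
```

Because each posting list is sorted by doc_id, `combinations` enumerates each unordered pair exactly once; pairs that share no term never appear in `sim`.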

It can easily be implemented in [[MapReduce]]:
* first MR job: build the inverted index
* second MR job: compute the pair-wise similarity
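The two jobs can be simulated in plain Python with a small group-by-key "shuffle"; the document collection here is toy data assumed for illustration (the real system in the Elsayed et al. paper runs on Hadoop):

```python
from collections import defaultdict
from itertools import combinations

def shuffle(pairs):
    """Group (key, value) pairs by key, like the MapReduce shuffle phase."""
    groups = defaultdict(list)
    for k, v in pairs:
        groups[k].append(v)
    return groups

# Toy collection (assumed): doc_id -> {term: weight}
docs = {1: {"apple": 0.5, "banana": 0.3},
        2: {"apple": 0.4, "cherry": 0.7},
        3: {"banana": 0.6}}

# Job 1: build the inverted index
#   map:    (doc_id, terms)  ->  (term, (doc_id, weight))
#   reduce: (term, [(doc_id, weight), ...])  ->  sorted posting list
mapped = [(t, (d, w)) for d, terms in docs.items() for t, w in terms.items()]
index = {t: sorted(vals) for t, vals in shuffle(mapped).items()}

# Job 2: compute pair-wise similarity
#   map:    (term, posting list)  ->  ((i, j), w_i * w_j) per document pair
#   reduce: ((i, j), [partial products])  ->  their sum
partials = [((i, wi), (j, wj))
            for plist in index.values()
            for (i, wi), (j, wj) in combinations(plist, 2)]
partials = [((i, j), wi * wj) for (i, wi), (j, wj) in partials]
sim = {pair: sum(ps) for pair, ps in shuffle(partials).items()}

print(sim)
```

The second job's mapper emits one partial product per term a pair shares, and the reducer sums them, which matches the term-at-a-time accumulation of the algorithm above.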
* [[Information Retrieval (UFRT)]]
* Ertöz, Levent, Michael Steinbach, and Vipin Kumar. "Finding clusters of different sizes, shapes, and densities in noisy, high dimensional data." 2003. [http://static.msi.umn.edu/rreports/2003/73.pdf]
* Elsayed, Tamer, Jimmy Lin, and Douglas W. Oard. "Pairwise document similarity in large collections with MapReduce." 2008. [http://www.ece.umd.edu/~oard/pdf/acl08elsayed2.pdf]
[[Category:Information Retrieval]]
[[Category:Database Indexes]]