Locality Sensitive Hashing

In large databases it’s not possible to use brute force search: there’s too much data

  • one way of speeding up search is using Indexing: in particular, the most interesting indexes here are Multi-Dimensional Indexes
  • but many of these “classical” indexing schemes don’t work for high-dimensional data
  • Locality-Sensitive Hashing (LSH) algorithms address this problem:
  • they form a family of probabilistic/approximate indexing techniques that return the true nearest neighbors correctly most of the time

LSH algorithms:

  • used for quick search of similar entries in large DBs
  • used for Nearest Neighbor (1-NN) queries as well as KNN queries
  • these algorithms are randomized: they don’t guarantee the exact answer, but instead find the correct answer or a close approximation with high probability

  • a good similarity/distance function should rank relevant answers much closer to the query than irrelevant ones
  • if that is the case, an approximate answer should also give a good result
  • Outcome: by allowing a small error and some storage overhead, the query time can be improved considerably

Basic idea

  • hash points s.t. probability of collision is high for similar objects

Applications

  • Finding duplicates and near-duplicates
  • Audio, video, image search
  • Pattern Classification
  • Cluster Analysis

Problem Definition

Nearest Neighbor Search Problem

Notation:

  • let $\| \mathbf x \|_p = \left( \sum_i |x_i|^p \right)^{1/p}$ be the $\ell_p$ norm
  • let $d_p(\mathbf p, \mathbf q) = \| \mathbf p - \mathbf q \|_p$
  • and let $d_H(\mathbf p, \mathbf q)$ be the Hamming Distance: the number of bits in which $\mathbf p$ and $\mathbf q$ differ (see the sketch below)
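
A minimal sketch of these distance functions in Python (the array/bit-string representations are an assumption made for illustration):

```python
import numpy as np

def lp_distance(p: np.ndarray, q: np.ndarray, ord: float = 2) -> float:
    """d_p(p, q) = ||p - q||_p."""
    return float(np.linalg.norm(p - q, ord=ord))

def hamming_distance(p: str, q: str) -> int:
    """d_H(p, q): number of bit positions in which p and q differ."""
    assert len(p) == len(q)
    return sum(a != b for a, b in zip(p, q))

print(lp_distance(np.array([1.0, 2.0]), np.array([4.0, 6.0])))  # 5.0 (Euclidean)
print(hamming_distance("10110", "10011"))                       # 2
```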

NN Search problem:

  • Given a set $P$ of $n$ objects, preprocess $P$ s.t. it’s efficient to find the $\mathbf p \in P$ closest to a query object $\mathbf q$
  • KNN: return the top $k$ closest objects $\{ \mathbf p_1, \dots, \mathbf p_k \}$, $\mathbf p_i \in P$

$\varepsilon$-NNS problem:

  • approximate version of NNS
  • find $\mathbf p \in P$ s.t. $d(\mathbf p, \mathbf q) \leqslant (1 + \varepsilon) \cdot d(\mathbf p^*, \mathbf q)$, where $\mathbf p^*$ is the true NN of the query object $\mathbf q$
  • can generalize to KNN as well:
  • find $\mathbf p_1, \dots, \mathbf p_k \in P$ s.t. the distance $d(\mathbf p_i, \mathbf q)$ is at most $(1 + \varepsilon)$ times the distance to the true $i$th closest point $\mathbf p^*_i$

Most Similar Item Problem

  • Given a query point $q$ and a database $D$
  • find $q' \in D$ s.t. $q'$ is the most similar object to $q$

Brute force solution:

  • try each object in the database and return the closest one
  • the complexity of executing this query grows linearly with $N = |D|$, the number of items in the database (see the sketch below)
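
A minimal brute-force sketch in Python (assuming Euclidean distance and a dense array database; all names here are illustrative):

```python
import numpy as np

def brute_force_nn(query: np.ndarray, database: np.ndarray) -> int:
    """Linear scan: compute the distance to every item, return the index of the closest.
    database has shape (N, d), so the cost grows linearly with N."""
    distances = np.linalg.norm(database - query, axis=1)
    return int(np.argmin(distances))

rng = np.random.default_rng(0)
db = rng.normal(size=(1000, 64))   # N = 1000 items in 64 dimensions
q = rng.normal(size=64)
print(brute_force_nn(q, db))
```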

Trees

  • e.g. kd-Trees, Quad Trees, R-Trees
  • the query complexity is $O(\log N)$
  • problem: when the dimensionality is high, they break down and we end up testing almost all the nodes, which brings the complexity back to $O(N)$

Hashes

  • A hash table is a structure that maps a symbol to a value
  • a well-designed conventional hash function separates even close symbols into different buckets of the hash table
  • so hashes are good for exact matches, but here we need nearest-neighbor matches

Cryptographic hashes like MD5 or SHA1:

  • change one bit - get a completely different hash
  • but we want the same hash for close objects
  • Solution: Use LSH

LSH Problem

  • given $q$, find the closest $q' \in D$
  • with probability $(1 - \delta)$ we return the true NN

Basic idea:

  • use LSH: a special hash function that puts points that are close together into the same bucket
  • if two points are close together in high-dimensional space, then they should remain close together after some projection to a lower-dimensional space

Illustration:

  • suppose we have a 3D sphere with points on it
  • if we project the sphere on 2D, then close points should still remain close - no matter how we rotate the sphere
  • if points are far apart, usually in the projection they are also far apart, but sometimes they become close
  • (Figure: points on a 3D sphere projected onto 2D; source: Fig. 1 of Slaney2008)

Formalization:

  • let $d(\cdot, \cdot)$ be a distance function on elements of $P$
  • for each $\mathbf p \in P$, let $B(\mathbf p, r)$ be the set of elements within distance $r$ of $\mathbf p$

A family of hash functions $\mathcal H$ is $(r_1, r_2, p_1, p_2)$-sensitive if

  • for all $\mathbf p, \mathbf q$
  • if $\mathbf p \in B(\mathbf q, r_1)$ then $P_{\mathcal H} \Big[ h(\mathbf q) = h(\mathbf p) \Big] \geqslant p_1$
  • if $\mathbf p \not \in B(\mathbf q, r_2)$ then $P_{\mathcal H} \Big[ h(\mathbf q) = h(\mathbf p) \Big] \leqslant p_2$
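
A standard worked example of this definition (bit sampling over $d$-bit vectors, as in Gionis99): take $h_i(\mathbf p) = p_i$, the $i$-th bit of $\mathbf p$, with $i$ drawn uniformly at random; then

  • $P_{\mathcal H} \Big[ h(\mathbf q) = h(\mathbf p) \Big] = 1 - \dfrac{d_H(\mathbf p, \mathbf q)}{d}$
  • so for any $r_1 < r_2$ this family is $\left( r_1, r_2, 1 - \frac{r_1}{d}, 1 - \frac{r_2}{d} \right)$-sensitive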

LSH:

  • An LSH family is useful when $p_1 > p_2$ and $r_1 < r_2$
  • The algorithm solves the $(r, \varepsilon)$-NN problem if:
  • whenever there exists a $\mathbf p^* \in B(\mathbf q, r_1)$, then $g_j(\mathbf p^*) = g_j(\mathbf q)$ for at least one $j$

Common to almost all LSH families. Outline:

  • input space $P$
  • define a family $\mathcal H = \{ h : P \to \mathbb R \}$ of functions that map an original data point to some real value: $h(\mathbf v) \in \mathbb R$
  • next define a family of hash functions $\mathcal G = \{ g : P \to \mathbb R^k \}$ s.t. each $g$ combines $k$ different functions $h_i \in \mathcal H$:
  • $g(\mathbf v) = [ h_1(\mathbf v), h_2(\mathbf v), \dots, h_k(\mathbf v) ] \in \mathbb R^k$
  • then take $L$ such $g$ functions from $\mathcal G$: $g_1, g_2, \dots, g_L$ (see the sketch below)
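
A minimal sketch of this construction in Python; as a concrete base family $\mathcal H$ it uses sign-of-random-projection hashes (the Random Binary Projection family listed below), which is just one possible choice:

```python
import numpy as np

def make_h(dim: int, rng: np.random.Generator):
    """Draw one base hash h from H: here, the sign of a random projection."""
    a = rng.normal(size=dim)                  # random hyperplane normal
    return lambda v: int(np.dot(a, v) >= 0)   # h(v) in {0, 1}

def make_g(dim: int, k: int, rng: np.random.Generator):
    """Draw one g from G: a concatenation of k independent h functions."""
    hs = [make_h(dim, rng) for _ in range(k)]
    return lambda v: tuple(h(v) for h in hs)  # g(v) is a k-tuple used as a bucket id

rng = np.random.default_rng(0)
L, k, dim = 5, 8, 64
gs = [make_g(dim, k, rng) for _ in range(L)]  # g_1, ..., g_L

v = rng.normal(size=dim)
print(gs[0](v))   # e.g. (1, 0, 1, 1, 0, 0, 1, 0)
```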

Preprocessing:

  • store each $\mathbf v \in P$ in bucket $g_j(\mathbf v)$ for all $j = 1, \dots, L$
  • (can do second hashing to put $g_j(\mathbf v) \in \mathbb R^k$ to a usual Hash Table)

Querying:

  • given query $\mathbf q$
  • search all buckets $g_1(\mathbf q), g_2(\mathbf q), \dots, g_L(\mathbf q)$
  • optional: if number of possible candidates is too large - interrupt after some time, e.g. after first $3L$ items (including duplicates)
  • then use linear search over the retrieved candidates to find the $k$ NNs (see the sketch below)
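
A minimal end-to-end sketch of the preprocessing and querying steps in Python; the sign-of-random-projection base hashes and the $3L$ early-exit cutoff follow the outline above, everything else (parameter values, class and function names) is an illustrative assumption:

```python
from collections import defaultdict
import numpy as np

def make_g(dim, k, rng):
    """One LSH function g: a concatenation of k sign-of-random-projection hashes."""
    A = rng.normal(size=(k, dim))
    return lambda v: tuple((A @ v >= 0).astype(int))

class LSHIndex:
    def __init__(self, dim, k=8, L=5, seed=0):
        rng = np.random.default_rng(seed)
        self.gs = [make_g(dim, k, rng) for _ in range(L)]
        self.tables = [defaultdict(list) for _ in range(L)]  # one hash table per g_j
        self.points = []

    def add(self, v):
        """Preprocessing: store v in bucket g_j(v) for every j = 1..L."""
        idx = len(self.points)
        self.points.append(v)
        for g, table in zip(self.gs, self.tables):
            table[g(v)].append(idx)

    def query(self, q, n_neighbors=1):
        """Querying: scan buckets g_1(q), ..., g_L(q), then rank the candidates linearly."""
        candidates = []
        for g, table in zip(self.gs, self.tables):
            candidates.extend(table[g(q)])
            if len(candidates) >= 3 * len(self.gs):   # optional early exit after ~3L items
                break
        # linear search only among the (few) retrieved candidates
        ranked = sorted(set(candidates), key=lambda i: np.linalg.norm(self.points[i] - q))
        return ranked[:n_neighbors]

rng = np.random.default_rng(1)
data = rng.normal(size=(1000, 32))
index = LSHIndex(dim=32)
for v in data:
    index.add(v)
print(index.query(data[42] + 0.01 * rng.normal(size=32)))  # most likely returns [42]
```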

$k$ and $L$ are parameters chosen s.t.:

  • if there exists some $\mathbf v^* \in B(\mathbf q, r_1)$ then $g_j(\mathbf v^*) = g_j(\mathbf q)$ for at least one $g_j$
  • total number of collisions on $\mathbf q$ with points from outside $B(\mathbf q, r_1)$ is less than $3L$

A hash function family is locality-sensitive if

  • “similar” objects $\mathbf v_1, \mathbf v_2$ (e.g. large $\text{sim}(\mathbf v_1, \mathbf v_2)$ for some Similarity Measure, or small $d(\mathbf v_1, \mathbf v_2)$ for some distance measure)
  • collide: have the same hash value with high probability

LSH Families

Hamming LSH

  • Or Bit Sampling LSH or “the” LSH (Gionis99)
  • Approximated distance: Hamming Distance

E2 LSH

  • Or p-stable LSH or random projection/quantization LSH
  • Euclidean LSH is often called E2LSH in the literature
  • Approximated Distance: Euclidean Distance
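
A minimal sketch of one p-stable hash following the Datar2004 construction; the bucket width $w$ is a tunable parameter and is chosen arbitrarily here:

```python
import numpy as np

def make_e2lsh_h(dim: int, w: float, rng: np.random.Generator):
    """h(v) = floor((a . v + b) / w), with a ~ N(0, I) (2-stable) and b ~ U[0, w)."""
    a = rng.normal(size=dim)
    b = rng.uniform(0, w)
    return lambda v: int(np.floor((np.dot(a, v) + b) / w))

rng = np.random.default_rng(0)
h = make_e2lsh_h(dim=32, w=4.0, rng=rng)
v = rng.normal(size=32)
print(h(v), h(v + 0.05 * rng.normal(size=32)))  # nearby points usually land in the same bucket
```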

MinHash

  • aka Min-Wise independent permutations
  • Approximated similarity: Jaccard
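
A minimal MinHash sketch in Python; the random permutations are simulated with salted hashes, which is a common shortcut rather than true min-wise independent permutations:

```python
import numpy as np

def make_minhash(num_hashes: int, seed: int = 0):
    """Return a signature function: for each of `num_hashes` salted hash functions,
    keep the minimum hash value over the set's elements."""
    rng = np.random.default_rng(seed)
    salts = [int(s) for s in rng.integers(0, 2**32, size=num_hashes)]
    def signature(items: set) -> list:
        return [min(hash((salt, x)) for x in items) for salt in salts]
    return signature

sig = make_minhash(num_hashes=200)
a = {"apple", "banana", "cherry", "date"}
b = {"apple", "banana", "cherry", "fig"}
sa, sb = sig(a), sig(b)
print(sum(x == y for x, y in zip(sa, sb)) / len(sa))  # estimates Jaccard(a, b) = 3/5
```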

SimHash

  • http://matpalm.com/resemblance/simhash/

Random Binary Projection

  • Approximated similarity: Cosine
  • Approximated Distance: Cosine Distance (1 - cosine)
  • http://stackoverflow.com/questions/12952729/how-to-understand-locality-sensitive-hashing
  • http://www.cs.jhu.edu/~vandurme/papers/VanDurmeLallACL10-slides.pdf
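
A minimal sketch of a sign-of-random-projection hash; the collision probability $P\big[h(\mathbf u) = h(\mathbf v)\big] = 1 - \theta(\mathbf u, \mathbf v) / \pi$ is the standard random-hyperplane result, which ties this family to cosine similarity:

```python
import numpy as np

def make_sign_hash(dim: int, rng: np.random.Generator):
    """h(v) = 1 if r . v >= 0 else 0, for a random hyperplane normal r ~ N(0, I)."""
    r = rng.normal(size=dim)
    return lambda v: int(np.dot(r, v) >= 0)

rng = np.random.default_rng(0)
hashes = [make_sign_hash(64, rng) for _ in range(2000)]

u, v = rng.normal(size=64), rng.normal(size=64)
agreement = np.mean([h(u) == h(v) for h in hashes])
theta = np.arccos(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))
print(agreement, 1 - theta / np.pi)  # the empirical rate should be close to 1 - theta/pi
```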

K-Means LSH

  • uses the structure of underlying data to find the best hash functions
  • “learns” the hash functions via K-Means

Bayesian LSH

  • Satuluri, V., Parthasarathy, S. “Bayesian locality sensitive hashing for fast similarity search.” 2012. [http://arxiv.org/abs/1110.1328]

Fewer Hash Tables

Ways to improve LSH

  • usually applicable to different LSH Families

Drawback of conventional LSH schemes:

  • to guarantee good quality, one needs many hash tables
  • this is a large space requirement for the index
  • also, in distributed settings it leads to high network load: each hash bucket lookup requires communication over the network

Techniques for reducing the number of hash tables:

  • the hash tables occupy a lot of space
  • can we have fewer tables but still maintain the same level of performance?

Space requirements:

  • each hash table has the same size as the entire dataset (i.e. if dataset has $N$ entries, then the hash index also has $N$ entries)

Entropy LSH

Reduces the number of hash tables

  • the indexing (hashing) stage stays the same, but $L$ is smaller than usual

uses a different procedure for querying:

  • hashes $\mathbf q$ as in the usual LSH
  • but also hashes random offsets of the query and checks which buckets these offsets map to

Idea:

  • close objects are likely to be in the same bucket as the query - but also in the buckets nearby
  • generate query offsets to hit the nearby buckets as well
  • this reduces the number of hash tables needed (see the sketch below)
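
A rough sketch of the query-offset idea, assuming Euclidean data; the offset radius `r` and the number of offsets are illustrative values (Panigrahy's analysis prescribes how to choose them):

```python
import numpy as np

def query_offsets(q: np.ndarray, num_offsets: int, r: float, rng: np.random.Generator):
    """Entropy-LSH-style query perturbations: points sampled uniformly on the sphere
    of radius r around q; each of them is hashed and its bucket is probed as well."""
    u = rng.normal(size=(num_offsets, q.shape[0]))
    u /= np.linalg.norm(u, axis=1, keepdims=True)   # random unit directions
    return q + r * u

rng = np.random.default_rng(0)
q = rng.normal(size=32)
probes = [q] + list(query_offsets(q, num_offsets=10, r=0.5, rng=rng))
# every point in `probes` is hashed with the same g functions,
# and all the resulting buckets are scanned for candidates
print(len(probes))  # 11 buckets probed per table instead of 1
```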

But:

  • it doesn’t help with network efficiency:
  • the query $\mathbf q$ and each of its offsets require a separate network call
  • the number of query offsets required by Entropy LSH is larger than the number of hash tables in the original scheme, so it’s even less network efficient than the usual LSH

Panigrahy, Rina. “Entropy based nearest neighbor search in high dimensions.” 2006. [http://arxiv.org/abs/cs/0510019]

Multi-Probe LSH

How to increase the quality of LSH?

Probe multiple times

  • it increases the scope of search and gives better Precision and Recall
  • idea: also look in the hash buckets nearby (see the sketch below)
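
A simplified illustration of the probing idea for quantized (E2LSH-style) hash values; Lv et al. use a query-directed probing sequence ordered by estimated success probability, whereas this sketch just enumerates the neighboring buckets:

```python
from itertools import combinations, product

def multiprobe_buckets(bucket: tuple, max_flips: int = 1) -> list:
    """Besides the query's own bucket, probe buckets whose hash values
    differ by +/-1 in up to `max_flips` coordinates."""
    probes = [bucket]
    for n in range(1, max_flips + 1):
        for positions in combinations(range(len(bucket)), n):
            for deltas in product((-1, 1), repeat=n):
                b = list(bucket)
                for pos, delta in zip(positions, deltas):
                    b[pos] += delta
                probes.append(tuple(b))
    return probes

print(multiprobe_buckets((3, 7, 2)))
# [(3, 7, 2), (2, 7, 2), (4, 7, 2), (3, 6, 2), (3, 8, 2), (3, 7, 1), (3, 7, 3)]
```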

Lv, Qin, et al. “Multi-probe LSH: efficient indexing for high-dimensional similarity search.” 2007.

Distributed Layered LSH

Layered LSH

  • carefully designed implementation of Entropy LSH

idea:

  • distribute hash buckets in such a way that nearby points are likely to be on the same machine (to get network efficiency)
  • while far away points are likely to be on different machines (to get load balance)

Achieved by rehashing

  • rehash the bucket IDs for both data and query via additional layer of LSH
  • and then use the hashed buckets as keys

Reference:

  • Bahmani, Bahman, Ashish Goel, and Rajendra Shinde. “Efficient distributed locality sensitive hashing.” 2012. [http://arxiv.org/abs/1210.7057]
  • the paper describes both a MapReduce implementation and a Storm implementation

Other Meta LSHs

Query-Adaptive LSH

This method adapts its behavior:

  • it picks those hash functions that are most likely to return the NN

Typical Query-Adaptive LSH procedure:

  • define a pool of $L$ hash functions (with $L$ larger than in the usual LSH)
  • compute a relevance criterion $\lambda_i$ for each $g_i$: this criterion identifies the hash functions that are most likely to return the NNs
  • relevance could be: distance between the query and the center of the cell

Jégou, Hervé, et al. “Query adaptative locality sensitive hashing.” 2008. https://hal.inria.fr/inria-00318614/document

Comparisons

  • http://arxiv.org/abs/1206.2082
  • minhash vs simhash: Henzinger, Monika. “Finding near-duplicate web pages: a large-scale evaluation of algorithms.” 2006. [http://infoscience.epfl.ch/record/99373/files/Henzinger06.pdf]

Sources

  • Slaney, Malcolm, and Michael Casey. “Locality-sensitive hashing for finding nearest neighbors [lecture notes].” 2008. [http://web.iitd.ac.in/~sumeet/Slaney2008-LSHTutorial.pdf]
  • Gionis, Aristides, Piotr Indyk, and Rajeev Motwani. “Similarity search in high dimensions via hashing.” 1999. [http://www.cs.princeton.edu/courses/archive/spring13/cos598C/Gionis.pdf]
  • Datar, Mayur, et al. “Locality-sensitive hashing scheme based on p-stable distributions.” 2004. [http://www.cs.princeton.edu/courses/archive/spring05/cos598E/bib/p253-datar.pdf]
  • Paulevé, Loïc, et al. “Locality sensitive hashing: A comparison of hash function types and querying mechanisms.” 2010. [https://hal.inria.fr/inria-00567191/document]