Euclidean LSH

Euclidean LSH (E2 LSH) - Locality-Sensitive Hashing for the Euclidean Space


$p$-Stable Distributions

A family of hash functions $\mathcal H$ is $(r_1, r_2, p_1, p_2)$-sensitive if for all $\mathbf p, \mathbf q$:

  • if $\mathbf p \in B(\mathbf q, r_1)$ then $P_{\mathcal H} \Big[ h(\mathbf q) = h(\mathbf p) \Big] \geqslant p_1$
  • if $\mathbf p \not \in B(\mathbf q, r_2)$ then $P_{\mathcal H} \Big[ h(\mathbf q) = h(\mathbf p) \Big] \leqslant p_2$

Here: choose $r_1 = R$ and $r_2 = c \cdot R$


A Probability Distribution $D$ over $\mathbb R$ is $p$-stable

  • if there exists $p \geqslant 0$ s.t. for any $n$ real numbers $v_1, ..., v_n$
  • and iid samples from distribution $D$: $\ X_1, X_2, \ ... \ , X_n \sim D$
  • the Random Variable $\sum v_i \, X_i$ follows the same distribution as $\left( \sum |v_i|^p \right)^{1/p} \cdot X = \| \mathbf v \|_p \cdot X$ where $X \sim D$


Known $p$-stable Distributions:

  • the Cauchy Distribution is $1$-stable
  • the Gaussian (Normal) Distribution is $2$-stable

In CS $p$-stable distributions are useful for Sketching

  • can be used to estimate $\| \mathbf v \|_p$ (see the sketch below)
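
For example, here is a minimal sketch of a 2-stable (Gaussian) sketch for estimating $\| \mathbf v \|_2$, following the median-based idea from Indyk's paper [3] (numpy assumed; the constant 0.6745 is the median of $|N(0,1)|$):

```python
import numpy as np

def l2_sketch_estimate(v, k=500, seed=0):
    """Estimate ||v||_2 from k Gaussian (2-stable) projections.

    By 2-stability each a_i . v is distributed as ||v||_2 * N(0, 1),
    so median(|a_i . v|) ~ ||v||_2 * median(|N(0, 1)|) = ||v||_2 * 0.6745.
    """
    rng = np.random.default_rng(seed)
    A = rng.standard_normal((k, len(v)))  # k iid Gaussian directions
    projections = A @ v                   # each entry ~ ||v||_2 * N(0, 1)
    return np.median(np.abs(projections)) / 0.6745

v = np.array([3.0, 4.0])                  # ||v||_2 = 5
print(l2_sketch_estimate(v))              # ~5 for large enough k
```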


Papers:

  • Nolan, J. "Stable distributions." 2009. [1] [2]
  • Indyk, Piotr. "Stable distributions, pseudorandom generators, embeddings, and data stream computation." 2006. [3]


E2 LSH

A hash function family is locality-sensitive if

  • "similar" $\mathbf v_1, \mathbf v_2$ - i.e. vectors with small $\| \mathbf v_1 - \mathbf v_2 \|_p$
  • collide: get the same hash value with high probability


Consider $\mathbf a \cdot \mathbf v$ - it's a projection of $\mathbf v$ onto the real line

  • it follows from $p$-stability that for any two vectors $\mathbf v_1, \mathbf v_2$ the distance between their projections, $\mathbf a \cdot \mathbf v_1 - \mathbf a \cdot \mathbf v_2$, is distributed as $\| \mathbf v_1 - \mathbf v_2 \|_p \cdot X$
  • so if we chop the real line into equal-width segments of length $w$, the resulting hash functions should be locality-preserving (see the check below)
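
A quick empirical check of this claim for the 2-stable (Gaussian) case - a sketch assuming numpy, not part of the original scheme:

```python
import numpy as np

rng = np.random.default_rng(0)
v1, v2 = rng.standard_normal(100), rng.standard_normal(100)

# project the difference onto many random Gaussian directions a
A = rng.standard_normal((50_000, 100))
diffs = A @ (v1 - v2)  # a . v1 - a . v2, one value per direction

# by 2-stability, diffs ~ ||v1 - v2||_2 * N(0, 1),
# so its standard deviation should match the distance
print(diffs.std(), np.linalg.norm(v1 - v2))  # the two numbers agree
```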


Random Projections

The Dot Product is a projection

  • this is at the core of LSH
  • let $\mathbf v$ be the query and $\mathbf x$ some random vector with each $x_i \sim N(0, 1)$ (components of $\mathbf x$ are sampled from the Normal Distribution)
  • the hash function is then defined as $h(\mathbf v) = \mathbf v \cdot \mathbf x$

Quantization:

  • then we quantize $h(\mathbf v)$ into a set of hash buckets - hoping that nearby items will fall into the same bin
  • i.e. we get the following hash function:
  • $h_{\mathbf x, b}(\mathbf v) = \left\lfloor \cfrac{\mathbf v \cdot \mathbf x + b}{w} \right\rfloor$
  • where $w$ is the quantization step - the length of each quantization bucket
  • and $b$ is a Random Variable sampled from the Uniform Distribution: $b \sim \text{unif}[0, w]$ (see the sketch below)
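
A minimal sketch of one such hash function (numpy assumed; the class name is illustrative):

```python
import numpy as np

class E2Hash:
    """One quantized random projection: h_{x,b}(v) = floor((v . x + b) / w)."""

    def __init__(self, dim, w, rng):
        self.x = rng.standard_normal(dim)  # x_i ~ N(0, 1), 2-stable
        self.b = rng.uniform(0, w)         # random shift b ~ unif[0, w]
        self.w = w                         # quantization step

    def __call__(self, v):
        return int(np.floor((v @ self.x + self.b) / self.w))

rng = np.random.default_rng(0)
h = E2Hash(dim=2, w=4.0, rng=rng)
print(h(np.array([1.0, 2.0])), h(np.array([1.1, 2.1])))  # likely the same bucket
```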


Requirements for the projection operator

  • let $\mathcal H$ be a family of LSH functions
  • for any $\mathbf p, \mathbf q \in \mathbb R^d$ that are close to each other, i.e. for $\|\mathbf p - \mathbf q \| \leqslant R_1$
  • probability that they end up in the same bucket should be high, i.e. $P_{\mathcal H} \Big[ h(\mathbf p) = h(\mathbf q) \Big] \geqslant p_1$
  • if $\mathbf p, \mathbf q$ are far apart, i.e. $\|\mathbf p - \mathbf q \| \geqslant R_2$
  • probability that they end up in the same bucket is low, i.e. $P_{\mathcal H} \Big[ h(\mathbf p) = h(\mathbf q) \Big] \leqslant p_2$


I.e. $| h(\mathbf p) - h(\mathbf q) | \sim \| \mathbf p - \mathbf q \|$ ($\sim$ = "distributed proportionally to")


$L$ Random Projections

Can magnify the difference between $p_1$ and $p_2$ by performing projections on $K$ random directions in parallel - since $p_1 > p_2$:

$\left( \cfrac{p_1}{p_2} \right)^K > \cfrac{p_1}{p_2}$


  • So given a query $\mathbf v$, apply $K$ individual dot products
  • and obtain $K$ real numbers $h_i (\mathbf v)$ - each quantized (put into a bucket) separately
  • so $H(\mathbf v) = \Big[ h_1(\mathbf v), h_2(\mathbf v), \ ... \ , h_K(\mathbf v) \Big]$ - after quantization, a vector of $K$ bucket indices


Now we say that $\mathbf u$ is an NN candidate for $\mathbf v$ if they end up in the same bucket, i.e. $H(\mathbf u) = H(\mathbf v)$ (matching for all $h_i(\cdot)$)


Note that for $K$ dot products the probability of having the same bucket is $p_1^K$ - which decreases as we increase $K$

  • to reduce this effect, form $L$ independent projections - i.e. we'll have $L$ hash functions $H_1, \ ... \ , H_L$
  • then $\mathbf u$ is a candidate if it ends up in the same bucket as $\mathbf v$ for at least one projection $H_i$
  • it's very unlikely that the true NN will be absent from all $L$ bins (see the sketch below)
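
Putting $K$ and $L$ together - a compact sketch of the whole index (numpy assumed; a dict of lists stands in for the hash tables, and all names are illustrative):

```python
import numpy as np
from collections import defaultdict

class E2LSHIndex:
    """L hash tables; each key is the tuple of K quantized projections H_i(v)."""

    def __init__(self, dim, K, L, w, seed=0):
        rng = np.random.default_rng(seed)
        self.X = rng.standard_normal((L, K, dim))  # L groups of K directions
        self.b = rng.uniform(0, w, size=(L, K))    # random shifts
        self.w = w
        self.tables = [defaultdict(list) for _ in range(L)]

    def _keys(self, v):
        # H_i(v): K bucket indices per table, used together as one key
        return [tuple(np.floor((self.X[i] @ v + self.b[i]) / self.w).astype(int))
                for i in range(len(self.tables))]

    def add(self, v, label):
        for table, key in zip(self.tables, self._keys(v)):
            table[key].append(label)

    def candidates(self, v):
        # u is a candidate if it shares a bucket with v in at least one table
        out = set()
        for table, key in zip(self.tables, self._keys(v)):
            out.update(table.get(key, ()))
        return out

rng = np.random.default_rng(1)
data = rng.standard_normal((1000, 50))
index = E2LSHIndex(dim=50, K=8, L=10, w=4.0)
for i, v in enumerate(data):
    index.add(v, i)
query = data[0] + 0.01 * rng.standard_normal(50)  # near-duplicate of point 0
print(0 in index.candidates(query))               # True with high probability
```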


Hash Implementation

(E2 LSH)

$\{ H_i \}$ put each data point into a hash bucket described by a $K$-vector

  • this $K$-dim key space is very sparse - we can densify it
  • use exact hashes - i.e. "classical" Hash Tables for this
  • so let $T_1(\mathbf v) = \mathbf w_1^T H(\mathbf v) \ \mod P_1$ for some random weight vector $\mathbf w_1$ and a big prime number $P_1$ (which is also the size of the hash table)


Unrelated points may still collide into the same bucket

  • how to handle it?
  • have another function $T_2$ with different weights $\mathbf w_2$ and $P_2$
  • let $T_2(\mathbf v)$ be the "fingerprint" of $\mathbf v$
  • we store the fingerprint of $\mathbf v$ in the bin chosen by $T_1$
  • then during retrieval we can find the true match by comparing fingerprints: they are short and their comparison is fast
  • chances that both $T_1$ and $T_2$ collide for unrelated points are very small (see the sketch below)
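
A minimal sketch of this bucket-plus-fingerprint trick (numpy assumed; the primes and sizes are illustrative):

```python
import numpy as np

P1, P2 = 2_147_483_647, 1_000_000_007  # two big primes (P1 = hash table size)
rng = np.random.default_rng(0)

def make_T(K, P, rng):
    """T(v) = w^T H(v) mod P for a random weight vector w."""
    w = rng.integers(1, P, size=K)
    return lambda H: int(np.dot(w, H) % P)

T1 = make_T(K=8, P=P1, rng=rng)  # picks the bin in the hash table
T2 = make_T(K=8, P=P2, rng=rng)  # short fingerprint stored inside the bin

H_v = np.array([3, -1, 0, 2, 5, -4, 1, 7])  # some K-vector H(v)
bucket, fingerprint = T1(H_v), T2(H_v)
# store `fingerprint` in bin `bucket`; at query time, compare fingerprints
# instead of full K-vectors - collisions in both T1 and T2 are very unlikely
print(bucket, fingerprint)
```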


Accuracy

  • accuracy is determined by the probability that the scheme finds the true NN


$p$-Stable Distributions

A distribution is $p$-stable if

  • for iid sample $X_1, \ ... \ , X_n \sim D$
  • for any $v_1, \ ... \ , v_n \in \mathbb R$
  • $\sum_i v_i \, X_i$ follows the same distribution as the random variable $\left( \sum_i |v_i|^p \right)^{1 / p} \cdot X = \| \mathbf v \|_p \cdot X$
  • where $\| \, \cdot \, \|_p$ is the $L_p$ norm
  • and $X \sim D$
  • let $f_p$ denote the PDF of the absolute value of $X$


For $p = 2$ the Gaussian is $p$-stable

What it means:

  • let $u = \| \mathbf p - \mathbf q \|_p$ - it will be small if $\mathbf p$ and $\mathbf q$ are close
  • so their projections land close to each other - but because of quantization they may still end up in different buckets


Probability of getting to the same bucket:

  • $p(u) = P_{\mathbf a, b} \Big[ h_{\mathbf a, b}(\mathbf p) = h_{\mathbf a, b}(\mathbf q) \Big] = \int\limits_{0}^{w} \frac{1}{u} \, f_p \left( \frac{t}{u} \right) \left( 1 - \frac{t}{w} \right) \text{d}t$
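
For $p = 2$, $f_2$ is the PDF of $|X|$ for $X \sim N(0, 1)$, and the integral above can be checked numerically (a sketch assuming scipy):

```python
import numpy as np
from scipy.integrate import quad

def f2(t):
    # PDF of |X| for X ~ N(0, 1): twice the standard normal density
    return np.sqrt(2.0 / np.pi) * np.exp(-t * t / 2.0)

def collision_probability(u, w):
    """P[h_{a,b}(p) = h_{a,b}(q)] for u = ||p - q||_2 and bucket width w."""
    integrand = lambda t: (1.0 / u) * f2(t / u) * (1.0 - t / w)
    return quad(integrand, 0.0, w)[0]

print(collision_probability(u=1.0, w=4.0))  # close pair: high probability
print(collision_probability(u=8.0, w=4.0))  # distant pair: much lower
```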


Sources

  • Slaney, Malcolm, and Michael Casey. "Locality-sensitive hashing for finding nearest neighbors [lecture notes]." 2008. [4]
  • Datar, Mayur, et al. "Locality-sensitive hashing scheme based on p-stable distributions." 2004. [5]