Bit Sampling LSH

  • LSH for the Hamming Distance
  • can convert $L_1$ to Hamming distance
  • NNs are usually the same for $L_1$ and $L_2$
    • see Figiel et al. "The dimension of almost spherical sections of convex bodies." 1977. [1]
  • so it can be ("sort of") used for Euclidean spaces


Hamming LSH

Suppose we have $P = \{ \mathbf p_i \}$ where $\mathbf p_i \in H^{d} = \{0, 1\}^{d}$ - i.e. points are in the (binary) Hamming space of dimensionality $d$ (or $H^{Cd}$: $d$ chunks of $C$ bits each, after the $L_1$ embedding below)


Hash Functions

  • sample $L$ subsets $I_1, \ ... \ , I_L$ of size $k$ uniformly (with replacement) from $\{1 \ ... \ d \}$
  • let $g_j(\mathbf p)$ be the projection of $\mathbf p$ on $I_j$: it selects coordinate positions per $I_j$ and concatenates bits on these positions
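
A minimal sketch of these hash functions in Python (the names `sample_subsets` and `project` are illustrative, not from the paper):

    import random

    def sample_subsets(d, k, L, seed=0):
        # L index subsets of size k, sampled uniformly with replacement from {0, ..., d-1}
        rnd = random.Random(seed)
        return [[rnd.randrange(d) for _ in range(k)] for _ in range(L)]

    def project(p, I_j):
        # g_j(p): concatenate the bits of p found at the positions listed in I_j
        return ''.join(str(p[i]) for i in I_j)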


Indexing

Preprocessing (indexing):

  • store each $\mathbf p \in P$ in the bucket $g_j(\mathbf p)$ for all $j = 1 \, .. \, L$
  • the total number of possible buckets may be large (the $g_j(\mathbf p)$'s are sparse), so we desparsify and reduce dimensionality using a standard hash function


So we have two levels of hashing:

  • LSH Hash to map $\mathbf p$ to bucket $g_j(\mathbf p)$
  • standard hash function to map $g_j(\mathbf p)$ to a hash table of size $M$


Pseudocode:

  • Input: database $P$, number of hash tables $L$
  • Output: $L$ hash tables $\tau_j$
  • generate $L$ random hash functions $g_j(\cdot)$ - one for each $\tau_j$
  • for each $\mathbf p_i \in P$ and for each $(g_j, \tau_j)$:
  • append $\mathbf p_i$ to the bucket $\tau_j\big[ g_j(\mathbf p_i) \big]$ (a bucket may hold many points)
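
A hedged sketch of this indexing step, reusing the helpers above; a Python dict plays the role of the second-level (standard) hash table, so both levels of hashing come for free:

    from collections import defaultdict

    def build_index(P, d, k, L):
        subsets = sample_subsets(d, k, L)               # I_1, ..., I_L
        tables = [defaultdict(list) for _ in range(L)]  # tau_1, ..., tau_L
        for p in P:
            for I_j, tau_j in zip(subsets, tables):
                # append p to its bucket; a bucket may hold many points
                tau_j[project(p, I_j)].append(p)
        return subsets, tables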


Querying

Querying:

  • given $\mathbf q$
  • search all buckets $g_j(\mathbf q)$ until:
  • either encounter $c \cdot L$ points ($c$ to be specified)
  • or checked all $L$ indexes
  • then among all collected candidates keep only the $K$ closest


Pseudocode:

  • input:
    • a query point $\mathbf q$,
    • number of nearest neighbors $K$,
    • $L$ tables $\tau_j$
  • output: $K$ (or fewer) NNs
  • let $S \leftarrow \varnothing$ be a candidate list
  • for each $(g_j, \tau_j)$
    • $S \leftarrow S \cup \tau_j \big[ g_j(\mathbf q) \big]$
  • return the $K$ NNs from $S$ (can be found using linear search)
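
A sketch of the querying procedure under the same assumptions; the optional early-stopping cutoff $c \cdot L$ from above is included as a parameter, and candidates are deduplicated since a point may land in buckets of several tables:

    def query(q, K, subsets, tables, c=None):
        S = []  # candidate list
        for I_j, tau_j in zip(subsets, tables):
            S.extend(tau_j.get(project(q, I_j), []))
            if c is not None and len(S) >= c * len(tables):
                break  # encountered c * L candidates, stop early
        # keep only the K closest candidates by Hamming distance (linear search)
        hamming = lambda p: sum(a != b for a, b in zip(p, q))
        return sorted(set(map(tuple, S)), key=hamming)[:K]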


Parameters

  • $L$: number of subsets $I_j$'s and hence the # of hash functions
  • $k$: size of $I_j$'s: $k$ is chosen s.t.
    • it maximizes the probability that a point $\mathbf p$ close to $\mathbf q$ ends up in the same bucket as $\mathbf q$
    • it minimizes the probability that $\mathbf p$ and $\mathbf q$ end up in the same bucket when $\mathbf p$ is not close to $\mathbf q$
    • $k \approx 700$ is a good value for $d \approx 64$
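
The tradeoff behind $k$ can be made concrete: a single sampled bit of $\mathbf p$ and $\mathbf q$ agrees with probability $1 - d_H(\mathbf p, \mathbf q) / d$, so with $k$ bits sampled with replacement $\Pr \big[ g_j(\mathbf p) = g_j(\mathbf q) \big] = \big( 1 - d_H(\mathbf p, \mathbf q) / d \big)^k$ - growing $k$ suppresses collisions of far pairs much faster than collisions of near pairs. A small illustration (the concrete numbers are made up; $d$ here is the dimensionality of the Hamming space the bits are sampled from):

    def collision_prob(r, d, k):
        # probability that g_j(p) == g_j(q) when d_H(p, q) = r,
        # with k bit positions sampled uniformly with replacement from d positions
        return (1 - r / d) ** k

    for k in (50, 200, 700):
        # near pair (r = 2) vs far pair (r = 20) in a d = 640-bit Hamming space
        print(k, collision_prob(2, 640, k), collision_prob(20, 640, k))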


Embedding $L_1$ to Hamming

For converting from $L_1$ to Hamming:

  • assume all coordinates of every $\mathbf p \in P$ are positive integers
  • coordinates may be made all-positive by translating the origin


Let $C$ be the largest coordinate in all points in $P$

  • then we can embed $P$ into a Hamming cube $\{0, 1\}^{Cd}$ where $d$ is the dimensionality of the original space
  • transform each $\mathbf p = (p_1, \ ... \ , p_d)$ into a binary vector:
  • let $v(\mathbf p) = \big( \text{unary}_C(p_1), \ ... \ , \text{unary}_C(p_d) \big)$
  • $\text{unary}_C(p)$ denotes the unary representation of $p$: a sequence of $p$ ones followed by $C - p$ zeros
  • e.g. for $C = 10$ and $p = 4$, we'd have $\text{unary}_{10}(4) = 11110 \, 00000$
  • for a vector $\mathbf p = (3, 4, 5)$ (again with $C = 10$) we would have $v(\mathbf p) = ( 11100 \, 00000 \, 11110 \, 00000 \, 11111 \, 00000 )$
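
A minimal sketch of this embedding (`unary_embed` is an illustrative name):

    def unary_C(p, C):
        # unary representation of p: p ones followed by C - p zeros
        return [1] * p + [0] * (C - p)

    def unary_embed(p_vec, C):
        # v(p): concatenate the unary codes of all d coordinates into Cd bits
        return [bit for p in p_vec for bit in unary_C(p, C)]

    print(''.join(map(str, unary_embed((3, 4, 5), 10))))
    # -> 111000000011110000001111100000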


Now,

  • if $\mathbf p, \mathbf q \in \{1 \, .. \, C \}^d$
  • then $d_1 \big(\mathbf p, \mathbf q \big) = d_H \big(v(\mathbf p), v(\mathbf q) \big)$
  • i.e. this embedding preserves the $L_1$ distance between points
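
This is easy to check with the sketch above:

    p, q = (3, 4, 5), (6, 1, 5)
    d1 = sum(abs(a - b) for a, b in zip(p, q))  # L1 distance: 3 + 3 + 0 = 6
    dH = sum(a != b for a, b in zip(unary_embed(p, 10), unary_embed(q, 10)))
    assert d1 == dH == 6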


Mapping $\mathbf p \in \mathbb R^d$ to $\mathbf p' \in H^{Cd}$ (efficient hash computation):

  • choose $I$: a $k$-vector of indexes sampled from $\{1 \, .. \, Cd \}$ - positions in the Hamming embedding
  • we need to compute a projection on $I$
  • for each component $p_i$ of $\mathbf p$ do:
    • let $I^{(i)}$ denote coordinates of $I$ that correspond to $p_i$
      • these are some of the $C$ coordinates of $p_i$ in the Hamming embedding that got selected in $I$
    • sort the indexes inside $I^{(i)}$
    • when we project $v(\mathbf p)$ on the sorted $I^{(i)}$, the result is a monotone sequence of bits: 1's followed by 0's
    • let $o_i$ be the number of 1's in this projection
    • let $\mathbf p'_i$ denote the part of $\mathbf p'$ that corresponds to coordinate $p_i$
    • so, to represent $\mathbf p'_i$ it's enough to know only $o_i$
  • thus $\mathbf p'$ can be represented by $\{o_1, o_2, \ ... \ , o_d \}$


Computing $o_i$ fast:

  • there's a way to compute $o_i$ fast
  • finding $o_i$ is equivalent to finding the number of elements in sorted array $I^{(i)}$ which are smaller than $p_i$
  • can be done via Binary Search in $O(\log C)$
  • or even in constant time if we use a precomputed array of $C$ bits
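
A sketch of this trick, assuming 0-based bit positions (so bit $j$ of $\text{unary}_C(p_i)$ is 1 iff $j < p_i$); `bisect_left` on the sorted $I^{(i)}$ counts the sampled positions smaller than $p_i$:

    from bisect import bisect_left

    def project_counts(p_vec, I, C):
        # I: k sampled positions (0-based) in the Cd-dimensional Hamming embedding
        d = len(p_vec)
        chunks = [[] for _ in range(d)]  # I^(1), ..., I^(d)
        for idx in I:
            # chunk number is idx // C, position within the chunk is idx % C
            chunks[idx // C].append(idx % C)
        # o_i = number of sampled positions in chunk i that are smaller than p_i
        return [bisect_left(sorted(ch), p) for ch, p in zip(chunks, p_vec)]

The resulting representation $\{o_1, \ ... \ , o_d\}$ determines the projection of $\mathbf p'$ without ever materializing the $Cd$-bit vector $v(\mathbf p)$.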


Total projection time

  • then total time to project $\mathbf p$ to $\mathbf p'$ is
  • $O(d \, \log C)$ for Binary Search or
  • $O(d)$ for precomputed arrays


Sources

  • Gionis, Aristides, Piotr Indyk, and Rajeev Motwani. "Similarity search in high dimensions via hashing." 1999. [2]