ML Wiki
Machine Learning Wiki - A collection of ML concepts, algorithms, and resources.

KNN

$K$ Nearest Neighbors

KNN Graph a graph that contains links only between $k$ closest neighborhoods used in

Machine Learning and Statistics

Simple and popular method for many areas of machine learning, statistics and beyond

Regression Problem

Classification Problem

Cluster Analysis

KNN approach is used in SNN Clustering

  • can use the number of shared neighbors as a similarity measure

Probability Density Estimation

  • If $k$th nearest neighbor is close, then the region is most likely of high density
  • so the distance to $k$th neighbor gives a measure of density of a point
  • can use it with Euclidean Distance, Cosine Similarity or SNN Similarity (see SNN Clustering)

Curse of Dimensionality

Also note that for high dimensional data many distance/similarity measures become less meaningful

Indexing for KNN Queries

Brute force search for $k$NN takes $O(N)$ where $N$ is the size of the database

  • need to use Multi-Dimensional Indexes, for example, trees: Kd-Trees or R-Trees
  • however for high dimensional data tree performance degrades from $O(\log N)$ to $O(N)$
  • see Weber98: all indexing techniques degrade to linear search for large dimensionality
  • also can’t use classical hash-based indexes (like Linear Hashing or Extensible Hashing): they aim at exact match and don’t handle KNN queries

Trees

Hashes:

Dealing with High Dimensionality

How to deal with the Curse of Dimensionality?

Pre-aggregating data:

  • can use Locality Sensitive Hashing: probabilistic/approximate indexing techniques that return the true KNNs most of the time correctly
  • it would put data items into hash buckets, and when we look for KNN of $q$, we look into buckets where $q$ landed
  • alternatively, can use canopies

References

  • Weber, Roger, Hans-Jörg Schek, and Stephen Blott. “A quantitative analysis and performance study for similarity-search methods in high-dimensional spaces.” 1998. [http://www.vldb.org/conf/1998/p194.pdf]
  • Nearest-Neighbor Methods in Learning and Vision: Theory and Practice [http://people.csail.mit.edu/gregory/annbook/book.html]

Sources

  • Ertöz, Levent, Michael Steinbach, and Vipin Kumar. “Finding clusters of different sizes, shapes, and densities in noisy, high dimensional data.” 2003. [http://static.msi.umn.edu/rreports/2003/73.pdf]

Categories:Machine Learning Categories:Statistics