$K$ Nearest Neighbors
KNN Graph a graph that contains links only between $k$ closest neighborhoods used in
Machine Learning and Statistics
Simple and popular method for many areas of machine learning, statistics and beyond
Regression Problem
Classification Problem
Cluster Analysis
KNN approach is used in SNN Clustering
- can use the number of shared neighbors as a similarity measure
Probability Density Estimation
- If $k$th nearest neighbor is close, then the region is most likely of high density
- so the distance to $k$th neighbor gives a measure of density of a point
- can use it with Euclidean Distance, Cosine Similarity or SNN Similarity (see SNN Clustering)
Curse of Dimensionality
Also note that for high dimensional data many distance/similarity measures become less meaningful
- especially Euclidean Distance
- can use special functions that can handle high dimensional data: SNN Similarity (see SNN Clustering)
Indexing for KNN Queries
Brute force search for $k$NN takes $O(N)$ where $N$ is the size of the database
- need to use Multi-Dimensional Indexes, for example, trees: Kd-Trees or R-Trees
- however for high dimensional data tree performance degrades from $O(\log N)$ to $O(N)$
- see Weber98: all indexing techniques degrade to linear search for large dimensionality
- also can’t use classical hash-based indexes (like Linear Hashing or Extensible Hashing): they aim at exact match and don’t handle KNN queries
Trees
- Metric Trees
- Spill-Trees gives approximate answer to KNN
- both don’t work well in high dimensions, but can apply Random Projections to make them work
Hashes:
Dealing with High Dimensionality
How to deal with the Curse of Dimensionality?
Pre-aggregating data:
- can use Locality Sensitive Hashing: probabilistic/approximate indexing techniques that return the true KNNs most of the time correctly
- it would put data items into hash buckets, and when we look for KNN of $q$, we look into buckets where $q$ landed
- alternatively, can use canopies
References
- Weber, Roger, Hans-Jörg Schek, and Stephen Blott. “A quantitative analysis and performance study for similarity-search methods in high-dimensional spaces.” 1998. [http://www.vldb.org/conf/1998/p194.pdf]
- Nearest-Neighbor Methods in Learning and Vision: Theory and Practice [http://people.csail.mit.edu/gregory/annbook/book.html]
Sources
- Ertöz, Levent, Michael Steinbach, and Vipin Kumar. “Finding clusters of different sizes, shapes, and densities in noisy, high dimensional data.” 2003. [http://static.msi.umn.edu/rreports/2003/73.pdf]