K Nearest Neighbors
KNN Graph
a graph that contains links only between k closest neighborhoods
used in
Simple and popular method for many areas of machine learning, statistics and beyond
KNN approach is used in SNN Clustering
- can use the number of shared neighbors as a similarity measure
Also note that for high dimensional data many distance/similarity measures become less meaningful
Indexing for KNN Queries
Brute force search for kNN takes O(N) where N is the size of the database
- need to use Multi-Dimensional Indexes, for example, trees: Kd-Trees or R-Trees
- however for high dimensional data tree performance degrades from O(logN) to O(N)
- see Weber98: all indexing techniques degrade to linear search for large dimensionality
- also can't use classical hash-based indexes (like Linear Hashing or Extensible Hashing): they aim at exact match and don't handle KNN queries
Trees
Hashes:
Dealing with High Dimensionality
How to deal with the Curse of Dimensionality?
Pre-aggregating data:
- can use Locality Sensitive Hashing: probabilistic/approximate indexing techniques that return the true KNNs most of the time correctly
- it would put data items into hash buckets, and when we look for KNN of q, we look into buckets where q landed
- alternatively, can use canopies
References
- Weber, Roger, Hans-Jörg Schek, and Stephen Blott. "A quantitative analysis and performance study for similarity-search methods in high-dimensional spaces." 1998. [1]
- Nearest-Neighbor Methods in Learning and Vision: Theory and Practice [2]
Sources
- Ertöz, Levent, Michael Steinbach, and Vipin Kumar. "Finding clusters of different sizes, shapes, and densities in noisy, high dimensional data." 2003. [3]
Categories:Machine Learning
Categories:Statistics