Latest revision as of 23:24, 27 April 2017
$K$ Nearest Neighbors
KNN Graph
a graph that contains links only between $k$ closest neighborhoods
used in
Simple and popular method for many areas of machine learning, statistics and beyond
KNN approach is used in SNN Clustering
- can use the number of shared neighbors as a similarity measure
- If $k$th nearest neighbor is close, then the region is most likely of high density
- so the distance to $k$th neighbor gives a measure of density of a point
- can use it with Euclidean Distance, Cosine Similarity or SNN Similarity (see SNN Clustering)
Also note that for high dimensional data many distance/similarity measures become less meaningful
Indexing for KNN Queries
Brute force search for $k$NN takes $O(N)$ where $N$ is the size of the database
- need to use Multi-Dimensional Indexes, for example, trees: Kd-Trees or R-Trees
- however for high dimensional data tree performance degrades from $O(\log N)$ to $O(N)$
- see Weber98: all indexing techniques degrade to linear search for large dimensionality
- also can't use classical hash-based indexes (like Linear Hashing or Extensible Hashing): they aim at exact match and don't handle KNN queries
Trees
Hashes:
Dealing with High Dimensionality
How to deal with the Curse of Dimensionality?
Pre-aggregating data:
- can use Locality Sensitive Hashing: probabilistic/approximate indexing techniques that return the true KNNs most of the time correctly
- it would put data items into hash buckets, and when we look for KNN of $q$, we look into buckets where $q$ landed
- alternatively, can use canopies
References
- Weber, Roger, Hans-Jörg Schek, and Stephen Blott. "A quantitative analysis and performance study for similarity-search methods in high-dimensional spaces." 1998. [1]
- Nearest-Neighbor Methods in Learning and Vision: Theory and Practice [2]
Sources
- Ertöz, Levent, Michael Steinbach, and Vipin Kumar. "Finding clusters of different sizes, shapes, and densities in noisy, high dimensional data." 2003. [3]
Categories:Machine Learning
Categories:Statistics