ML Wiki

Machine Learning Wiki - A collection of ML concepts, algorithms, and resources.

KNN

$K$ Nearest Neighbors

KNN Graph a graph that contains links only between $k$ closest neighborhoods used in

Machine Learning and Statistics

Simple and popular method for many areas of machine learning, statistics and beyond

Regression Problem

Classification Problem

Cluster Analysis

KNN approach is used in SNN Clustering

can use the number of shared neighbors as a similarity measure

Probability Density Estimation

If $k$th nearest neighbor is close, then the region is most likely of high density
so the distance to $k$th neighbor gives a measure of density of a point
can use it with Euclidean Distance, Cosine Similarity or SNN Similarity (see SNN Clustering)

Curse of Dimensionality

Also note that for high dimensional data many distance/similarity measures become less meaningful

especially Euclidean Distance
can use special functions that can handle high dimensional data: SNN Similarity (see SNN Clustering)

Indexing for KNN Queries

Brute force search for $k$NN takes $O(N)$ where $N$ is the size of the database

need to use Multi-Dimensional Indexes, for example, trees: Kd-Trees or R-Trees
however for high dimensional data tree performance degrades from $O(\log N)$ to $O(N)$
see Weber98: all indexing techniques degrade to linear search for large dimensionality
also can’t use classical hash-based indexes (like Linear Hashing or Extensible Hashing): they aim at exact match and don’t handle KNN queries

Trees

Metric Trees
Spill-Trees gives approximate answer to KNN
both don’t work well in high dimensions, but can apply Random Projections to make them work

Hashes:

Dealing with High Dimensionality

How to deal with the Curse of Dimensionality?

Pre-aggregating data:

can use Locality Sensitive Hashing: probabilistic/approximate indexing techniques that return the true KNNs most of the time correctly
it would put data items into hash buckets, and when we look for KNN of $q$, we look into buckets where $q$ landed
alternatively, can use canopies

References

Weber, Roger, Hans-Jörg Schek, and Stephen Blott. “A quantitative analysis and performance study for similarity-search methods in high-dimensional spaces.” 1998. [http://www.vldb.org/conf/1998/p194.pdf]
Nearest-Neighbor Methods in Learning and Vision: Theory and Practice [http://people.csail.mit.edu/gregory/annbook/book.html]

Sources

Ertöz, Levent, Michael Steinbach, and Vipin Kumar. “Finding clusters of different sizes, shapes, and densities in noisy, high dimensional data.” 2003. [http://static.msi.umn.edu/rreports/2003/73.pdf]

Categories:Machine Learning Categories:Statistics

✏️ Edit on GitHub