# ML Wiki

## $K$ Nearest Neighbors

KNN Graph a graph that contains links only between $k$ closest neighborhoods used in

## Machine Learning and Statistics

Simple and popular method for many areas of machine learning, statistics and beyond

### Cluster Analysis

KNN approach is used in SNN Clustering

• can use the number of shared neighbors as a similarity measure

### Probability Density Estimation

• If $k$th nearest neighbor is close, then the region is most likely of high density
• so the distance to $k$th neighbor gives a measure of density of a point
• can use it with Euclidean Distance, Cosine Similarity or SNN Similarity (see SNN Clustering)

## Curse of Dimensionality

Also note that for high dimensional data many distance/similarity measures become less meaningful

## Indexing for KNN Queries

Brute force search for $k$NN takes $O(N)$ where $N$ is the size of the database

• need to use Multi-Dimensional Indexes, for example, trees: Kd-Trees or R-Trees
• however for high dimensional data tree performance degrades from $O(\log N)$ to $O(N)$
• see Weber98: all indexing techniques degrade to linear search for large dimensionality
• also can't use classical hash-based indexes (like Linear Hashing or Extensible Hashing): they aim at exact match and don't handle KNN queries

Trees

Hashes:

### Dealing with High Dimensionality

How to deal with the Curse of Dimensionality?

Pre-aggregating data:

• can use Locality Sensitive Hashing: probabilistic/approximate indexing techniques that return the true KNNs most of the time correctly
• it would put data items into hash buckets, and when we look for KNN of $q$, we look into buckets where $q$ landed
• alternatively, can use canopies

## References

• Weber, Roger, Hans-Jörg Schek, and Stephen Blott. "A quantitative analysis and performance study for similarity-search methods in high-dimensional spaces." 1998. [1]
• Nearest-Neighbor Methods in Learning and Vision: Theory and Practice [2]

## Sources

• Ertöz, Levent, Michael Steinbach, and Vipin Kumar. "Finding clusters of different sizes, shapes, and densities in noisy, high dimensional data." 2003. [3]