# ML Wiki

## Curse of Dimensionality

In high dimensional space distances (esp. Euclidean Distance) become less meaningful

• distance between each pair of point is almost the same
• for many data distributions and distances

$$\lim_{d \to \infty} \frac{\text{dist}_\max - \text{dist}_\min}{\text{dist}_\min} = 0$$

• Curse of Dimensionality makes similarity functions behave poorly
• KNN makes less sense
• distances become more uniform as dimensionality grows
• and this makes clustering difficult

Similarity between two point of high dimensionality can be misleading

• often points may be similar even though they should belong to different clusters

How to deal?

## NLP and Embeddings

solution: Word Embeddings

• embed the words into some dense low-dimensional space
• express the join distribution in terms of these embeddings

## Papers

• Beyer, Kevin, et al. "When is “nearest neighbor” meaningful?." 1999. [1]
• Kriegel, Hans-Peter, Peer Kröger, and Arthur Zimek. "Clustering high-dimensional data: A survey on subspace clustering, pattern-based clustering, and correlation clustering." (2009)