Curse of Dimensionality

In high-dimensional spaces, distances (especially the Euclidean distance) become less meaningful

  • the distance between any pair of points is almost the same
  • this holds for many data distributions and distance functions


$$\lim_{d \to \infty} \frac{\text{dist}_\max - \text{dist}_\min}{\text{dist}_\min} = 0$$
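This limit can be illustrated empirically. A minimal sketch (assuming points drawn uniformly from the unit cube and Euclidean distances measured from the origin; function name and parameters are illustrative):

```python
import numpy as np

def relative_contrast(d, n=1000, seed=0):
    """Relative contrast (dist_max - dist_min) / dist_min of Euclidean
    distances from the origin to n points drawn uniformly from [0, 1]^d."""
    rng = np.random.default_rng(seed)
    points = rng.random((n, d))       # n points in d dimensions
    dists = np.linalg.norm(points, axis=1)
    return (dists.max() - dists.min()) / dists.min()

# The contrast shrinks as dimensionality grows:
for d in [2, 10, 100, 1000]:
    print(d, relative_contrast(d))
```

As d grows, the norms concentrate around their mean, so the nearest and farthest points become nearly indistinguishable.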


  • the Curse of Dimensionality makes similarity functions behave poorly
  • KNN makes less sense
  • distances become more uniform as dimensionality grows, which makes clustering difficult

Similarity between two high-dimensional points can be misleading

  • points may appear similar even though they belong to different clusters


How to deal with it?

NLP and Embeddings

Solution: Word Embeddings

  • embed the words into some dense low-dimensional space
  • express the joint distribution in terms of these embeddings


Papers

  • Beyer, Kevin, et al. "When is 'nearest neighbor' meaningful?" (1999). [1]
  • Kriegel, Hans-Peter, Peer Kröger, and Arthur Zimek. "Clustering high-dimensional data: A survey on subspace clustering, pattern-based clustering, and correlation clustering." (2009).


Sources