Line 1: Line 1:
 +
 
== Curse of Dimensionality ==
 
== Curse of Dimensionality ==
 
In high dimensional space distances (esp. [[Euclidean Distance]]) become less meaningful
 
In high dimensional space distances (esp. [[Euclidean Distance]]) become less meaningful
Line 21: Line 22:
 
* [[Subspace Clustering]]
 
* [[Subspace Clustering]]
  
 +
==  NLP and Embeddings ==
 +
* Suppose we want to make a [[Language Model]]
 +
* How to model the [[Joint Distribution]] between many [[Discrete Random Variables]]?
 +
** e.g. words in a sentence
 +
** 10 constitutive words in a language with vocabulary 100k - then there are $100\, 000^10 -1 = 10^{50} - 1$ free parameters
  
== Paper ==
+
solution: [[Word Embeddings]]
 +
* embed the words into some dense low-dimensional space
 +
* express the join distribution in terms of these embeddings
 +
 
 +
 
 +
== Papers ==
 
* Beyer, Kevin, et al. "When is “nearest neighbor” meaningful?." 1999. [http://www.loria.fr/~berger/Enseignement/Master2/Exposes/beyer.pdf]
 
* Beyer, Kevin, et al. "When is “nearest neighbor” meaningful?." 1999. [http://www.loria.fr/~berger/Enseignement/Master2/Exposes/beyer.pdf]
 
* Kriegel, Hans-Peter, Peer Kröger, and Arthur Zimek. "Clustering high-dimensional data: A survey on subspace clustering, pattern-based clustering, and correlation clustering." (2009)  
 
* Kriegel, Hans-Peter, Peer Kröger, and Arthur Zimek. "Clustering high-dimensional data: A survey on subspace clustering, pattern-based clustering, and correlation clustering." (2009)  
 +
  
 
== Sources ==
 
== Sources ==
Line 30: Line 42:
 
* http://en.wikipedia.org/wiki/Clustering_high-dimensional_data
 
* http://en.wikipedia.org/wiki/Clustering_high-dimensional_data
 
* Ertöz, Levent, Michael Steinbach, and Vipin Kumar. "Finding clusters of different sizes, shapes, and densities in noisy, high dimensional data." 2003. [http://static.msi.umn.edu/rreports/2003/73.pdf]
 
* Ertöz, Levent, Michael Steinbach, and Vipin Kumar. "Finding clusters of different sizes, shapes, and densities in noisy, high dimensional data." 2003. [http://static.msi.umn.edu/rreports/2003/73.pdf]
 +
* Bengio, Yoshua. "A neural probabilistic language model.". 2003. [http://www.iro.umontreal.ca/~lisa/publications2/index.php/attachments/single/74]
  
 
[[Category:Distances]]
 
[[Category:Distances]]

Latest revision as of 23:13, 27 April 2017

Curse of Dimensionality

In high dimensional space distances (esp. Euclidean Distance) become less meaningful

  • distance between each pair of point is almost the same
  • for many data distributions and distances


$$\lim_{d \to \infty} \frac{\text{dist}_\max - \text{dist}_\min}{\text{dist}_\min} = 0$$


  • Curse of Dimensionality makes similarity functions behave poorly
  • KNN makes less sense
  • distances become more uniform as dimensionality grows
  • and this makes clustering difficult

Similarity between two point of high dimensionality can be misleading

  • often points may be similar even though they should belong to different clusters


How to deal?

NLP and Embeddings

solution: Word Embeddings

  • embed the words into some dense low-dimensional space
  • express the join distribution in terms of these embeddings


Papers

  • Beyer, Kevin, et al. "When is “nearest neighbor” meaningful?." 1999. [1]
  • Kriegel, Hans-Peter, Peer Kröger, and Arthur Zimek. "Clustering high-dimensional data: A survey on subspace clustering, pattern-based clustering, and correlation clustering." (2009)


Sources