Line 1: | Line 1: | ||
+ | |||
== Curse of Dimensionality == | == Curse of Dimensionality == | ||
In high dimensional space distances (esp. [[Euclidean Distance]]) become less meaningful | In high dimensional space distances (esp. [[Euclidean Distance]]) become less meaningful | ||
Line 21: | Line 22: | ||
* [[Subspace Clustering]] | * [[Subspace Clustering]] | ||
+ | == NLP and Embeddings == | ||
+ | * Suppose we want to make a [[Language Model]] | ||
+ | * How to model the [[Joint Distribution]] between many [[Discrete Random Variables]]? | ||
+ | ** e.g. words in a sentence | ||
+ | ** 10 constitutive words in a language with vocabulary 100k - then there are $100\, 000^10 -1 = 10^{50} - 1$ free parameters | ||
− | == | + | solution: [[Word Embeddings]] |
+ | * embed the words into some dense low-dimensional space | ||
+ | * express the join distribution in terms of these embeddings | ||
+ | |||
+ | |||
+ | == Papers == | ||
* Beyer, Kevin, et al. "When is “nearest neighbor” meaningful?." 1999. [http://www.loria.fr/~berger/Enseignement/Master2/Exposes/beyer.pdf] | * Beyer, Kevin, et al. "When is “nearest neighbor” meaningful?." 1999. [http://www.loria.fr/~berger/Enseignement/Master2/Exposes/beyer.pdf] | ||
* Kriegel, Hans-Peter, Peer Kröger, and Arthur Zimek. "Clustering high-dimensional data: A survey on subspace clustering, pattern-based clustering, and correlation clustering." (2009) | * Kriegel, Hans-Peter, Peer Kröger, and Arthur Zimek. "Clustering high-dimensional data: A survey on subspace clustering, pattern-based clustering, and correlation clustering." (2009) | ||
+ | |||
== Sources == | == Sources == | ||
Line 30: | Line 42: | ||
* http://en.wikipedia.org/wiki/Clustering_high-dimensional_data | * http://en.wikipedia.org/wiki/Clustering_high-dimensional_data | ||
* Ertöz, Levent, Michael Steinbach, and Vipin Kumar. "Finding clusters of different sizes, shapes, and densities in noisy, high dimensional data." 2003. [http://static.msi.umn.edu/rreports/2003/73.pdf] | * Ertöz, Levent, Michael Steinbach, and Vipin Kumar. "Finding clusters of different sizes, shapes, and densities in noisy, high dimensional data." 2003. [http://static.msi.umn.edu/rreports/2003/73.pdf] | ||
+ | * Bengio, Yoshua. "A neural probabilistic language model.". 2003. [http://www.iro.umontreal.ca/~lisa/publications2/index.php/attachments/single/74] | ||
[[Category:Distances]] | [[Category:Distances]] |
In high dimensional space distances (esp. Euclidean Distance) become less meaningful
$$\lim_{d \to \infty} \frac{\text{dist}_\max - \text{dist}_\min}{\text{dist}_\min} = 0$$
Similarity between two point of high dimensionality can be misleading
How to deal?
solution: Word Embeddings