Euclidean Distance

distances norms

Euclidean distance is a geometric Distance between two datapoints

distance between $\mathbf x_1$ and $\mathbf x_2$ is
length of the line that connects these two points:
$| \mathbf x_1 - \mathbf x_2 | = \sqrt{ (\mathbf x_1 - \mathbf x_2)^T (\mathbf x_1 - \mathbf x_2) } = \sqrt{\sum_i (x_{1i} - x_{2i})^2}$

It’s the most common distance metric, also called $L_2$ norm

The Euclidean distance is translation invariant

let $\mathbf a$ be some vector
then consider the distance in the translated space :
$| (\mathbf x_1 - \mathbf a) - (\mathbf x_2 - \mathbf a) | = | \mathbf x_1 - \mathbf a - \mathbf x_2 + \mathbf a | = | \mathbf x_1 - \mathbf x_2 |$
so the distance in the translated space is the same as in the original space

High Dimensionality

Euclidean distance is not always meaningful for high dimensional data

Consider this example:

$A_1$

$A_2$

$A_3$

$A_4$

$A_5$

$A_6$

$A_7$

$A_8$

$A_9$

$A_{10}$

$\mathbf p_1$

$\mathbf p_2$

$\mathbf p_3$

$\mathbf p_4$

distance between $\mathbf p_1$ and $\mathbf p_2$ is $| \mathbf p_1 - \mathbf p_2| = 5$
distance between $\mathbf p_3$ and $\mathbf p_4$ is $| \mathbf p_3 - \mathbf p_4| = 5$
so Euclidean distance between these two vectors is the same
but suppose these vectors correspond to documents and words (Vector Space Models)
$\mathbf p_3$ and $\mathbf p_4$ must be more similar to each other than $\mathbf p_1$ and $\mathbf p_2$: $\mathbf p_3$ and $\mathbf p_4$ have 7 words in common whereas $\mathbf p_1$ and $\mathbf p_2$ have none

When the data is sparse it’s better to use different measure of distance/similarity

http://en.wikipedia.org/wiki/Euclidean_distance
Ertöz, Levent, Michael Steinbach, and Vipin Kumar. “Finding clusters of different sizes, shapes, and densities in noisy, high dimensional data.” 2003. link
Korenius, Tuomo, Jorma Laurikkala, and Martti Juhola. “On principal component analysis, cosine and Euclidean measures in information retrieval.” 2007. link

✏️ Edit on GitHub