SNN Clustering
The goal:
- find clusters of different shapes, sizes and densities in high-dimensional data
- DBSCAN is good for finding clusters of different shapes and sizes, but it fails to find clusters with different densities
- it will find only one cluster:
- (figure source: Ertöz2003)
Distance:
- Euclidean Distance is not meaningful for high-dimensional data: distances between points become nearly uniform
- use a different similarity measure based on KNNs - "Shared Nearest Neighbors" (SNN)
- then define density in terms of this similarity (a small example follows this list)
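A tiny, made-up example of the SNN similarity (hypothetical KNN lists, $k = 5$): the similarity of two points is just the number of nearest neighbors they share.

```python
# SNN similarity: how many nearest neighbors two points share.
# The KNN lists below are made up purely for illustration (k = 5).
knn_p = {3, 7, 12, 25, 40}   # indices of the 5 nearest neighbors of p
knn_q = {7, 12, 19, 40, 52}  # indices of the 5 nearest neighbors of q

snn_similarity = len(knn_p & knn_q)  # shared neighbors: {7, 12, 40}
print(snn_similarity)                # 3
```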
Jarvis-Patrick Algorithm
"Jarvis-Patrick" algorithm, as in Jarvis1973
Step 1: SNN sparsification:
- construct an SNN graph from the data matrix as follows:
- if $p$ and $q$ have each other in their KNN lists
- then create a link between them
Step 2: Weighting
- weight the links with $\text{sim}(p, q) = \big| \, \text{NN}(p) \ \cap \ \text{NN}(q) \, \big|$
- where $\text{NN}(p)$ and $\text{NN}(q)$ are the $k$ nearest neighbors of $p$ and $q$ respectively
Step 3: Filtering
- then filter the edges:
- remove all edges with weight less than some threshold
Step 4: Clusters
- let all connected components of the remaining graph be the clusters (a code sketch of the whole procedure is below)
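A minimal sketch of the four steps above, assuming a NumPy data matrix `X`; the function name `jarvis_patrick` and the default values of `k` and `threshold` are illustrative, and scikit-learn's `NearestNeighbors` is used only to obtain the KNN lists.

```python
import numpy as np
from scipy.sparse.csgraph import connected_components
from sklearn.neighbors import NearestNeighbors

def jarvis_patrick(X, k=20, threshold=10):
    n = X.shape[0]

    # KNN list of every point (dropping the point itself)
    nn = NearestNeighbors(n_neighbors=k + 1).fit(X)
    knn = nn.kneighbors(X, return_distance=False)[:, 1:]
    knn_sets = [set(row) for row in knn]

    # Steps 1-2: link p and q only if each is in the other's KNN list,
    # and weight the link by the number of shared neighbors |NN(p) & NN(q)|
    adj = np.zeros((n, n), dtype=int)
    for p in range(n):
        for q in knn[p]:
            if p in knn_sets[q]:                     # mutual-KNN check
                w = len(knn_sets[p] & knn_sets[q])   # shared neighbors
                adj[p, q] = adj[q, p] = w

    # Step 3: filter out weak links
    adj[adj < threshold] = 0

    # Step 4: connected components of the filtered graph are the clusters
    _, labels = connected_components(adj != 0, directed=False)
    return labels
```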
Illustration
- (figure source: Ertöz2003)
- note that this procedure removes the noise
- and the clusters are of uniform density: the procedure breaks the links in the transition regions between clusters
Density
The usual notion of density is not suitable:
- In the Euclidean space, the density is the number of points per unit volume
- but as dimensionality increases, the volume increases rapidly
- so unless the number of points increases exponentially with dimensionality, the density tends to 0
- Density-based algorithms (e.g. DBSCAN) will not work properly (a small numeric illustration follows this list)
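A back-of-the-envelope illustration of this point, with made-up numbers: keep the number of points fixed and let the dimensionality grow, and the points-per-volume density collapses.

```python
# Illustrative numbers only: n points spread over a hypercube with side 10
n = 1000
for d in [1, 2, 5, 10, 20]:
    volume = 10.0 ** d
    print(d, n / volume)   # 100.0, 10.0, 0.01, 1e-07, 1e-17 - density tends to 0
```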
Need different intuition of density
- can use a related concept based on nearest neighbors:
- if $k$th nearest neighbor is close, then the region is most likely of high density
- so the distance to $k$th neighbor gives a measure of density of a point
- because of the Curse of Dimensionality, this approach does not work well with Euclidean Distance, Cosine Similarity or other direct measures
- but we can use the SNN similarity to define density
SNN-based measures of density:
- sum of SNN similarities over all KNNs of a point
- why the sum and not just the $k$th similarity?
- to reduce random variation - which happens when we look at only one neighbor
- to be consistent with the graph-based view of the problem
- or it can be the number of points within some radius - specified in terms of SNN similarity
- like in DBSCAN, but with the SNN-based measure instead of distance (both variants are sketched below)
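A minimal sketch of both variants, assuming `snn_sim` is an $n \times n$ NumPy array of shared-neighbor counts (e.g. the `adj` matrix from the Jarvis-Patrick sketch above), `knn[p]` holds the indices of the $k$ nearest neighbors of `p`, and `eps` is an illustrative threshold.

```python
def snn_density_sum(snn_sim, knn, p):
    """Density of p as the sum of its SNN similarities to its k nearest neighbors."""
    return snn_sim[p, knn[p]].sum()

def snn_density_count(snn_sim, knn, p, eps):
    """DBSCAN-style density: how many of p's KNNs have SNN similarity >= eps."""
    return int((snn_sim[p, knn[p]] >= eps).sum())
```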
SNN Clustering Algorithm
The SNN Clustering algorithm is a combination of:
- the Jarvis-Patrick algorithm and
- DBSCAN with SNN similarity and SNN density
Parameters
- $k$ - the number of nearest neighbors (the neighborhood size)
- $\epsilon$ - the SNN similarity threshold
- $\text{min\_pts} < k$ - the SNN density threshold for core points
Steps:
- compute the similarity matrix
- sparsify the matrix by keeping only $k$ most similar neighbors for each data point
- construct the SNN graph (using the Jarvis-Patrick algorithm)
- find the SNN density of each point $p$:
- in the KNN list of $p$, count the points $q$ s.t. $\text{sim}(p, q) \geqslant \epsilon$
- find the core points
- all points with SNN density greater than $\text{min\_pts}$ are the core ones
- form clusters from the core points
- all non-core points whose SNN similarity to every core point is below $\epsilon$ are discarded as noise
- assign the remaining non-core points to the cluster of their most similar core point (a sketch of the whole procedure follows this list)
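A compact sketch of these steps, assuming a NumPy data matrix `X`; the function name `snn_clustering` and the default values of `k`, `eps` and `min_pts` are illustrative only, not values from the paper.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

def snn_clustering(X, k=20, eps=7, min_pts=5):
    n = X.shape[0]

    # sparsified similarity: keep only the k nearest neighbors of each point
    nn = NearestNeighbors(n_neighbors=k + 1).fit(X)
    knn = nn.kneighbors(X, return_distance=False)[:, 1:]
    knn_sets = [set(row) for row in knn]

    # SNN similarity (Jarvis-Patrick): shared neighbors, only for mutual KNNs
    def snn_sim(p, q):
        if p in knn_sets[q] and q in knn_sets[p]:
            return len(knn_sets[p] & knn_sets[q])
        return 0

    # SNN density of p: neighbors q in p's KNN list with sim(p, q) >= eps
    density = np.array([sum(snn_sim(p, q) >= eps for q in knn[p]) for p in range(n)])
    core = np.flatnonzero(density > min_pts)     # core points

    labels = np.full(n, -1)                      # -1 marks noise / unassigned
    cluster_id = 0
    for c in core:                               # core points within eps share a cluster
        if labels[c] != -1:
            continue
        labels[c] = cluster_id
        stack = [c]
        while stack:
            p = stack.pop()
            for q in core:
                if labels[q] == -1 and snn_sim(p, q) >= eps:
                    labels[q] = cluster_id
                    stack.append(q)
        cluster_id += 1

    # non-core points: attach to a similar-enough core point, otherwise keep as noise
    for p in range(n):
        if labels[p] == -1 and len(core) > 0:
            best_sim, best_core = max((snn_sim(p, c), c) for c in core)
            if best_sim >= eps:
                labels[p] = labels[best_core]
    return labels
```

For example, `snn_clustering(X, k=30, eps=10, min_pts=15)` with a larger `k` would tend to merge points into fewer, larger clusters, in line with the tuning notes below.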
Parameter tuning:
- $k$ controls granularity of clusters
- if $k$ is small, then it will find small and very tight clusters
- if $k$ is large, it'll find big and well-separated clusters
Complexity
References
- Jarvis, Raymond A., and Edward A. Patrick. "Clustering using a similarity measure based on shared near neighbors." IEEE Transactions on Computers, 1973.
- See also: Houle, Michael E., et al. "Can shared-neighbor distances defeat the curse of dimensionality?" SSDBM, 2010.
Sources
- Ertöz, Levent, Michael Steinbach, and Vipin Kumar. "Finding clusters of different sizes, shapes, and densities in noisy, high dimensional data." SIAM International Conference on Data Mining (SDM), 2003.