Cluster Analysis

cluster-analysis machine-learning unsupervised-learning

Cluster Analysis

Clustering is about finding groups of similar objects in the data

How do we define similar?

we measure with some similarity/distance function

Application of Clustering

Customer Segmentation
Clustering for Classification
Collaborative Filtering
Visualization
Document organization and indexing

Similarity measures and Distances:

One-Sided Clustering

Partitioning Clustering

Idea:

decompose dataset into $k$ disjoint classes
s.t. each item belongs to exactly one class

Algorithms:

K-Means and variants like K-Medoids
CURE Clustering

Hierarchical Clustering

Idea:

build a tree

Main approaches:

Agglomerative Clustering
- at the beginning everything is a cluster on its own
- merge till have one big cluster
Divisive Clustering
- at the beginning everything belongs to one big cluster
- split clusters until everything is a cluster on its own

Others:

Chameleon Clustering

Density-Based

Cluster neighborhood groups based on some density conditions

DBSCAN
SNN Clustering extension of DBSCAN with notion of similarity based on $k$-nearest neighbors

Grid-Based

Partition space into finite number of cells and perform clustering there

Other Types

Graph-Based Clustering

apply Graph Partitioning Algorithms:

identify clusters by cutting edges from the graph
s.t. the sum of cuts is minimal
for example, Minimal Cut Algorithm

Algorithms:

Chameleon Clustering

Spectral Clustering

apply Graph Partitioning but in some high-dimensional space
usually involves computing Singular Values and Vectors / Eigenvalues and Eigenvectors of the graph affinity matrix
usually has global optimum
criteria: Average Cut, Average Association, Normalized Cut, Min-Max Cut
when applied to documents, under certain conditions resulting eigenspaces are equivalent to semantic spaces found by Latent Semantic Analysis
but also like in LSA, usually found directions don’t correspond to clusters directly, need to do additional clustering afterwards (e.g. by K-Means)

Link-Based Clustering

can use Web-graph techniques
PageRang and HITS are used for Ranking
see Oikonomakou2010

Neural Networks Based

Called Self-Organized Maps

Build a Neural Network with 2 layers:

input layers: $n$ input nodes
output: $k$ nodes: decision regions

Objective we want to maximize:

within-region similarities of items

References:

https://en.wikipedia.org/wiki/Self-organizing_map

Fuzzy Clustering

sometimes also called “Soft Clustering”

Usually clustering is “exclusive”: the clustering algorithms assigns each object strictly to cluster
but we can remove this restriction and modify the membership function s.t. an object can belong to several clusters
so In this type of clustering a data point may belong to several clusters

Membership function

computes for each object and for each cluster returns the degree of membership
modification of K-Means: Fuzzy C-Means
degree of membership to the cluster in C-Means depends on the distance from the document to the cluster centroid

Others:

clustering via Non-Negative Matrix Factorization can be fuzzy
also, Looney, Carl G. “A fuzzy clustering and fuzzy merging algorithm.” 1999. [https://www.researchgate.net/publication/2467319_A_Fuzzy_Clustering_and_Fuzzy_Merging_Algorithm]

Probabilistic Clustering

Membership function outputs probabilities of an item belonging to a cluster

Algorithms:

Co-Clustering

One-sided clustering is clustering only rows of data matrix $D$

co-clustering: clustering both rows and columns at the same time

Algorithms

clustering via Non-Negative Matrix Factorization can be viewed as clustering both columns and rows

Subspace Clustering

Subspace clustering:

it’s the task of detecting all clusters in all subspaces
a data point may belong to many different clusters - with each cluster in some subspace

Sources

http://en.wikipedia.org/wiki/Cluster_analysis
Jing, Liping. “Survey of text clustering.” (2008). [http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.112.3476&rep=rep1&type=pdf]
Oikonomakou, Nora, and Michalis Vazirgiannis. “A review of web document clustering approaches.” Data mining and knowledge discovery handbook. 2010. [https://scholar.google.com/scholar?cluster=1261203777431390097&hl=ru&as_sdt=0,5]
Xu, Wei, Xin Liu, and Yihong Gong. “Document clustering based on non-negative matrix factorization.” 2003. [http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.117.2293&rep=rep1&type=pdf]

✏️ Edit on GitHub