Clustering is about finding groups of similar objects in the data

How do we define similar?

- we measure it with some similarity or distance function
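For example, two common choices for vector data are Euclidean distance and cosine similarity; a minimal sketch:

```python
import math

def euclidean(x, y):
    # straight-line distance between two vectors
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def cosine_similarity(x, y):
    # cosine of the angle between two vectors: 1 = same direction, 0 = orthogonal
    dot = sum(a * b for a, b in zip(x, y))
    nx = math.sqrt(sum(a * a for a in x))
    ny = math.sqrt(sum(b * b for b in y))
    return dot / (nx * ny)

print(euclidean([0, 0], [3, 4]))          # 5.0
print(cosine_similarity([1, 0], [1, 0]))  # 1.0
```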

Applications of Clustering

- Customer Segmentation
- Clustering for Classification
- Collaborative Filtering
- Visualization
- Document organization and indexing

Similarity measures and Distances:

Partitioning Clustering:

Idea:

- decompose the dataset into $k$ disjoint classes
- s.t. each item belongs to exactly one class

Algorithms:

- K-Means and variants like K-Medoids
- CURE Clustering
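The classic K-Means loop (Lloyd's algorithm) alternates between assigning points to the nearest centroid and recomputing centroids as cluster means; a toy sketch on 1-D points, not a production implementation:

```python
import random

def k_means(points, k, iters=20, seed=0):
    # pick k initial centroids at random, then alternate assignment/update
    rng = random.Random(seed)
    centroids = rng.sample(points, k)
    for _ in range(iters):
        # assignment step: each point goes to its nearest centroid
        clusters = [[] for _ in range(k)]
        for p in points:
            i = min(range(k), key=lambda i: abs(p - centroids[i]))
            clusters[i].append(p)
        # update step: centroid = mean of its cluster (keep empty clusters in place)
        centroids = [sum(c) / len(c) if c else centroids[i]
                     for i, c in enumerate(clusters)]
    return centroids, clusters

data = [1.0, 1.2, 0.8, 10.0, 10.3, 9.7]
centroids, clusters = k_means(data, k=2)
print(sorted(centroids))  # roughly [1.0, 10.0]
```

K-Medoids follows the same loop but restricts centroids to actual data points, which makes it usable with arbitrary distance functions.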

Hierarchical Clustering:

Idea:

- build a tree

Main approaches:

- Agglomerative Clustering
  - at the beginning, every item is a cluster on its own
  - merge the closest clusters until only one big cluster remains

- Divisive Clustering
  - at the beginning, everything belongs to one big cluster
  - split clusters until every item is a cluster on its own
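An agglomerative run can be sketched directly: start with singleton clusters and repeatedly merge the closest pair (here with single-link distance on 1-D points), stopping when the desired number of clusters remains:

```python
def single_link(c1, c2):
    # single-link: distance between the closest pair of members
    return min(abs(a - b) for a in c1 for b in c2)

def agglomerative(points, target_k):
    # start with every point in its own cluster, merge the closest pair
    clusters = [[p] for p in points]
    while len(clusters) > target_k:
        i, j = min(((i, j) for i in range(len(clusters))
                    for j in range(i + 1, len(clusters))),
                   key=lambda ij: single_link(clusters[ij[0]], clusters[ij[1]]))
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    return clusters

# two clusters: the low points and the high points
print(agglomerative([1.0, 1.1, 1.2, 9.0, 9.1, 9.3], target_k=2))
```

Running the merges all the way to one cluster while recording each merge yields the full tree (dendrogram).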

Others:

Density-based: cluster objects into neighborhood groups that satisfy some density condition

- DBSCAN
- SNN Clustering: an extension of DBSCAN with a notion of similarity based on the $k$-nearest neighbors
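The DBSCAN idea in a nutshell: points with at least `min_pts` neighbors within radius `eps` are core points, clusters grow outward from them, and everything unreachable stays noise. A toy sketch (a point counts as its own neighbor here):

```python
def dbscan(points, eps, min_pts):
    # labels: None = unvisited, -1 = noise, >= 0 = cluster id
    def dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5

    def neighbors(i):
        return [j for j in range(len(points)) if dist(points[i], points[j]) <= eps]

    labels = [None] * len(points)
    cluster = 0
    for i in range(len(points)):
        if labels[i] is not None:
            continue
        nbrs = neighbors(i)
        if len(nbrs) < min_pts:
            labels[i] = -1  # noise for now; may become a border point later
            continue
        # i is a core point: start a new cluster and expand it
        labels[i] = cluster
        queue = list(nbrs)
        while queue:
            j = queue.pop()
            if labels[j] == -1:
                labels[j] = cluster  # former noise becomes a border point
            if labels[j] is not None:
                continue
            labels[j] = cluster
            j_nbrs = neighbors(j)
            if len(j_nbrs) >= min_pts:  # j is also a core point: keep expanding
                queue.extend(j_nbrs)
        cluster += 1
    return labels

pts = [(0, 0), (0.1, 0), (0.2, 0.1), (5, 5), (5.1, 5), (5, 5.1), (20, 20)]
print(dbscan(pts, eps=0.5, min_pts=2))  # [0, 0, 0, 1, 1, 1, -1]
```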

Grid-based: partition the space into a finite number of cells and perform clustering within them

Graph-based: apply Graph Partitioning Algorithms:

- identify clusters by cutting edges from the graph
- s.t. the sum of cuts is minimal
- for example, Minimal Cut Algorithm
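On a tiny graph the minimal cut can be found by brute force over all bipartitions; a sketch of the idea only (practical algorithms such as Stoer-Wagner avoid this exponential enumeration):

```python
from itertools import combinations

def min_cut(nodes, edges):
    # edges: dict {(u, v): weight}; try every bipartition, keep the cheapest cut
    # (each bipartition is visited twice, once per side, which is harmless)
    best, best_part = float("inf"), None
    nodes = list(nodes)
    for r in range(1, len(nodes)):
        for side in combinations(nodes, r):
            side = set(side)
            cut = sum(w for (u, v), w in edges.items()
                      if (u in side) != (v in side))
            if cut < best:
                best, best_part = cut, side
    return best, best_part

# two triangles joined by a single light edge
edges = {("a", "b"): 1, ("b", "c"): 1, ("a", "c"): 1,
         ("d", "e"): 1, ("e", "f"): 1, ("d", "f"): 1,
         ("c", "d"): 0.5}
print(min_cut("abcdef", edges))  # cut weight 0.5, separating {a, b, c} from {d, e, f}
```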

Spectral Clustering:

- apply Graph Partitioning, but in some high-dimensional space
- usually involves computing Singular Values and Vectors / Eigenvalues and Eigenvectors of the graph affinity matrix
- usually has a global optimum
- criteria: Average Cut, Average Association, Normalized Cut, Min-Max Cut
- when applied to documents, under certain conditions the resulting eigenspaces are equivalent to the semantic spaces found by Latent Semantic Analysis
- but, as in LSA, the found directions usually don't correspond to clusters directly, so additional clustering (e.g. with K-Means) is needed afterwards
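A sketch of the spectral step on a toy graph: the sign pattern of the Fiedler vector (the eigenvector of the second-smallest Laplacian eigenvalue) gives a two-way partition. Here it is computed with stdlib-only power iteration on the shifted matrix $cI - L$, projecting out the all-ones eigenvector each step; a full spectral method would embed the data into several eigenvectors and run K-Means on the embedding:

```python
import random

def fiedler_sign_partition(adj, iters=500, seed=1):
    # adj: symmetric 0/1 adjacency matrix (list of lists)
    n = len(adj)
    deg = [sum(row) for row in adj]
    c = 2 * max(deg)  # shift: M = c*I - L is positive semidefinite

    def mat_vec(x):
        # y = (c*I - L) x, with L = D - A
        return [(c - deg[i]) * x[i] + sum(adj[i][j] * x[j] for j in range(n))
                for i in range(n)]

    rng = random.Random(seed)
    x = [rng.random() - 0.5 for _ in range(n)]
    for _ in range(iters):
        mean = sum(x) / n
        x = [v - mean for v in x]  # project out the all-ones eigenvector
        x = mat_vec(x)
        norm = sum(v * v for v in x) ** 0.5
        x = [v / norm for v in x]
    return [1 if v >= 0 else 0 for v in x]  # sign pattern = two-way partition

# two triangles (0, 1, 2) and (3, 4, 5) joined by the single edge 2-3
adj = [[0, 1, 1, 0, 0, 0],
       [1, 0, 1, 0, 0, 0],
       [1, 1, 0, 1, 0, 0],
       [0, 0, 1, 0, 1, 1],
       [0, 0, 0, 1, 0, 1],
       [0, 0, 0, 1, 1, 0]]
print(fiedler_sign_partition(adj))
```

The output separates {0, 1, 2} from {3, 4, 5}; which side gets label 1 depends on the random sign of the iterate.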

Link-Based Clustering

Self-Organizing Maps (SOM)

Build a Neural Network with 2 layers:

- input layer: $n$ input nodes
- output layer: $k$ nodes, one per decision region

Objective we want to maximize:

- within-region similarities of items
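A minimal sketch of the training loop, assuming 1-D inputs and $k$ output nodes arranged on a line: the winning node and its grid neighbors move toward each input, with a neighborhood that shrinks over time (the hyperparameters here are illustrative, not canonical):

```python
import random

def train_som(data, k=4, epochs=50, lr=0.3, seed=0):
    # minimal 1-D SOM: k output nodes sit on a line; the winning node and
    # its grid neighbors move toward each input, weaker with grid distance
    rng = random.Random(seed)
    weights = [rng.uniform(min(data), max(data)) for _ in range(k)]
    for epoch in range(epochs):
        radius = 1 if epoch < epochs // 2 else 0  # shrink the neighborhood
        samples = data[:]
        rng.shuffle(samples)
        for x in samples:
            winner = min(range(k), key=lambda i: abs(x - weights[i]))
            lo, hi = max(0, winner - radius), min(k - 1, winner + radius)
            for i in range(lo, hi + 1):
                strength = 0.5 ** abs(i - winner)  # neighbors move less
                weights[i] += lr * strength * (x - weights[i])
    return weights

data = [1.0, 1.1, 0.9, 5.0, 5.1, 4.9, 9.0, 9.1, 8.9]
print(sorted(train_som(data)))
```

After training, each output node's weight acts as a prototype for its decision region: a new item is assigned to the node with the closest weight.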

Fuzzy Clustering:

sometimes also called "Soft Clustering"

- Usually clustering is "exclusive": the algorithm assigns each object to exactly one cluster
- we can remove this restriction and modify the membership function s.t. a data point may belong to several clusters

Membership function:

- returns, for each object and each cluster, the degree of membership
- modification of K-Means: Fuzzy C-Means
- in C-Means, the degree of membership to a cluster depends on the distance from the document to the cluster centroid
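The standard Fuzzy C-Means membership formula, $u_i = 1 / \sum_k (d_i / d_k)^{2/(m-1)}$ with fuzzifier $m$, can be sketched directly (here for 1-D points and fixed centroids; a full C-Means would alternate this step with centroid updates):

```python
def fuzzy_memberships(x, centroids, m=2.0):
    # degree of membership of point x to each centroid:
    # u_i = 1 / sum_k (d_i / d_k)^(2/(m-1)), where d_i = |x - c_i|
    dists = [abs(x - c) for c in centroids]
    if 0.0 in dists:  # x coincides with a centroid: full membership there
        return [1.0 if d == 0.0 else 0.0 for d in dists]
    power = 2.0 / (m - 1.0)
    return [1.0 / sum((d_i / d_k) ** power for d_k in dists) for d_i in dists]

u = fuzzy_memberships(2.0, centroids=[1.0, 9.0], m=2.0)
print(u)  # point near the first centroid: memberships ≈ [0.98, 0.02]
```

The memberships always sum to 1 across clusters; larger $m$ makes the assignment fuzzier.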

Others:

- clustering via Non-Negative Matrix Factorization can be fuzzy
- also, Looney, Carl G. "A fuzzy clustering and fuzzy merging algorithm." 1999. [1]

Probabilistic Clustering:

The membership function outputs the probability of an item belonging to each cluster
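For example, if each cluster is modeled as a 1-D Gaussian, the membership of a point is its posterior probability under each component (as in mixture models); a sketch with fixed, hand-chosen components:

```python
import math

def posterior_memberships(x, components):
    # components: list of (weight, mean, std) for 1-D Gaussians;
    # returns p(cluster | x) by Bayes' rule
    def gauss_pdf(x, mu, sigma):
        z = (x - mu) / sigma
        return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2 * math.pi))

    likelihoods = [w * gauss_pdf(x, mu, s) for w, mu, s in components]
    total = sum(likelihoods)
    return [l / total for l in likelihoods]

mix = [(0.5, 0.0, 1.0), (0.5, 10.0, 1.0)]
print(posterior_memberships(1.0, mix))  # almost certainly the first cluster
print(posterior_memberships(5.0, mix))  # exactly between the two: [0.5, 0.5]
```

In practice the component parameters are not fixed but estimated from the data, typically with the Expectation-Maximization algorithm.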

Algorithms:

One-sided clustering is clustering only the rows of the data matrix $D$

- co-clustering: clustering both rows and columns at the same time

Algorithms:

- clustering via Non-Negative Matrix Factorization can be viewed as clustering both columns and rows

Subspace clustering:

- the task of detecting all clusters in all subspaces
- a data point may belong to several clusters, each cluster existing in a different subspace

References:

- http://en.wikipedia.org/wiki/Cluster_analysis
- Jing, Liping. "Survey of text clustering." (2008). [2]
- Oikonomakou, Nora, and Michalis Vazirgiannis. "A review of web document clustering approaches." Data mining and knowledge discovery handbook. 2010. [3]
- Xu, Wei, Xin Liu, and Yihong Gong. "Document clustering based on non-negative matrix factorization." 2003. [4]