Cluster Analysis

Clustering is about finding groups of similar objects in the data

How do we define similar?

  • we measure with some similarity/distance function

Application of Clustering

  • Customer Segmentation
  • Clustering for Classification
  • Collaborative Filtering
  • Visualization
  • Document organization and indexing

Similarity measures and Distances:

One-Sided Clustering

Partitioning Clustering


  • decompose dataset into $k$ disjoint classes
  • s.t. each item belongs to exactly one class


Hierarchical Clustering


  • build a tree

Main approaches:

  • Agglomerative Clustering
    • at the beginning everything is a cluster on its own
    • merge till have one big cluster
  • Divisive Clustering
    • at the beginning everything belongs to one big cluster
    • split clusters until everything is a cluster on its own



Cluster neighborhood groups based on some density conditions


Partition space into finite number of cells and perform clustering there

Other Types

Graph-Based Clustering

apply Graph Partitioning Algorithms:

  • identify clusters by cutting edges from the graph
  • s.t. the sum of cuts is minimal
  • for example, Minimal Cut Algorithm


Spectral Clustering

Link-Based Clustering

  • can use Web-graph techniques
  • PageRang and HITS are used for Ranking
  • see Oikonomakou2010

Neural Networks Based

Called Self-Organized Maps

Build a Neural Network with 2 layers:

  • input layers: $n$ input nodes
  • output: $k$ nodes: decision regions

Objective we want to maximize:

  • within-region similarities of items


Fuzzy Clustering

sometimes also called "Soft Clustering"

  • Usually clustering is "exclusive": the clustering algorithms assigns each object strictly to cluster
  • but we can remove this restriction and modify the membership function s.t. an object can belong to several clusters
  • so In this type of clustering a data point may belong to several clusters

Membership function

  • computes for each object and for each cluster returns the degree of membership
  • modification of K-Means: Fuzzy C-Means
  • degree of membership to the cluster in C-Means depends on the distance from the document to the cluster centroid


Probabilistic Clustering

Membership function outputs probabilities of an item belonging to a cluster



One-sided clustering is clustering only rows of data matrix $D$

  • co-clustering: clustering both rows and columns at the same time


Subspace Clustering

Subspace clustering:

  • it's the task of detecting all clusters in all subspaces
  • a data point may belong to many different clusters - with each cluster in some subspace


  • Jing, Liping. "Survey of text clustering." (2008). [2]
  • Oikonomakou, Nora, and Michalis Vazirgiannis. "A review of web document clustering approaches." Data mining and knowledge discovery handbook. 2010. [3]
  • Xu, Wei, Xin Liu, and Yihong Gong. "Document clustering based on non-negative matrix factorization." 2003. [4]
Share your opinion