Cluster Analysis

Clustering is about finding groups of similar objects in the data

How do we define similar?

  • we measure with some similarity/distance function


Application of Clustering

  • Customer Segmentation
  • Clustering for Classification
  • Collaborative Filtering
  • Visualization
  • Document organization and indexing


Similarity measures and Distances:


One-Sided Clustering

Partitioning Clustering

Idea:

  • decompose dataset into $k$ disjoint classes
  • s.t. each item belongs to exactly one class


Algorithms:


Hierarchical Clustering

Idea:

  • build a tree

Main approaches:

  • Agglomerative Clustering
    • at the beginning everything is a cluster on its own
    • merge till have one big cluster
  • Divisive Clustering
    • at the beginning everything belongs to one big cluster
    • split clusters until everything is a cluster on its own


Others:


Density-Based

Cluster neighborhood groups based on some density conditions


Grid-Based

Partition space into finite number of cells and perform clustering there


Other Types

Graph-Based Clustering

apply Graph Partitioning Algorithms:

  • identify clusters by cutting edges from the graph
  • s.t. the sum of cuts is minimal
  • for example, Minimal Cut Algorithm

Algorithms:


Spectral Clustering


Link-Based Clustering

  • can use Web-graph techniques
  • PageRang and HITS are used for Ranking
  • see Oikonomakou2010


Neural Networks Based

Called Self-Organized Maps

Build a Neural Network with 2 layers:

  • input layers: $n$ input nodes
  • output: $k$ nodes: decision regions


Objective we want to maximize:

  • within-region similarities of items


References:


Fuzzy Clustering

sometimes also called "Soft Clustering"

  • Usually clustering is "exclusive": the clustering algorithms assigns each object strictly to cluster
  • but we can remove this restriction and modify the membership function s.t. an object can belong to several clusters
  • so In this type of clustering a data point may belong to several clusters


Membership function

  • computes for each object and for each cluster returns the degree of membership
  • modification of K-Means: Fuzzy C-Means
  • degree of membership to the cluster in C-Means depends on the distance from the document to the cluster centroid


Others:


Probabilistic Clustering

Membership function outputs probabilities of an item belonging to a cluster


Algorithms:


Co-Clustering

One-sided clustering is clustering only rows of data matrix $D$

  • co-clustering: clustering both rows and columns at the same time


Algorithms


Subspace Clustering

Subspace clustering:

  • it's the task of detecting all clusters in all subspaces
  • a data point may belong to many different clusters - with each cluster in some subspace


Sources

  • http://en.wikipedia.org/wiki/Cluster_analysis
  • Jing, Liping. "Survey of text clustering." (2008). [2]
  • Oikonomakou, Nora, and Michalis Vazirgiannis. "A review of web document clustering approaches." Data mining and knowledge discovery handbook. 2010. [3]
  • Xu, Wei, Xin Liu, and Yihong Gong. "Document clustering based on non-negative matrix factorization." 2003. [4]