ML Wiki
Machine Learning Wiki - A collection of ML concepts, algorithms, and resources.

Cluster Analysis

Cluster Analysis

Clustering is about finding groups of similar objects in the data

How do we define similar?

  • we measure with some similarity/distance function

Application of Clustering

  • Customer Segmentation
  • Clustering for Classification
  • Collaborative Filtering
  • Visualization
  • Document organization and indexing

Similarity measures and Distances:

One-Sided Clustering

Partitioning Clustering

Idea:

  • decompose dataset into $k$ disjoint classes
  • s.t. each item belongs to exactly one class

Algorithms:

Hierarchical Clustering

Idea:

  • build a tree

Main approaches:

  • Agglomerative Clustering
    • at the beginning everything is a cluster on its own
    • merge till have one big cluster
  • Divisive Clustering
    • at the beginning everything belongs to one big cluster
    • split clusters until everything is a cluster on its own

Others:

Density-Based

Cluster neighborhood groups based on some density conditions

Grid-Based

Partition space into finite number of cells and perform clustering there

Other Types

Graph-Based Clustering

apply Graph Partitioning Algorithms:

  • identify clusters by cutting edges from the graph
  • s.t. the sum of cuts is minimal
  • for example, Minimal Cut Algorithm

Algorithms:

Spectral Clustering

  • apply Graph Partitioning but in some high-dimensional space
  • usually involves computing Singular Values and Vectors / Eigenvalues and Eigenvectors of the graph affinity matrix
  • usually has global optimum
  • criteria: Average Cut, Average Association, Normalized Cut, Min-Max Cut
  • when applied to documents, under certain conditions resulting eigenspaces are equivalent to semantic spaces found by Latent Semantic Analysis
  • but also like in LSA, usually found directions don’t correspond to clusters directly, need to do additional clustering afterwards (e.g. by K-Means)

Link-Based Clustering

  • can use Web-graph techniques
  • PageRang and HITS are used for Ranking
  • see Oikonomakou2010

Neural Networks Based

Called Self-Organized Maps

Build a Neural Network with 2 layers:

  • input layers: $n$ input nodes
  • output: $k$ nodes: decision regions

Objective we want to maximize:

  • within-region similarities of items

References:

  • https://en.wikipedia.org/wiki/Self-organizing_map

Fuzzy Clustering

sometimes also called “Soft Clustering”

  • Usually clustering is “exclusive”: the clustering algorithms assigns each object strictly to cluster
  • but we can remove this restriction and modify the membership function s.t. an object can belong to several clusters
  • so In this type of clustering a data point may belong to several clusters

Membership function

  • computes for each object and for each cluster returns the degree of membership
  • modification of K-Means: Fuzzy C-Means
  • degree of membership to the cluster in C-Means depends on the distance from the document to the cluster centroid

Others:

  • clustering via Non-Negative Matrix Factorization can be fuzzy
  • also, Looney, Carl G. “A fuzzy clustering and fuzzy merging algorithm.” 1999. [https://www.researchgate.net/publication/2467319_A_Fuzzy_Clustering_and_Fuzzy_Merging_Algorithm]

Probabilistic Clustering

Membership function outputs probabilities of an item belonging to a cluster

Algorithms:

Co-Clustering

One-sided clustering is clustering only rows of data matrix $D$

  • co-clustering: clustering both rows and columns at the same time

Algorithms

Subspace Clustering

Subspace clustering:

  • it’s the task of detecting all clusters in all subspaces
  • a data point may belong to many different clusters - with each cluster in some subspace

Sources

  • http://en.wikipedia.org/wiki/Cluster_analysis
  • Jing, Liping. “Survey of text clustering.” (2008). [http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.112.3476&rep=rep1&type=pdf]
  • Oikonomakou, Nora, and Michalis Vazirgiannis. “A review of web document clustering approaches.” Data mining and knowledge discovery handbook. 2010. [https://scholar.google.com/scholar?cluster=1261203777431390097&hl=ru&as_sdt=0,5]
  • Xu, Wei, Xin Liu, and Yihong Gong. “Document clustering based on non-negative matrix factorization.” 2003. [http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.117.2293&rep=rep1&type=pdf]