# ML Wiki

## Cluster Analysis

Clustering is about finding groups of similar objects in the data

How do we define similar?

• we measure with some similarity/distance function

Application of Clustering

• Customer Segmentation
• Clustering for Classification
• Collaborative Filtering
• Visualization
• Document organization and indexing

## One-Sided Clustering

### Partitioning Clustering

Idea:

• decompose dataset into $k$ disjoint classes
• s.t. each item belongs to exactly one class

Algorithms:

### Hierarchical Clustering

Idea:

• build a tree

Main approaches:

• Agglomerative Clustering
• at the beginning everything is a cluster on its own
• merge till have one big cluster
• Divisive Clustering
• at the beginning everything belongs to one big cluster
• split clusters until everything is a cluster on its own

Others:

### Density-Based

Cluster neighborhood groups based on some density conditions

### Grid-Based

Partition space into finite number of cells and perform clustering there

## Other Types

### Graph-Based Clustering

apply Graph Partitioning Algorithms:

• identify clusters by cutting edges from the graph
• s.t. the sum of cuts is minimal
• for example, Minimal Cut Algorithm

Algorithms:

• can use Web-graph techniques
• PageRang and HITS are used for Ranking
• see Oikonomakou2010

### Neural Networks Based

Called Self-Organized Maps

Build a Neural Network with 2 layers:

• input layers: $n$ input nodes
• output: $k$ nodes: decision regions

Objective we want to maximize:

• within-region similarities of items

References:

### Fuzzy Clustering

sometimes also called "Soft Clustering"

• Usually clustering is "exclusive": the clustering algorithms assigns each object strictly to cluster
• but we can remove this restriction and modify the membership function s.t. an object can belong to several clusters
• so In this type of clustering a data point may belong to several clusters

Membership function

• computes for each object and for each cluster returns the degree of membership
• modification of K-Means: Fuzzy C-Means
• degree of membership to the cluster in C-Means depends on the distance from the document to the cluster centroid

Others:

### Probabilistic Clustering

Membership function outputs probabilities of an item belonging to a cluster

Algorithms:

### Co-Clustering

One-sided clustering is clustering only rows of data matrix $D$

• co-clustering: clustering both rows and columns at the same time

Algorithms

### Subspace Clustering

Subspace clustering:

• it's the task of detecting all clusters in all subspaces
• a data point may belong to many different clusters - with each cluster in some subspace

## Sources

• http://en.wikipedia.org/wiki/Cluster_analysis
• Jing, Liping. "Survey of text clustering." (2008). [2]
• Oikonomakou, Nora, and Michalis Vazirgiannis. "A review of web document clustering approaches." Data mining and knowledge discovery handbook. 2010. [3]
• Xu, Wei, Xin Liu, and Yihong Gong. "Document clustering based on non-negative matrix factorization." 2003. [4]