Cluster Analysis
Clustering is about finding groups of similar objects in the data
How do we define similar?
- we measure with some similarity/distance function
Application of Clustering
- Customer Segmentation
- Clustering for Classification
- Collaborative Filtering
- Visualization
- Document organization and indexing
Similarity measures and Distances:
One-Sided Clustering
Partitioning Clustering
Idea:
- decompose dataset into $k$ disjoint classes
- s.t. each item belongs to exactly one class
Algorithms:
Idea:
Main approaches:
- Agglomerative Clustering
- at the beginning everything is a cluster on its own
- merge till have one big cluster
- Divisive Clustering
- at the beginning everything belongs to one big cluster
- split clusters until everything is a cluster on its own
Others:
Density-Based
Cluster neighborhood groups based on some density conditions
Grid-Based
Partition space into finite number of cells and perform clustering there
Other Types
Graph-Based Clustering
apply Graph Partitioning Algorithms:
- identify clusters by cutting edges from the graph
- s.t. the sum of cuts is minimal
- for example, Minimal Cut Algorithm
Algorithms:
Spectral Clustering
Link-Based Clustering
- can use Web-graph techniques
- PageRang and HITS are used for Ranking
- see Oikonomakou2010
Called Self-Organized Maps
Build a Neural Network with 2 layers:
- input layers: $n$ input nodes
- output: $k$ nodes: decision regions
Objective we want to maximize:
- within-region similarities of items
References:
sometimes also called "Soft Clustering"
- Usually clustering is "exclusive": the clustering algorithms assigns each object strictly to cluster
- but we can remove this restriction and modify the membership function s.t. an object can belong to several clusters
- so In this type of clustering a data point may belong to several clusters
Membership function
- computes for each object and for each cluster returns the degree of membership
- modification of K-Means: Fuzzy C-Means
- degree of membership to the cluster in C-Means depends on the distance from the document to the cluster centroid
Others:
Membership function outputs probabilities of an item belonging to a cluster
Algorithms:
One-sided clustering is clustering only rows of data matrix $D$
- co-clustering: clustering both rows and columns at the same time
Algorithms
Subspace clustering:
- it's the task of detecting all clusters in all subspaces
- a data point may belong to many different clusters - with each cluster in some subspace
Sources
- http://en.wikipedia.org/wiki/Cluster_analysis
- Jing, Liping. "Survey of text clustering." (2008). [2]
- Oikonomakou, Nora, and Michalis Vazirgiannis. "A review of web document clustering approaches." Data mining and knowledge discovery handbook. 2010. [3]
- Xu, Wei, Xin Liu, and Yihong Gong. "Document clustering based on non-negative matrix factorization." 2003. [4]