Cluster Analysis
Clustering is about finding groups of similar objects in the data
How do we define similar?
- we measure with some similarity/distance function
Application of Clustering
- Customer Segmentation
- Clustering for Classification
- Collaborative Filtering
- Visualization
- Document organization and indexing
Similarity measures and Distances:
One-Sided Clustering
Partitioning Clustering
Idea:
- decompose dataset into $k$ disjoint classes
- s.t. each item belongs to exactly one class
Algorithms:
- K-Means and variants like K-Medoids
- CURE Clustering
Hierarchical Clustering
Idea:
- build a tree
Main approaches:
- Agglomerative Clustering
- at the beginning everything is a cluster on its own
- merge till have one big cluster
- Divisive Clustering
- at the beginning everything belongs to one big cluster
- split clusters until everything is a cluster on its own
Others:
Density-Based
Cluster neighborhood groups based on some density conditions
- DBSCAN
- SNN Clustering extension of DBSCAN with notion of similarity based on $k$-nearest neighbors
Grid-Based
Partition space into finite number of cells and perform clustering there
Other Types
Graph-Based Clustering
apply Graph Partitioning Algorithms:
- identify clusters by cutting edges from the graph
- s.t. the sum of cuts is minimal
- for example, Minimal Cut Algorithm
Algorithms:
- apply Graph Partitioning but in some high-dimensional space
- usually involves computing Singular Values and Vectors / Eigenvalues and Eigenvectors of the graph affinity matrix
- usually has global optimum
- criteria: Average Cut, Average Association, Normalized Cut, Min-Max Cut
- when applied to documents, under certain conditions resulting eigenspaces are equivalent to semantic spaces found by Latent Semantic Analysis
- but also like in LSA, usually found directions don’t correspond to clusters directly, need to do additional clustering afterwards (e.g. by K-Means)
Link-Based Clustering
Neural Networks Based
Called Self-Organized Maps
Build a Neural Network with 2 layers:
- input layers: $n$ input nodes
- output: $k$ nodes: decision regions
Objective we want to maximize:
- within-region similarities of items
References:
- https://en.wikipedia.org/wiki/Self-organizing_map
Fuzzy Clustering
sometimes also called “Soft Clustering”
- Usually clustering is “exclusive”: the clustering algorithms assigns each object strictly to cluster
- but we can remove this restriction and modify the membership function s.t. an object can belong to several clusters
- so In this type of clustering a data point may belong to several clusters
Membership function
- computes for each object and for each cluster returns the degree of membership
- modification of K-Means: Fuzzy C-Means
- degree of membership to the cluster in C-Means depends on the distance from the document to the cluster centroid
Others:
- clustering via Non-Negative Matrix Factorization can be fuzzy
- also, Looney, Carl G. “A fuzzy clustering and fuzzy merging algorithm.” 1999. [https://www.researchgate.net/publication/2467319_A_Fuzzy_Clustering_and_Fuzzy_Merging_Algorithm]
Probabilistic Clustering
Membership function outputs probabilities of an item belonging to a cluster
Algorithms:
Co-Clustering
One-sided clustering is clustering only rows of data matrix $D$
- co-clustering: clustering both rows and columns at the same time
Algorithms
- clustering via Non-Negative Matrix Factorization can be viewed as clustering both columns and rows
Subspace Clustering
Subspace clustering:
- it’s the task of detecting all clusters in all subspaces
- a data point may belong to many different clusters - with each cluster in some subspace
Sources
- http://en.wikipedia.org/wiki/Cluster_analysis
- Jing, Liping. “Survey of text clustering.” (2008). [http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.112.3476&rep=rep1&type=pdf]
- Oikonomakou, Nora, and Michalis Vazirgiannis. “A review of web document clustering approaches.” Data mining and knowledge discovery handbook. 2010. [https://scholar.google.com/scholar?cluster=1261203777431390097&hl=ru&as_sdt=0,5]
- Xu, Wei, Xin Liu, and Yihong Gong. “Document clustering based on non-negative matrix factorization.” 2003. [http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.117.2293&rep=rep1&type=pdf]