Semi-Supervised Clustering

Semi-supervised clustering is a bridge between Supervised Learning and Cluster Analysis.

It is about learning from both labeled and unlabeled data:
sometimes we have prior knowledge about the clusters, e.g. label information for some of the examples
such knowledge can help in forming the clusters, especially when the number of examples is very large

Where do these labels come from?

you can sample your data and label the sample manually
or you can try to extract labels automatically, e.g. from unstructured data, if you have at least some prior knowledge

Approaches

Seeded Approach

Use the labeled data to help initialize the clusters

this biases the clustering towards a good region of the search space
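
The seeding idea can be sketched with a small k-means variant. This is only an illustration under stated assumptions, not the exact algorithm from the paper cited below: the function name `seeded_kmeans`, the `seeds` dictionary, the Euclidean distance, and the toy data are all made up for the example. The labeled points are used only to compute the initial centroids; after that, every point is free to move.

```python
from collections import defaultdict
from math import dist


def mean(points):
    """Component-wise mean of a non-empty list of equal-length tuples."""
    return tuple(sum(coords) / len(points) for coords in zip(*points))


def seeded_kmeans(points, seeds, n_iter=20):
    """K-means whose initial centroids are the means of the labeled seeds.

    points: list of tuples (all data, labeled and unlabeled)
    seeds:  dict mapping an index into `points` to a class label (hypothetical format)
    Returns (centroids, assignments), with one cluster per distinct label.
    """
    by_label = defaultdict(list)
    for idx, label in seeds.items():
        by_label[label].append(points[idx])
    centroids = [mean(by_label[label]) for label in sorted(by_label)]

    assignments = []
    for _ in range(n_iter):
        # Assignment step: seeds only influence the initialization, so every
        # point (labeled or not) goes to its nearest centroid.
        assignments = [
            min(range(len(centroids)), key=lambda k: dist(p, centroids[k]))
            for p in points
        ]
        # Update step: recompute each centroid; an empty cluster keeps its old one.
        new_centroids = []
        for k, old in enumerate(centroids):
            members = [p for p, a in zip(points, assignments) if a == k]
            new_centroids.append(mean(members) if members else old)
        if new_centroids == centroids:
            break
        centroids = new_centroids
    return centroids, assignments


# Toy data: only two of the six points carry a label.
points = [(0.0, 0.0), (0.2, 0.1), (0.1, 0.3), (5.0, 5.0), (5.2, 4.9), (4.8, 5.1)]
seeds = {0: "a", 3: "b"}
centroids, assignments = seeded_kmeans(points, seeds)
print(assignments)
```

Starting from the label means instead of random points is what moves the search towards a good region: with well-chosen seeds the first assignment step is already close to the final partition.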

Papers

Basu, Sugato, Arindam Banerjee, and Raymond Mooney. “Semi-supervised clustering by seeding.” 2002.

Constrained Approach

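Force the clustering to keep the grouping given by the labels unchanged: labeled examples stay in the cluster of their label, and only unlabeled examples can move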
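
A minimal sketch of this constraint, assuming a k-means style algorithm: labeled points are pinned to the cluster of their label in every assignment step, and only unlabeled points are reassigned. The names `constrained_kmeans` and `fixed`, the use of Euclidean distance, and the toy data are assumptions made for the illustration.

```python
from math import dist


def mean(points):
    """Component-wise mean of a non-empty list of equal-length tuples."""
    return tuple(sum(coords) / len(points) for coords in zip(*points))


def constrained_kmeans(points, fixed, k, n_iter=20):
    """K-means in which labeled points never change cluster.

    points: list of tuples
    fixed:  dict mapping an index into `points` to a cluster id in range(k)
            (assumes every cluster id has at least one labeled point)
    """
    # Initialize each centroid from the labeled points of that cluster.
    centroids = [
        mean([points[i] for i, c in fixed.items() if c == cluster])
        for cluster in range(k)
    ]
    assignments = []
    for _ in range(n_iter):
        # Labeled points keep their given cluster; the rest go to the nearest centroid.
        assignments = [
            fixed[i] if i in fixed
            else min(range(k), key=lambda c: dist(p, centroids[c]))
            for i, p in enumerate(points)
        ]
        new_centroids = []
        for cluster, old in enumerate(centroids):
            members = [p for p, a in zip(points, assignments) if a == cluster]
            new_centroids.append(mean(members) if members else old)
        if new_centroids == centroids:
            break
        centroids = new_centroids
    return centroids, assignments


# Point 5 lies close to cluster 0 but is labeled as cluster 1, so it must stay there.
points = [(0.0, 0.0), (0.2, 0.1), (3.0, 3.0), (5.0, 5.0), (5.2, 4.9), (1.0, 1.0)]
fixed = {0: 0, 3: 1, 5: 1}
centroids, assignments = constrained_kmeans(points, fixed, k=2)
print(assignments)
```

In this toy run the labeled point at index 5 ends up nearer to the centroid of cluster 0, yet it stays in cluster 1 because its label is treated as a hard constraint.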

Feedback-Based Approach

First run regular (unsupervised) clustering
then adjust the resulting clusters based on the labeled data
the adjustment can also account for user feedback
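
The steps above can be sketched as follows. This is one possible reading, not a method taken from a specific paper: plain k-means runs first, then user corrections (a hypothetical `feedback` dictionary from point index to cluster id) override the assignments, and the centroids and remaining assignments are refreshed once. All function names and the toy data are assumptions.

```python
from math import dist


def mean(points):
    """Component-wise mean of a non-empty list of equal-length tuples."""
    return tuple(sum(coords) / len(points) for coords in zip(*points))


def nearest(p, centroids):
    """Index of the centroid closest to p (Euclidean distance)."""
    return min(range(len(centroids)), key=lambda k: dist(p, centroids[k]))


def update(points, assignments, centroids):
    """Recompute centroids; an empty cluster keeps its old centroid."""
    result = []
    for k, old in enumerate(centroids):
        members = [p for p, a in zip(points, assignments) if a == k]
        result.append(mean(members) if members else old)
    return result


def kmeans(points, centroids, n_iter=20):
    """Step 1: regular k-means, started from the given centroids."""
    for _ in range(n_iter):
        assignments = [nearest(p, centroids) for p in points]
        centroids = update(points, assignments, centroids)
    return centroids, assignments


def apply_feedback(points, centroids, assignments, feedback):
    """Step 2: adjust the clusters using labeled points or user corrections."""
    # Override the assignments the user disagrees with.
    adjusted = [feedback.get(i, a) for i, a in enumerate(assignments)]
    centroids = update(points, adjusted, centroids)
    # Reassign the points without feedback against the corrected centroids.
    adjusted = [
        feedback[i] if i in feedback else nearest(p, centroids)
        for i, p in enumerate(points)
    ]
    return update(points, adjusted, centroids), adjusted


points = [(0.0, 0.0), (0.2, 0.1), (2.0, 2.0), (2.4, 2.4), (5.0, 5.0), (5.2, 4.9)]
centroids, before = kmeans(points, [points[0], points[4]])
# The user says that point 3 belongs to cluster 1.
centroids, after = apply_feedback(points, centroids, before, {3: 1})
print(before, after)
```

A single adjustment pass keeps the sketch short; in practice the feedback step could be repeated each time the user supplies new corrections.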

Probabilistic frameworks

Papers

Basu, Sugato, Mikhail Bilenko, and Raymond J. Mooney. “A probabilistic framework for semi-supervised clustering.” 2004.

Document Classification

Combining labeled and unlabeled data is also useful for document classification

Papers

Nigam, Kamal, et al. “Learning to classify text from labeled and unlabeled documents.” 1998.
Nigam, Kamal, et al. “Text classification from labeled and unlabeled documents using EM.” 2000.

Sources

Aggarwal, Charu C., and ChengXiang Zhai. “A survey of text clustering algorithms.” Mining Text Data. Springer US, 2012.
Jing, Liping. “Survey of text clustering.” 2008.
