Semi-Supervised Clustering
Semi-supervised clustering is a bridge between Supervised Learning and Cluster Analysis
- it is about learning from both labeled and unlabeled data
- sometimes we have prior knowledge about the clusters, e.g. labels for some of the examples
- such knowledge can guide the creation of clusters, especially when the number of examples is very large
Where do these labels come from?
- you can sample your data and label the sample manually
- or try to extract labels automatically, e.g. from the unstructured data itself, if you have at least some prior knowledge
Approaches
Seeded Approach
Use the labeled data to initialize the clusters
- this biases the clustering towards a good region of the search space
Papers
- Basu, Sugato, Arindam Banerjee, and Raymond Mooney. "Semi-supervised clustering by seeding." 2002.
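The seeding idea can be sketched as follows. This is a minimal illustration, not the implementation from the paper above: it assumes points are numeric tuples, Euclidean distance, and at least one labeled seed per cluster. The names `seeded_kmeans` and `mean` are hypothetical helpers introduced for this example.

```python
from math import dist

def mean(points):
    """Component-wise mean of a non-empty list of equal-length tuples."""
    n = len(points)
    return tuple(sum(coords) / n for coords in zip(*points))

def seeded_kmeans(points, seeds, n_iter=20):
    """points: list of numeric tuples.
    seeds: dict {point index: cluster label} for the labeled examples.
    Returns (assignments, centroids)."""
    labels = sorted(set(seeds.values()))
    # Initialization: each centroid is the mean of the seeds carrying that label.
    centroids = {
        lab: mean([points[i] for i, l in seeds.items() if l == lab])
        for lab in labels
    }
    assign = []
    for _ in range(n_iter):
        # Assignment step: every point, seeds included, goes to its nearest centroid.
        assign = [min(labels, key=lambda lab: dist(p, centroids[lab])) for p in points]
        # Update step: recompute centroids; an empty cluster keeps its old centroid.
        new_centroids = {}
        for lab in labels:
            members = [p for p, a in zip(points, assign) if a == lab]
            new_centroids[lab] = mean(members) if members else centroids[lab]
        if new_centroids == centroids:
            break
        centroids = new_centroids
    return assign, centroids
```

Here the labels are used only for initialization: after the first step a seed may move to another cluster if a different centroid is closer.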
Constrained Approach
Force the grouping given by the labels to stay unchanged: labeled examples keep their cluster throughout the clustering
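A minimal sketch of this constraint in a k-means-style loop: labeled points are never reassigned, and only the unlabeled points move between clusters. It assumes numeric tuples, Euclidean distance, and at least one labeled point per cluster. The names `constrained_kmeans` and `mean` are hypothetical and chosen for this example only.

```python
from math import dist

def mean(points):
    """Component-wise mean of a non-empty list of equal-length tuples."""
    n = len(points)
    return tuple(sum(coords) / n for coords in zip(*points))

def constrained_kmeans(points, seeds, n_iter=20):
    """points: list of numeric tuples.
    seeds: dict {point index: cluster label}; these assignments are fixed.
    Returns (assignments, centroids)."""
    labels = sorted(set(seeds.values()))
    # Start from the mean of the labeled points of each cluster.
    centroids = {
        lab: mean([points[i] for i, l in seeds.items() if l == lab])
        for lab in labels
    }
    assign = []
    for _ in range(n_iter):
        # Labeled points keep their label; the others go to the nearest centroid.
        assign = [
            seeds[i] if i in seeds
            else min(labels, key=lambda lab: dist(p, centroids[lab]))
            for i, p in enumerate(points)
        ]
        # Every cluster contains at least its labeled points, so none is empty.
        new_centroids = {
            lab: mean([p for p, a in zip(points, assign) if a == lab])
            for lab in labels
        }
        if new_centroids == centroids:
            break
        centroids = new_centroids
    return assign, centroids
```

With points `[(0, 0), (1, 0), (9, 0), (10, 0), (6, 0)]` and seeds `{0: "a", 3: "b", 4: "a"}`, the point `(6, 0)` stays in cluster "a" even though the final centroid of "b" is slightly closer to it.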
Feedback-Based Approach
- first run a regular clustering algorithm
- then adjust the resulting clusters based on the labeled data
- the adjustment can also account for user feedback
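One very simple way to read these steps, shown only as an illustration and not as a method taken from the literature: take the cluster ids produced by any unsupervised run, name each cluster after the majority label among its labeled members, and let explicit feedback override the cluster-level guess. The function name `adjust_with_feedback` and the example labels are hypothetical.

```python
from collections import Counter, defaultdict

def adjust_with_feedback(assign, feedback):
    """assign: list of cluster ids from any unsupervised clustering run.
    feedback: dict {point index: label given by the user}.
    Returns (cluster_to_label, adjusted), where adjusted holds one label per
    point, or None for points in clusters that received no feedback."""
    votes = defaultdict(Counter)
    for i, lab in feedback.items():
        votes[assign[i]][lab] += 1
    # Name each cluster after the most common user label among its members.
    cluster_to_label = {c: cnt.most_common(1)[0][0] for c, cnt in votes.items()}
    adjusted = [cluster_to_label.get(c) for c in assign]
    # Explicit feedback on a point overrides the cluster-level label.
    for i, lab in feedback.items():
        adjusted[i] = lab
    return cluster_to_label, adjusted
```

A fuller version would move the disagreeing points to a better cluster and re-run the clustering; this sketch only relabels.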
Probabilistic frameworks
Papers
- Basu, Sugato, Mikhail Bilenko, and Raymond J. Mooney. "A probabilistic framework for semi-supervised clustering." 2004.
Combining labeled and unlabeled data is also useful for document classification
Papers
- Nigam, Kamal, et al. "Learning to classify text from labeled and unlabeled documents." (1998). [1]
- Nigam, Kamal, et al. "Text classification from labeled and unlabeled documents using EM." (2000). [2]
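The general idea of combining a naive Bayes text classifier with EM can be sketched as below. This is a simplified illustration under stated assumptions (documents are token lists, Laplace smoothing, a fixed number of iterations), not a reproduction of the papers' exact procedure. The names `train_nb`, `posterior` and `em_nb` are hypothetical.

```python
from collections import Counter
from math import exp, log

def train_nb(docs, weights, classes, vocab):
    """M-step: fit naive Bayes from documents with fractional class weights.
    docs: list of token lists; weights: list of dicts {class: P(class | doc)}."""
    prior, word_logp = {}, {}
    for c in classes:
        # Laplace-smoothed class prior from (fractional) document counts.
        prior[c] = log((1 + sum(w[c] for w in weights)) / (len(classes) + len(docs)))
        counts = Counter()
        for doc, w in zip(docs, weights):
            for tok in doc:
                counts[tok] += w[c]
        total = sum(counts.values())
        # Laplace-smoothed word probabilities for class c.
        word_logp[c] = {t: log((1 + counts[t]) / (len(vocab) + total)) for t in vocab}
    return prior, word_logp

def posterior(doc, prior, word_logp, classes):
    """E-step for one document: P(class | doc); unknown tokens are ignored."""
    scores = {
        c: prior[c] + sum(word_logp[c][t] for t in doc if t in word_logp[c])
        for c in classes
    }
    m = max(scores.values())  # subtract the max for numerical stability
    exps = {c: exp(s - m) for c, s in scores.items()}
    z = sum(exps.values())
    return {c: e / z for c, e in exps.items()}

def em_nb(labeled, unlabeled, n_iter=10):
    """labeled: list of (tokens, class); unlabeled: list of token lists."""
    classes = sorted({c for _, c in labeled})
    vocab = {t for d, _ in labeled for t in d} | {t for d in unlabeled for t in d}
    lab_docs = [d for d, _ in labeled]
    # Labeled documents keep a fixed 0/1 weight for their known class.
    lab_w = [{c: 1.0 if c == y else 0.0 for c in classes} for _, y in labeled]
    # Initial model from the labeled documents only.
    prior, word_logp = train_nb(lab_docs, lab_w, classes, vocab)
    for _ in range(n_iter):
        # E-step: soft-label the unlabeled documents with the current model.
        unl_w = [posterior(d, prior, word_logp, classes) for d in unlabeled]
        # M-step: refit on labeled plus soft-labeled documents.
        prior, word_logp = train_nb(lab_docs + unlabeled, lab_w + unl_w, classes, vocab)
    return prior, word_logp, classes
```

The unlabeled documents let the model learn about words that never occur in the labeled set: a word that co-occurs with known "sport" words in unlabeled documents ends up more probable under "sport".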
Sources
- Aggarwal, Charu C., and ChengXiang Zhai. "A survey of text clustering algorithms." Mining Text Data. Springer US, 2012. [3]
- Jing, Liping. "Survey of text clustering." (2008). [4]