Semi-Supervised Clustering

Semi-supervised clustering is a bridge between Supervised Learning and Cluster Analysis

  • it is about learning from both labeled and unlabeled data
  • sometimes we have prior knowledge about the clusters, e.g. label information for some of the examples
  • such knowledge can be useful in creating clusters, especially when the number of examples is very large


Where do these labels come from?

  • you can sample your data and label the sample manually
  • or you can try to extract the labels from the data itself, e.g. from unstructured data, if you have at least some prior knowledge about it


Approaches

Seeded Approach

Use the labeled data to help initialize the clusters

  • this biases the clustering towards a good region of the search space
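
A minimal sketch of the seeding idea, assuming plain k-means as the underlying algorithm: the initial centroids are the means of the labeled points of each cluster, and the iterations after that are ordinary k-means. The function name seeded_kmeans, the seeds dictionary and the toy data are illustrative assumptions, not taken from the paper listed below.

```python
import math


def mean(points):
    """Component-wise mean of a non-empty list of equal-length tuples."""
    n = len(points)
    return tuple(sum(coords) / n for coords in zip(*points))


def nearest(point, centroids):
    """Index of the centroid closest to `point` (Euclidean distance)."""
    return min(range(len(centroids)), key=lambda c: math.dist(point, centroids[c]))


def seeded_kmeans(points, seeds, n_iter=20):
    """k-means whose initial centroids come from labeled seed points.

    points: list of tuples (labeled and unlabeled data together)
    seeds:  dict {point index: cluster label 0..k-1};
            assumption: every cluster has at least one seed
    """
    k = max(seeds.values()) + 1
    # Initialization is the only place where the labels are used.
    centroids = [
        mean([points[i] for i, lab in seeds.items() if lab == c]) for c in range(k)
    ]
    assign = []
    for _ in range(n_iter):
        new_assign = [nearest(p, centroids) for p in points]
        if new_assign == assign:  # converged
            break
        assign = new_assign
        for c in range(k):
            members = [p for p, a in zip(points, assign) if a == c]
            if members:  # keep the old centroid if a cluster becomes empty
                centroids[c] = mean(members)
    return assign, centroids


points = [(0.0, 0.0), (0.2, 0.1), (0.1, 0.3), (5.0, 5.0), (5.2, 4.9), (4.8, 5.1)]
seeds = {0: 0, 3: 1}  # two hand-labeled examples, one per cluster
assign, centroids = seeded_kmeans(points, seeds)
print(assign)  # [0, 0, 0, 1, 1, 1]
```

After initialization the seeds are treated like any other point, so in this sketch a seed may still end up in a different cluster than its label suggests.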


Papers

  • Basu, Sugato, Arindam Banerjee, and Raymond Mooney. "Semi-supervised clustering by seeding." 2002.


Constrained Approach

Force the clustering to keep the grouping given by the labels unchanged, i.e. labeled examples are never moved to a different cluster
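
One way to illustrate this, again assuming k-means as the base algorithm: labeled points are pinned to their given cluster in every assignment step, and only unlabeled points are reassigned. The name constrained_kmeans and the one-dimensional toy data are assumptions made for this sketch.

```python
import math


def constrained_kmeans(points, labels, k, n_iter=20):
    """k-means in which labeled points never leave their given cluster.

    points: list of tuples
    labels: dict {point index: cluster label 0..k-1};
            assumption: every cluster has at least one labeled point
    """
    def mean(pts):
        return tuple(sum(coords) / len(pts) for coords in zip(*pts))

    # Start from the means of the labeled points of each cluster.
    centroids = [
        mean([points[i] for i, lab in labels.items() if lab == c]) for c in range(k)
    ]
    assign = None
    for _ in range(n_iter):
        new_assign = [
            labels[i] if i in labels  # constraint: keep the given label
            else min(range(k), key=lambda c: math.dist(p, centroids[c]))
            for i, p in enumerate(points)
        ]
        if new_assign == assign:  # converged
            break
        assign = new_assign
        # Each cluster keeps at least its labeled points, so none is empty.
        centroids = [
            mean([p for p, a in zip(points, assign) if a == c]) for c in range(k)
        ]
    return assign


points = [(0, 0), (1, 0), (6, 0), (9, 0), (10, 0)]
labels = {0: 0, 2: 0, 4: 1}  # point 2 is labeled 0 although it lies near cluster 1
assign = constrained_kmeans(points, labels, k=2)
print(assign)  # [0, 0, 0, 1, 1]
```

In this toy run the final centroids are at about 2.33 and 9.5, so point 2 (at 6) is slightly closer to cluster 1, yet it stays in cluster 0 because its label is enforced.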


Feedback-Based Approach

  • first run a regular clustering algorithm
  • then adjust the resulting clusters based on the labeled data
  • this also makes it possible to account for user feedback
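
The steps above could be sketched as follows. This is only one possible adjustment rule, chosen for illustration and not taken from a specific paper: each label is mapped to the cluster that holds most of its labeled points, labeled points are moved there, centroids are recomputed, and unlabeled points are reassigned to the nearest centroid. The function adjust_with_labels and the initial assignment are hypothetical.

```python
import math
from collections import Counter


def adjust_with_labels(points, assign, labels):
    """Correct an existing clustering with a few labels (the feedback).

    points: list of tuples
    assign: list of cluster ids produced by an ordinary clustering run
    labels: dict {point index: class label}
    """
    # 1. Map each class label to the cluster holding most of its labeled points.
    votes = {}
    for i, lab in labels.items():
        votes.setdefault(lab, Counter())[assign[i]] += 1
    home = {lab: counts.most_common(1)[0][0] for lab, counts in votes.items()}

    # 2. Move every labeled point to the home cluster of its label.
    fixed = list(assign)
    for i, lab in labels.items():
        fixed[i] = home[lab]

    # 3. Recompute the centroids of the non-empty clusters.
    centroids = {}
    for c in set(fixed):
        members = [p for p, a in zip(points, fixed) if a == c]
        centroids[c] = tuple(sum(x) / len(members) for x in zip(*members))

    # 4. Reassign unlabeled points to the nearest updated centroid.
    return [
        fixed[i] if i in labels
        else min(centroids, key=lambda c: math.dist(p, centroids[c]))
        for i, p in enumerate(points)
    ]


points = [(0, 0), (1, 0), (2, 0), (3, 0), (9, 0), (10, 0)]
initial = [0, 0, 0, 1, 1, 1]  # assumed output of a regular clustering run
feedback = {0: "a", 1: "a", 3: "a", 5: "b"}  # point 3 was put in the wrong cluster
adjusted = adjust_with_labels(points, initial, feedback)
print(adjusted)  # [0, 0, 0, 0, 1, 1]
```

A limitation of this simple rule is that two different labels may be mapped to the same cluster; a fuller treatment would split such a cluster.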


Probabilistic Frameworks

Papers

  • Basu, Sugato, Mikhail Bilenko, and Raymond J. Mooney. "A probabilistic framework for semi-supervised clustering." 2004.


Document Classification

Learning from both labeled and unlabeled data is also useful for document classification

Papers

  • Nigam, Kamal, et al. "Learning to classify text from labeled and unlabeled documents." 1998. [1]
  • Nigam, Kamal, et al. "Text classification from labeled and unlabeled documents using EM." 2000. [2]


Sources

  • Aggarwal, Charu C., and ChengXiang Zhai. "A survey of text clustering algorithms." Mining Text Data. Springer US, 2012. [3]
  • Jing, Liping. "Survey of text clustering." 2008. [4]