Scatter/Gather

Scatter/Gather is a variation of K-Means used for Document Clustering, with:

  • special seed selection
  • usual k-means
  • then several cluster refinement operations


Idea:

  • use some hierarchical clustering algorithm on a sample to find good initial seeds
  • use K-Means afterwards


Notation:

  • Let $k$ be the number of clusters we want to select and
  • $n$ the number of observations we want to cluster


Seed Selection

Buckshot

  • sample $\sqrt{k \cdot n}$ observations
  • use hierarchical clustering to get $k$ seeds from them
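
Below is a minimal Buckshot sketch in Python. It assumes documents are rows of a dense term-count matrix `X` and uses scikit-learn's agglomerative clustering for the hierarchical step; `buckshot_seeds` is a made-up helper name, not something from the paper.

```python
import numpy as np
from sklearn.cluster import AgglomerativeClustering

def buckshot_seeds(X, k, rng=None):
    """Pick k seeds by clustering a random sample of sqrt(k * n) documents."""
    if rng is None:
        rng = np.random.default_rng(0)
    n = X.shape[0]
    sample_size = int(np.sqrt(k * n))            # assumes sqrt(k * n) >= k, i.e. n >= k
    idx = rng.choice(n, size=sample_size, replace=False)
    sample = X[idx]
    # hierarchical (agglomerative) clustering on the sample, cut at k clusters
    labels = AgglomerativeClustering(n_clusters=k).fit_predict(sample)
    # one seed per cluster: here the summed word counts of its documents
    return np.vstack([sample[labels == c].sum(axis=0) for c in range(k)])
```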


Fractionation

  • break dataset into $n / m$ buckets of size $m > k$ each
  • apply hierarchical clustering inside each bucket to obtain $v$ clusters per bucket
  • then merge all observations of each cluster (within each bucket) into a single meta-observation
  • repeat the procedure on the merged observations until only $k$ of them remain; these are the seeds (see the sketch below)
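
A minimal Fractionation sketch under the same assumptions (dense term-count matrix `X`); the bucket size `m` and the reduction factor `rho` are illustrative parameters, and `fractionation_seeds` is a hypothetical helper name.

```python
import numpy as np
from sklearn.cluster import AgglomerativeClustering

def fractionation_seeds(X, k, m=100, rho=0.25):
    """Repeatedly cluster within buckets of size m, merging each group of
    documents into one meta-observation, until only k seeds remain.
    Assumes rho * m >= k so the final reduction does not undershoot k."""
    points = np.asarray(X, dtype=float)
    while points.shape[0] > m:
        merged = []
        for start in range(0, points.shape[0], m):
            bucket = points[start:start + m]
            if bucket.shape[0] < 2:
                merged.extend(bucket)            # nothing to cluster in this bucket
                continue
            n_groups = max(1, int(rho * bucket.shape[0]))
            labels = AgglomerativeClustering(n_clusters=n_groups).fit_predict(bucket)
            for c in range(n_groups):
                # merge all observations of a group into one meta-observation
                merged.append(bucket[labels == c].sum(axis=0))
        points = np.vstack(merged)
    # once everything fits into one bucket, cluster straight down to k seeds
    labels = AgglomerativeClustering(n_clusters=k).fit_predict(points)
    return np.vstack([points[labels == c].mean(axis=0) for c in range(k)])
```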


We can do better than grouping observations into buckets randomly:

  • for document clustering:
  • try to group documents so that documents inside the same bucket share common words
  • e.g. sort the documents by the index of the $j$th most common word in them; $j \approx 3$ corresponds to a word of medium frequency in a document (see the sketch below)
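
One way to implement this ordering trick (a sketch; the function name and the `j = 3` default are only illustrative): sort the documents by the vocabulary index of their $j$th most frequent word before slicing them into buckets.

```python
import numpy as np

def sort_docs_for_buckets(X, j=3):
    """Order documents so that neighbours tend to share medium-frequency words."""
    # for each document, the column index of its j-th most common word
    key = np.argsort(-X, axis=1)[:, j - 1]
    return np.argsort(key)                       # row order to use when forming buckets

# usage: X_ordered = X[sort_docs_for_buckets(X)] before calling fractionation_seeds
```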


K-Means Step

After selecting the seeds, we apply the usual K-Means

  • but with a difference: the centroid is not a "mean" document
  • the centroid is the concatenation of words from all documents in the cluster (see the sketch below)
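
A sketch of this K-Means step, assuming cosine similarity for the assignment and summed word counts as the "concatenated" centroid; `kmeans_step` and `cosine` are names made up for this page.

```python
import numpy as np

def cosine(a, b):
    return (a @ b) / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12)

def kmeans_step(X, seeds, n_iter=10):
    centroids = seeds.copy()
    for _ in range(n_iter):
        # assign every document to the most similar centroid
        labels = np.array([np.argmax([cosine(x, c) for c in centroids]) for x in X])
        # rebuild each centroid as the concatenation (summed word counts)
        # of all documents currently in the cluster
        centroids = np.vstack([
            X[labels == c].sum(axis=0) if np.any(labels == c) else centroids[c]
            for c in range(len(centroids))
        ])
    return labels, centroids
```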


Cluster Refinement

After selecting the seeds we run the usual K-Means

  • and then perform some additional operations to improve the quality of document clustering


Cluster refinement hypothesis:

  • documents that belong to the same cluster at a finer granularity will also occur in the same cluster at a coarser granularity


Split Operations

The main idea is to continue splitting after K-Means has finished

  • sometimes clusters are not as granular as we'd like
  • continue splitting them to refine the clusters
  • no need to apply it to all clusters, only to non-coherent ones
  • this helps create more coherent clusters


Algo:

  • select a cluster to split
  • apply buckshot with $k=2$ on this cluster
  • re-cluster around these centers
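
A sketch of the split, reusing the `buckshot_seeds` and `kmeans_step` helpers sketched above (both are assumptions of this page, not names from the paper):

```python
import numpy as np

def split_cluster(X, labels, cluster_id, next_id):
    """Split one cluster into two by running Buckshot with k=2 on its members."""
    members = np.where(labels == cluster_id)[0]
    sub_X = X[members]
    seeds = buckshot_seeds(sub_X, k=2)           # buckshot with k = 2
    sub_labels, _ = kmeans_step(sub_X, seeds)    # re-cluster around the two centers
    new_labels = labels.copy()
    new_labels[members[sub_labels == 1]] = next_id   # second half gets a new cluster id
    return new_labels
```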


How to measure "coherence"?

  • compute the self-similarity of a cluster:
  • similarity of the documents to the cluster centroid
  • or the average pairwise similarity within the cluster
  • apply the split only to clusters with low self-similarity (see the sketch below)
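
For example, self-similarity could be computed as the average cosine similarity of a cluster's documents to its centroid (a sketch reusing the `cosine` helper above):

```python
import numpy as np

def self_similarity(X, labels, cluster_id):
    members = X[labels == cluster_id]
    centroid = members.sum(axis=0)
    return np.mean([cosine(doc, centroid) for doc in members])

# split the least coherent cluster, e.g.:
# worst = min(set(labels), key=lambda c: self_similarity(X, labels, c))
```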


Join Operation

The idea:

  • after K-means has finished,
  • join very similar clusters into one


How?

  • compute the typical words of each cluster:
  • i.e. examine the most frequent words in the centroid
  • two clusters are similar if there's a significant overlap between the typical words of these clusters (see the sketch below)
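
A sketch of the join test; `top_n` and the overlap threshold are illustrative values, not parameters from the paper.

```python
import numpy as np

def top_words(centroid, top_n=50):
    """Indices of the most frequent words in a cluster centroid."""
    return set(np.argsort(-centroid)[:top_n])

def should_join(centroid_a, centroid_b, top_n=50, min_overlap=0.5):
    a, b = top_words(centroid_a, top_n), top_words(centroid_b, top_n)
    return len(a & b) / top_n >= min_overlap     # join when the overlap is large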


Challenges

  • if there are many documents, the centroid will contain many words, which leads to a slowdown
  • solution: use projection techniques like PCA
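
For instance, the documents (and hence the centroids) can be projected into a lower-dimensional space before clustering; the sketch below uses scikit-learn's truncated SVD (LSA) rather than plain PCA, and the 100 components are an arbitrary illustrative choice.

```python
from sklearn.decomposition import TruncatedSVD

def project(X, n_components=100):
    # n_components must be smaller than the vocabulary size
    svd = TruncatedSVD(n_components=n_components)
    return svd.fit_transform(X)   # cluster the projected documents instead of X
```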


Sources

  • Aggarwal, Charu C., and ChengXiang Zhai. "A survey of text clustering algorithms." Mining Text Data. Springer US, 2012. [1]
  • Cutting, Douglass R., et al. "Scatter/gather: A cluster-based approach to browsing large document collections." 1992. [2]