This is

Scatter/Gather is a variation of K-Means used for Document Clustering with

  • special seed selection
  • usual k-means
  • then several cluster refinement operations


  • use some hierarchical clustering algorithm on a sample to find good initial seeds
  • use K-Means afterwards


  • Let $k$ be the number of clusters we want to select and
  • $n$ the number of observations we want to cluster

Seed Selection


  • sample $\sqrt{k \cdot n}$ observations
  • use hierarchical clustering to get $k$ seeds from them


  • break dataset into $n / m$ buckets of size $m > k$ each
  • apply hierarchical clustering to each bucket and obtain $v$ clusters inside each bucket
  • then merge all observations in these clusters (within each bucket) into one
  • repeat till we have $k$ seeds

Can do better than randomly grouping observations into buckets

  • for document clustering:
  • try to group documents in such a way that documents inside buckets have common words
  • e.g. sort docs by index of $j$th most common word in them, $j \approx 3$ corresponds to medium frequent word in a doc

K-Means Step

After we did the seed selecting, we apply the usual k-means

  • but, with a difference: the centroid is not a "mean" document
  • centroid is a concatenation of words from all the documents in the cluster

Cluster Refinement

After selecting seeds just do usual K-Means

  • but do some additional operations to improve quality for document clustering

Cluster refinement hypothesis:

  • documents that belong to the same cluster in finer granularity will also occur in the same cluster in coarser granularity

Split Operations

The main idea is to continue splitting after K-Means has finished

  • sometimes clusters as not as granular as we'd like
  • continue splitting them to refine clusters
  • don't need to apply it to all clusters - only to non-coherent ones
  • it will help create more coherent clusters


  • select a cluster to split
  • apply buckshot with $k=2$ on this cluster
  • re-cluster around these centers

How to measure "coherence"?

  • compute self-similarity of a cluster:
  • similarity of documents to the centroid cluster
  • or average pairwise similarity within the cluster
  • apply split only to clusters with low self-similarity

Join Operation

The idea:

  • after K-means has finished,
  • join very similar clusters into one


  • compute typical words of each cluster:
  • i.e. examine most frequent words on the centroid
  • two clusters are similar if there's significant overlap between words of these clusters


  • if there are many documents, the centroid will contain many word => leads to slowdown
  • solution: use projection techniques like PCA


  • Aggarwal, Charu C., and ChengXiang Zhai. "A survey of text clustering algorithms." Mining Text Data. Springer US, 2012. [1]
  • Cutting, Douglass R., et al. "Scatter/gather: A cluster-based approach to browsing large document collections." 1992. [2]
Share your opinion