Document Clustering

The goal of text clustering is

Objects to be clustered are



Usual NLP/IR

Document Representation

The most commonly used document representation is Vector Space Model:

  • extract a list of unique terms and weight them
  • weighted with TF or TF-IDF

Alternative representation:

Feature Selection

Concept of distance and similarity may be not meaningful in high-dimensional space

  • so may need to reduce dimensionality

In text mining usually referred as "Term Selection":

  • Remove stop words
  • use document frequency to cut away infrequent and very frequent words. such words usually don't contribute much (or anything) to similarity computation
  • Subset Selection and Feature Filtering won't work because we don't have labels

Dimensionality Reduction


Distances and Similarity

Popular choice:

If not Vector Space Models:


  • Strehl2000: Survey on distances for documents
  • Sahami2006: When text segments are too short (e.g. tweets or sentences)

Direct similarity measures are not always reliable for high-dimensional clustering (see Guha1999)

  • high dimensional data is sparse and therefore on average similarity is low
  • also see Curse of Dimensionality
  • SNN Distance solves it: Shared Nearest Neighbors Distance, # of KNNs two documents share (as used in SNN Clustering)



  • centroids = weighed average of all docs in the cluster
  • to compare a document with a cluster, calculate cosine between document and cluster

A variation of K-Means:

  • Bisecting K-Means: gives good performance for document clusters
  • K-Medoids for non-Euclidean distances, using medoid ($\approx$ median) instead of mean for selecting a centroid
  • Scatter/Gather:
    • smart seed selection
    • centroid = concatenation of all docs in the cluster
    • Split and Join refinement operations

Two-Phase Document Clustering

Main idea:

  • use Mutual Information to find best term clustering
  • and then use mutual information to find best document clustering


Clustering terms and documents at the same time

  • clustering of terms and clustering of documents are dual problems
  • take advantage of that
  • also can use Non-Negative Matrix Factorization $A \approx UV^T$ where $U$ are clusters of docs and $V$ are clusters of terms

Latent Semantic Analysis

Using PCA define new features from terms

  • it creates a new semantic space where problems like symomymy or polysemy are solved
  • term-document matrix is decomposed using SVD

Not only SVD is good:

Topic Models

Semi-Supervised Clustering

Use prior knowledge to help clustering

  • e.g. if you know some of the labels, do better seed selection for K-Means



  • Text data usually has very high dimensionality
  • especially important for large corpus - it will be very slow, especially for hierarchical algorithms

Inverted Index


  • usually a document contains only a small portion of terms
  • so document vectors are very sparse
  • typical distance is cosine similarity - it ignores zeros. for cosine to be non-zero, two docs need to share at least one term
  • $D^T$ is the inverted index of the term-document matrix $D$

this, to find docs similar to $d$:

  • for each $w_i \in d$
  • let $D_i = \{ d_i \} - d = \text{index}[w_i]$ be a set of documents that also contain $w_i$
  • then take the union of all $D_i$
  • calculate similarity only with documents from this union

Term Selection


  • Only top high-weighted terms contribute substantially to the norm
  • so keep only those weights that contribute 90% of the norm
  • and set the rest to 0

It can reduce the number of documents to consider but without losing much information

Locality Sensitive Hashing

The approach with inverted index is also called "self-join", and this doesn't scale well for large databases

  • LSH allows to circumvent the self-join bottleneck


  • Anick, Peter G., and Shivakumar Vaithyanathan. "Exploiting clustering and phrases for context-based information retrieval." 1997.
  • Cutting, Douglass R., et al. "Scatter/gather: A cluster-based approach to browsing large document collections." 1992. [1]
  • Cutting, Douglass R., David R. Karger, and Jan O. Pedersen. "Constant interaction-time scatter/gather browsing of very large document collections." 1993.
  • Baker, L. Douglas, and Andrew Kachites McCallum. "Distributional clustering of words for text classification." 1998. [2]
  • Bekkerman, Ron, et al. "On feature distributional clustering for text categorization." 2001. [3]
  • Sahami, Mehran, and Timothy D. Heilman. "A web-based kernel function for measuring the similarity of short text snippets." 2006. [4]
  • Strehl, Alexander, Joydeep Ghosh, and Raymond Mooney. "Impact of similarity measures on web-page clustering." 2000. [5]
  • Guha, Sudipto, Rajeev Rastogi, and Kyuseok Shim. "ROCK: A robust clustering algorithm for categorical attributes." 1999. [6]


  • Steinbach, Michael, George Karypis, and Vipin Kumar. "A comparison of document clustering techniques." 2000. ([7])
  • Larsen, Bjornar, and Chinatsu Aone. "Fast and effective text mining using linear-time document clustering." 1999. ([8])
  • Aggarwal, Charu C., and ChengXiang Zhai. "A survey of text clustering algorithms." Mining Text Data. Springer US, 2012. [9]
  • Jing, Liping. "Survey of text clustering." (2008). [10]
  • Ertöz, Levent, Michael Steinbach, and Vipin Kumar. "Finding clusters of different sizes, shapes, and densities in noisy, high dimensional data." 2003. [11]