Agglomerative Clustering

General concept: merge items into clusters based on distance/similarity

  • usually based on best pairwise similarity

Typical steps:

  • at the beginning each document is a cluster on its own
  • then we compute similarity between all pairs of clusters and store the results in a similarity matrix
  • merge the two most similar clusters
  • update the similarity matrix
  • repeat until everything belongs to a single cluster (or until the desired number of clusters remains); see the sketch below
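
A minimal sketch of these steps in Python (the function and helper names are illustrative, cosine similarity is an assumption, and for brevity similarities are recomputed instead of being cached in a matrix):

    import numpy as np

    def cosine(u, v):
        return u @ v / (np.linalg.norm(u) * np.linalg.norm(v))

    def agglomerative(docs, target_clusters=1):
        # each document starts as a cluster on its own
        clusters = [[i] for i in range(len(docs))]

        def cluster_sim(a, b):
            # best pairwise similarity between the two clusters (single-linkage style)
            return max(cosine(docs[i], docs[j]) for i in a for j in b)

        merges = []
        while len(clusters) > target_clusters:
            # find the most similar pair of clusters and merge it
            i, j = max(((i, j) for i in range(len(clusters))
                        for j in range(i + 1, len(clusters))),
                       key=lambda p: cluster_sim(clusters[p[0]], clusters[p[1]]))
            merges.append((clusters[i], clusters[j]))
            clusters[i] = clusters[i] + clusters[j]
            del clusters[j]
        return clusters, merges

    docs = [np.array(v, dtype=float) for v in ([1, 0], [0.9, 0.1], [0, 1], [0.1, 0.9])]
    print(agglomerative(docs, target_clusters=2)[0])   # [[0, 1], [2, 3]]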


Linkage Types

How to join two clusters?

  • Single Linkage (SLINK)
  • Complete Linkage (CLINK)
  • Group-Average Linkage


Suppose we have clusters $A, B, C, ...$ that we want to merge

The similarity between two clusters $A$ and $B$ depends on the linkage type:

  • Single Linkage: $\text{sim}(A, B) = \max_{a \in A, \, b \in B} \text{sim}(a, b)$ (best pair)
  • Complete Linkage: $\text{sim}(A, B) = \min_{a \in A, \, b \in B} \text{sim}(a, b)$ (worst pair)
  • Group-Average Linkage: $\text{sim}(A, B) = \frac{1}{|A| \cdot |B|} \sum_{a \in A} \sum_{b \in B} \text{sim}(a, b)$


Single Linkage

Merge two groups $A$ and $B$ based on their closest pair


Implementation:

  • compute all pairwise similarities
  • sort them in decreasing order
  • process the pairs in this order, merging the clusters they connect (see the sketch below)
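
A rough sketch of this procedure, assuming a precomputed similarity matrix and a hypothetical union-find helper (this is essentially Kruskal's algorithm, which is what makes the spanning-tree connection below apparent):

    import numpy as np
    from itertools import combinations

    def single_link(sim, n_clusters):
        n = sim.shape[0]
        parent = list(range(n))

        def find(x):
            # union-find with path compression
            while parent[x] != x:
                parent[x] = parent[parent[x]]
                x = parent[x]
            return x

        # all similarity pairs, sorted in order of decreasing similarity
        pairs = sorted(combinations(range(n), 2), key=lambda p: sim[p], reverse=True)

        remaining = n
        for i, j in pairs:
            if remaining == n_clusters:
                break
            ri, rj = find(i), find(j)
            if ri != rj:        # joining two different clusters = one spanning-tree edge
                parent[rj] = ri
                remaining -= 1
        return [find(i) for i in range(n)]

    sim = np.array([[1.0, 0.9, 0.1, 0.2],
                    [0.9, 1.0, 0.3, 0.1],
                    [0.1, 0.3, 1.0, 0.8],
                    [0.2, 0.1, 0.8, 1.0]])
    print(single_link(sim, n_clusters=2))   # [0, 0, 2, 2]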


Advantages:

  • efficient to implement
  • equivalent to running a Spanning Tree algorithm on the complete graph of pair-wise distances (TODO: link to Algo 2 from Coursera)
  • e.g. Prim's Spanning Tree algorithm can be used


Drawbacks:

  • it encourages chaining
    • similarity is usually not transitive
    • i.e. if $A$ is similar to $B$, and $B$ is similar to $C$, it doesn't follow that $A$ must be similar to $C$
    • but single linkage still groups such documents together through transitivity chains


References:

  • Sibson, Robin. "SLINK: an optimally efficient algorithm for the single-link cluster method." 1973. [1]


Complete Linkage

The similarity between two clusters is the similarity of their least similar pair (worst-case similarity):

  • avoids chaining altogether
  • but it's very expensive computationally


References:

  • Defays, Daniel. "An efficient algorithm for a complete link method." 1977. [2]


Group-Average Linkage

The similarity between groups $A$ and $B$ is calculated as the average similarity over all pairs $a \in A$, $b \in B$


  • It's way slower than single linkage,
  • but it's more robust: it doesn't show the chaining behavior


Speeding up:

  • can approximate it by using the distance (or similarity) between centroids: the mean document of $A$ and the mean document of $B$ (see the sketch below)
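
A tiny sketch of this shortcut (illustrative vectors, cosine similarity assumed):

    import numpy as np

    def cosine(u, v):
        return u @ v / (np.linalg.norm(u) * np.linalg.norm(v))

    def group_average_sim(A, B):
        # exact group-average: average over all |A| * |B| cross-cluster pairs
        return np.mean([cosine(a, b) for a in A for b in B])

    def centroid_sim(A, B):
        # approximation: a single similarity between the two mean documents
        return cosine(A.mean(axis=0), B.mean(axis=0))

    A = np.array([[1.0, 0.1], [0.9, 0.2]])
    B = np.array([[0.2, 1.0], [0.1, 0.8]])
    print(group_average_sim(A, B), centroid_sim(A, B))   # the two values are close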


Ward's Method

Merge the pair of clusters that minimizes the total within-group error: the sum of squared distances between each document and its cluster centroid
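
In symbols (notation is mine, not from the original notes): with $\mu_k$ denoting the centroid of cluster $C_k$, the within-group error is

$E = \sum_k \sum_{d \in C_k} \lVert d - \mu_k \rVert^2$

and at each step the pair of clusters whose merge increases $E$ the least is joined.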


Result:

  • spherical tightly bound clusters
  • less sensitive to outliers

References:

  • El-Hamdouchi, Abdelmoula, and Peter Willett. "Hierarchic document classification using Ward's clustering method." 1986. [3]


Pros and Cons

  • Single-link algorithms are best at capturing clusters of different sizes and shapes
  • but they are also sensitive to noise
  • complete link and group average are not affected by noise, but have a bias towards finding global patterns (see the comparison sketch below)
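
SciPy's hierarchical clustering implements all of these linkage types, so their behavior can be compared directly on the same data; a hedged usage sketch on made-up data (not from the notes):

    import numpy as np
    from scipy.cluster.hierarchy import linkage, fcluster

    rng = np.random.default_rng(0)
    # two well-separated blobs of 20 points each (illustrative data)
    X = np.vstack([rng.normal(0.0, 0.3, size=(20, 2)),
                   rng.normal(3.0, 0.3, size=(20, 2))])

    for method in ('single', 'complete', 'average', 'ward'):
        Z = linkage(X, method=method)                     # the sequence of merges
        labels = fcluster(Z, t=2, criterion='maxclust')   # cut the tree into 2 clusters
        print(method, labels[:5], labels[-5:])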

Computational complexity:

  • only single-link is computationally feasible for large datasets, but it doesn't give good results because it uses too little information


References

  • Rasmussen, Edie M. "Clustering Algorithms." Information retrieval: data structures & algorithms 1992. [4]
  • Voorhees, Ellen M. "Implementing agglomerative hierarchic clustering algorithms for use in document retrieval." 1986. [5]


Sources

  • Ertöz, Levent, Michael Steinbach, and Vipin Kumar. "Finding clusters of different sizes, shapes, and densities in noisy, high dimensional data." 2003. [6]
  • Oikonomakou, Nora, and Michalis Vazirgiannis. "A review of web document clustering approaches." Data mining and knowledge discovery handbook. 2010. [7]