General concept: merge items into clusters based on distance/similarity

- usually based on best pairwise similarity

Typical steps:

- at the beginning each document is a cluster on its own
- then we compute similarity between all pairs of clusters and store the results in a similarity matrix
- merge two most similar clusters
- update the similarity matrix
- repeat until everything belongs to the same cluster
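The loop above can be sketched directly in code. This is a toy version, assuming 1-D numeric "documents" and Euclidean distance (smaller distance = more similar); the closest-pair cluster distance used here is the single-linkage choice discussed below:

```python
def agglomerate(points):
    """Return the merge history as a list of (cluster_a, cluster_b) pairs."""
    # at the beginning each document is a cluster on its own
    clusters = [frozenset([i]) for i in range(len(points))]

    def dist(a, b):
        # distance between two clusters = distance of their closest pair
        return min(abs(points[i] - points[j]) for i in a for j in b)

    history = []
    while len(clusters) > 1:
        # find the most similar (closest) pair of clusters
        i, j = min(
            ((i, j) for i in range(len(clusters))
             for j in range(i + 1, len(clusters))),
            key=lambda p: dist(clusters[p[0]], clusters[p[1]]),
        )
        a, b = clusters[i], clusters[j]
        history.append((set(a), set(b)))
        # merge the pair and shrink the cluster list
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)]
        clusters.append(a | b)
    return history

history = agglomerate([0, 1, 2, 50, 51])
```

This naive sketch recomputes all pairwise distances every round (roughly cubic time); real implementations keep the similarity matrix and update it incrementally, as in the steps above.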

How to join two clusters?

- Single Linkage (SLINK)
- Complete Linkage (CLINK)
- Group-Average Linkage
- Ward's Method

Suppose we have clusters $A, B, C, \ldots$ that we want to merge. The linkage criteria differ in how they score a candidate pair $(A, B)$:

- Single Linkage: $\text{sim}(A, B) = \max_{a \in A,\, b \in B} \text{sim}(a, b)$
- Complete Linkage: $\text{sim}(A, B) = \min_{a \in A,\, b \in B} \text{sim}(a, b)$
- Group-Average Linkage: $\text{sim}(A, B) = \frac{1}{|A| \, |B|} \sum_{a \in A} \sum_{b \in B} \text{sim}(a, b)$

Single Linkage: merge two groups $A$ and $B$ based on their closest (most similar) pair

Implementation:

- compute the similarity for all pairs of documents
- sort the pairs in decreasing order of similarity
- process the pairs in this order, merging the clusters they connect
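These three steps can be sketched with a union-find structure; processing pairs by decreasing similarity is the same as by increasing distance for the 1-D toy input used here (`single_link_merges` is an illustrative name, not a library API):

```python
def single_link_merges(points):
    """Process pairs closest-first, merging the clusters they connect."""
    parent = list(range(len(points)))

    def find(x):
        # union-find root lookup with path compression
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    # all pairs, sorted by increasing distance (decreasing similarity)
    pairs = sorted(
        (abs(points[i] - points[j]), i, j)
        for i in range(len(points)) for j in range(i + 1, len(points))
    )
    merges = []
    for d, i, j in pairs:
        ri, rj = find(i), find(j)
        if ri != rj:                 # the pair connects two different clusters
            parent[rj] = ri
            merges.append((d, i, j))
    return merges

merges = single_link_merges([0, 1, 2, 50])
```

Only pairs that actually join two different clusters cause a merge; for $n$ documents exactly $n - 1$ merges happen, which is what makes the MST view below work.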

Advantages:

- efficient to implement
- equivalent to running a Minimum Spanning Tree algorithm on the complete graph of pairwise distances
- e.g. can use Prim's Spanning Tree algorithm **(TODO: link to Algo 2 from Coursera)**

Drawbacks:

- encourages chaining
- similarity is usually not transitive: if $A$ is similar to $B$, and $B$ is similar to $C$, it doesn't mean that $A$ must be similar to $C$
- but single linkage encourages grouping through such transitivity chains
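A toy illustration of chaining, assuming 1-D points and a fixed closeness threshold: every neighbouring pair is similar, so single linkage links the whole chain even though its endpoints are far apart (`single_link_components` is a hypothetical helper):

```python
def single_link_components(points, threshold):
    """Count clusters when any close pair joins its two clusters."""
    parent = list(range(len(points)))

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for i in range(len(points)):
        for j in range(i + 1, len(points)):
            if abs(points[i] - points[j]) <= threshold:
                parent[find(j)] = find(i)   # merge the two clusters
    return len({find(i) for i in range(len(points))})

chain = [0, 1, 2, 3, 4, 5]   # every neighbour is within the threshold...
n = single_link_components(chain, 1)
# ...so everything chains into one cluster, although the endpoints 0 and 5
# are far apart relative to the threshold
```

With two well-separated groups (e.g. `[0, 1, 10, 11]`) the same helper correctly keeps them apart; the problem is specific to chains of intermediate points.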

References:

- Sibson, Robin. "SLINK: an optimally efficient algorithm for the single-link cluster method." 1973. [1]

Complete Linkage: merge two groups based on their worst-case (least similar) pair:

- avoids chaining altogether
- but it's very expensive computationally
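The contrast with single linkage can be sketched for 1-D points: complete linkage scores a candidate merge by its farthest pair, so a single close pair is not enough to trigger a merge:

```python
def single_link_dist(A, B):
    # best-case: distance of the closest cross-cluster pair
    return min(abs(a - b) for a in A for b in B)

def complete_link_dist(A, B):
    # worst-case: distance of the farthest cross-cluster pair
    return max(abs(a - b) for a in A for b in B)

A, B = [0, 1], [2, 10]
# single linkage sees only the close pair (1, 2); complete linkage is
# dominated by the far pair (0, 10), which is why it resists chaining
```

The helper names are illustrative, not a library API.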

References:

- Defays, Daniel. "An efficient algorithm for a complete link method." 1977. [2]

Group-Average Linkage: the similarity between groups $A$ and $B$ is calculated as the average similarity between all pairs $a \in A$ and $b \in B$

- It's way slower than single linkage,
- but it's more robust: it doesn't show the chaining behavior

Speeding up:

- can approximate it by using the distance between centroids: mean doc in $A$ and mean doc in $B$
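Both the exact group average and its centroid approximation can be sketched for 1-D points; the example shows the two can disagree, so the centroid version is only an approximation:

```python
def group_average_dist(A, B):
    # exact: average distance over all cross-cluster pairs
    return sum(abs(a - b) for a in A for b in B) / (len(A) * len(B))

def centroid_dist(A, B):
    # approximation: distance between the two cluster means
    return abs(sum(A) / len(A) - sum(B) / len(B))

A, B = [0, 10], [5]
# the exact average is 5, but the centroid of A sits exactly on B,
# so the centroid distance collapses to 0
```

The approximation replaces $|A| \cdot |B|$ pairwise computations per candidate merge with a single distance, which is the source of the speed-up.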

Ward's Method: merge the pair of clusters that minimizes the total within-group error (the sum of squared distances between each document and its cluster centroid)

Result:

- spherical, tightly bound clusters
- less sensitive to outliers
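One Ward merge step can be sketched as follows, assuming 1-D documents (`ward_best_merge` and the tiny example are illustrative, not a standard API):

```python
def sse(cluster):
    # within-group error: sum of squared distances to the centroid
    m = sum(cluster) / len(cluster)
    return sum((x - m) ** 2 for x in cluster)

def ward_best_merge(clusters):
    """Pick the pair whose merge increases total within-group SSE the least."""
    def cost(i, j):
        return sse(clusters[i] + clusters[j]) - sse(clusters[i]) - sse(clusters[j])

    return min(
        ((i, j) for i in range(len(clusters))
         for j in range(i + 1, len(clusters))),
        key=lambda p: cost(*p),
    )

best = ward_best_merge([[0, 1], [2], [50]])
```

Merging `[0, 1]` with `[2]` barely increases the error, while any merge involving `[50]` inflates it enormously, which is why Ward tends to produce tight clusters.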

References:

- El-Hamdouchi, Abdelmoula, and Peter Willett. "Hierarchic document classification using Ward's clustering method." 1986. [3]

- Single-link algorithms are best for capturing clusters of different sizes and shapes
- but they are also sensitive to noise
- complete link and group average are less affected by noise, but have a bias towards finding globular clusters

Computational complexity:

- only Single-Link is computationally feasible for large datasets, but it often doesn't give good results because it uses too little information

References:

- Rasmussen, Edie M. "Clustering Algorithms." Information Retrieval: Data Structures & Algorithms. 1992. [4]
- Voorhees, Ellen M. "Implementing agglomerative hierarchic clustering algorithms for use in document retrieval." 1986. [5]
- Ertöz, Levent, Michael Steinbach, and Vipin Kumar. "Finding clusters of different sizes, shapes, and densities in noisy, high dimensional data." 2003. [6]
- Oikonomakou, Nora, and Michalis Vazirgiannis. "A review of web document clustering approaches." Data Mining and Knowledge Discovery Handbook. 2010. [7]