This is the most popular clustering algorithm

Lloyd algorithm is the most popular way of implementing k-means

- First we choose $k$ - the number of clusters we want to get
- Then we randomly initialize k cluster centers (cluster centroids)

This is an iterative algorithm, and on each iteration it does 2 things

- cluster assignment step
- move centroids step

Cluster Assignment Step:

- go through each example and choose the closest centroids
- and assign the example to it

Move Centroids Step:

- Calculate the average for each group
- and move the centroids there

Repeat this until converges

$k$-means($k$, $\{ \mathbf x_i \}$):

- randomly initialize $k$ cluster centroids $\boldsymbol \mu = \Big( \mu_1, \mu_2, \, ... \, , \mu_k \Big) \in \mathbb{R}^{k + 1}$
- repeat:
- cluster assignment step:
- for $i = 1$ to $m$:
- $c_i \leftarrow$ closest to $\mathbf x_i$ centroid using Euclidean Distance $\text{dist} = \| \mathbf x_i - \boldsymbol \mu_i \|^2$

- move centroids step:
- for $i = 1$ to $k$:
- $\boldsymbol \mu_k \leftarrow$ average of all points assigned to $c_k$

Let's have the following notation

- $c_i \in \{ 1, 2, \ ... \ , k \}$ - index of cluster to which example $\mathbf x_i$ is assigned
- $\boldsymbol \mu_k$ - cluster centroid $k$ ($\boldsymbol \mu_k \in \mathbb{R}^n$)
- $\mu_{c_i}$ - cluster centroid of example $\mathbf x_i$

e.g.

- $\mathbf x_i$ is assigned to $5$
- $c_i = 5$ and
- $\mu_{c_i} = \boldsymbol \mu_5$

So optimization objective (cost function, or sometimes called *distortion*):

- $J(c_1, \ ... \ , c_m, \boldsymbol \mu_1, \ ... \ , \boldsymbol \mu_k) = \cfrac{1}{m} \sum_i \left\| \mathbf x_i - \boldsymbol \mu_{c_i} \right\|^2$

we want to find $\min J(c_1, \ ... \ , c_m, \boldsymbol \mu_1, \ ... \ , \boldsymbol \mu_k)$ with respect to $c_1, \ ... \ , c_m, \boldsymbol \mu_1, \ ... \ , \boldsymbol \mu_k$

- cluster assignment - minimizes $J$
- with $c_1, \ ... \ , c_m$
- holding $\boldsymbol \mu_1, \ ... \ , \boldsymbol \mu_k$ fixed

- move centroids - minimizes $J$
- with $\boldsymbol \mu_1 \ , ... \ , \boldsymbol \mu_k$
- holding $c_1, \ ... \ , c_m$ fixed

Seed selection is the process of selecting the initial centroids

How to initialize centroids $\mu = \Big( \boldsymbol \mu_1, \boldsymbol \mu_2, \, ... \, , \boldsymbol \mu_k \Big)$?

- should have $k < m$
- randomly pick $k$ training examples
- set $\boldsymbol \mu_1, \ ... \ , \boldsymbol \mu_k$ to these $k$ examples

Different clusters

- So $k$-means may converge to different clusters depending on how the centroids were initialized
- Particularly it may end up in a local optimum - and the split won't be the best
- what we can do is to try it several times and choose the best

**Algorithm**:

- repeat $n$ times (typically 50 - 1000)
- randomly initialize $k$ centroids
- run k-means, get $c_1, \ ... \ , c_m, \boldsymbol \mu_1, \ ... \ , \boldsymbol \mu_k$
- compute the cost function $J$

- pick clustering with lowest cost

If the number of clusters $k$ is 2-10 then the random initialization makes sense, otherwise - probably not

If at iteration step we end up with a cluster with no assigned data points, we can:

- get rid of it - look for $k-1$ clusters at the next step (advised)
- randomly re-initialize that cluster centroid (if you really want $k$ clusters)

How to choose $k$?

- manually - by looking at the data (best)
- other methods: e.g. the Elbow Method

We can plot values of our distortion function for different $k$

- at first it goes down rapidly
- then goes down slowly
- this is called an "elbow"
- so in this case we choose $k = 3$ because of the elbow

But often it gives a smooth curve with no visible elbow

- In this case you need to use a metrics how well it performs for a particular purpose
- Use domain knowledge, if possible, to come up with good $k$

- Quite sensitive to initial seeds - so may need to choose them carefully
- For high dimensional data such as documents may be not practical
- centroids may contain lots of words - but we usually want to have sparse centroids

- doesn't perform well on data with outliers or with clusters of different sizes or shapes

Seed selection procedure is very important

- K-means is sensitive to the initial position (which is why in random initialization we run it many times)
- especially it's noticeable in data with high dimensionality

Can try do it smarter than random

- e.g. sample and then select seeds using Hierarchical Clustering (like in Scatter/Gather)
- if have some partial knowledge about labels - use it (it'll be so-called Semi-Supervised Clustering)

It's a smart way of doing seed selection

Objective:

- $$J(\boldsymbol \mu_1, \dots, \boldsymbol \mu_K) = \cfrac{\sum_{i} w_i \min_k \| \mathbf x_i- \boldsymbol \mu_k\|^2}{\sum_{i} w_i},$$
- $\boldsymbol \mu_i$ is $i$ centroid
- $w_i$ is weight assigned to each $\mathbf x_i$

Solution:

- Expectation step:
- Find the nearest centroid for each data point:
- $$\forall \ 1 \leqslant k \leqslant K: \quad \mathcal{C}(k) \leftarrow \Big\{ i ~:~ k = \mathrm{arg}\min_k \| \mathbf x_i - \boldsymbol \mu_k \|^2 \Big\}$$

- Minimization step:
- Recompute the centroid as a the (weighted) mean of the associated data points:
- $$\forall \ 1 \leqslant k \leqslant K: \quad c_k \leftarrow \frac{\sum_{i \in \mathcal{C}(k)} w_i \cdot \mathbf x_i}{\sum_{i \in \mathcal{C}(k)} w_i}$$

- until $J$ converges

Instead of mean, we take the "medoid" of each cluster to represent its centroid

- works better for non-euclidean distances than k-means

This is a variant of K-Means

- it's a Hierarchical Clustering method, and it's useful for Document Clustering

Algorithm:

- start with a single cluster
- repeat until have desired number of clusters
- choose a cluster to split (e.g. the largest one)
- find two subclusters using K-means with $k = 2$ and split
- may repeat this procedure several times and take the clusters with highest overall similarity

- a special version of k-means for Document Clustering
- uses Hierarchical Clustering on a sample to do seed selection

- Philbin, James, et al. "Object retrieval with large vocabularies and fast spatial matching." 2007. [1]

Lloyd's classical algorithm is slow for large datasets (Sculley2010)

- Use Mini-Batch Gradient Descent for optimizing K-Means
- reduces complexity while achieving better solution than Stochastic Gradient Descent

Notation:

- $f(C, \mathbf x)$ returns the nearest centroid for $\mathbf x$

Algorithm:

- given $k$, batch size $b$, max. number of iterations $t$ and dataset $X$
- initialize each $\boldsymbol \mu$ with randomly selected elements from $X$
- repeat $t$ times:
- $M \leftarrow b$ random examples from $X$
- for $\mathbf x \in M$:
- $d[\mathbf x] = f(C, \mathbf x)$ // cache the centroid nearest to $\mathbf x$

- for $\mathbf x \in M$:
- $\boldsymbol \mu \leftarrow d[\mathbf x]$
- $v[\boldsymbol \mu] = v[\boldsymbol \mu] + 1$ // counts per centroid
- $\eta = 1 / v[\boldsymbol \mu]$ // per-centroid learning rate
- $\boldsymbol \mu \leftarrow (1 - \eta) \cdot \boldsymbol \mu + \eta \cdot \mathbf x$ //gradient step

Can enforce sparsity by $L_1$ regularization: see Sculley2010

Implementation:

- MiniBatchKMeans in scikit-learn

Modify the membership function s.t. it outputs the degree of association between item and cluster

- degree of membership to the cluster depends on the distance from the document to the cluster centroid

Reference:

- Bezdek, James C., Robert Ehrlich, and William Full. "FCM: The fuzzy c-means clustering algorithm." 1984. [2]

Usual version:

`D = distmat(X, C)`

calculates the squared distance matrix $D$ between each $x_i \in X$ and each $\mathbf c_k \in C$

def distmat(X, C): X2 = np.sum(X * X, axis=1, keepdims=True) C2 = np.sum(C * C, axis=1, keepdims=True) XC = np.dot(X, C.T) D = X2 - 2 * XC + C2.T return D

`A = closest(D)`

: returns the closest centroid matrix $A$: with $(A)_{ik} = 1$ if $w_i \in c_k$ and $(A)_{ik} = 0$ if $w_i \not \in c_k$

def closest(D): D_min = D.min(axis=1, keepdims=True) return (D == D_min).astype(int)

`C = new_centers(X, A, w)`

calculates new centroids

def newcenters(X, A): summed = np.dot(X, A.T) counts = np.sum(A, 1) return summed / counts

For weighted k-means it would be

def new_centers(X, A, w): W = A * w.reshape(-1, 1) weighted_sum = np.dot(W.T, X) weights = np.sum(W, axis=0).reshape(-1, 1) return weighted_sum / weights

Finally, the cost function for weights:

def J(X, D, w): D_min = D.min(axis=1) return (w * D_min).sum() / w.sum()

The algorithm:

def kmeans(X, k): d, n = X.shape M_idx = np.random.choice(np.arange(n), k, replace=False) M = X[:, M_idx] converged = False while not converged: D = kmu.distmat(M, X) A = kmu.closest(D) M_new = kmu.newcenters(X, A) converged = np.abs(M_new - M).sum() <= 1e-8 M = M_new return M

With weighted `J`

:

while not converged: D = distmat(X, M) M = closest(D) J_new = J(X, D, w) M_new = new_centers(X, A, w) converged = np.abs(J_new - J_old) <= 0.01 M = M_new J_old = J_new

See ipython notebook for complete code:

- Machine Learning (coursera)
- Python for Machine Learning (TUB)
- Machine Learning 1 (TUB)
- Steinbach, Michael, George Karypis, and Vipin Kumar. "A comparison of document clustering techniques." 2000.
- Aggarwal, Charu C., and ChengXiang Zhai. "A survey of text clustering algorithms." Mining Text Data. Springer US, 2012. [3]
- Oikonomakou, Nora, and Michalis Vazirgiannis. "A review of web document clustering approaches." 2010. [4]
- Sculley, David. "Web-scale k-means clustering." 2010. [5]