# ML Wiki

## $K$-Means

This is the most popular clustering algorithm

## Lloyd Algorithm

Lloyd algorithm is the most popular way of implementing k-means

### Algorithm

• First we choose $k$ - the number of clusters we want to get
• Then we randomly initialize k cluster centers (cluster centroids)

This is an iterative algorithm, and on each iteration it does 2 things

• cluster assignment step
• move centroids step

Cluster Assignment Step:

• go through each example and choose the closest centroids
• and assign the example to it

Move Centroids Step:

• Calculate the average for each group
• and move the centroids there

Repeat this until converges

### Pseudo Code

$k$-means($k$, $\{ \mathbf x_i \}$):

• randomly initialize $k$ cluster centroids $\boldsymbol \mu = \Big( \mu_1, \mu_2, \, ... \, , \mu_k \Big) \in \mathbb{R}^{k + 1}$
• repeat:
• cluster assignment step:
• for $i = 1$ to $m$:
• $c_i \leftarrow$ closest to $\mathbf x_i$ centroid using Euclidean Distance $\text{dist} = \| \mathbf x_i - \boldsymbol \mu_i \|^2$
• move centroids step:
• for $i = 1$ to $k$:
• $\boldsymbol \mu_k \leftarrow$ average of all points assigned to $c_k$

### Optimization Objective

Let's have the following notation

• $c_i \in \{ 1, 2, \ ... \ , k \}$ - index of cluster to which example $\mathbf x_i$ is assigned
• $\boldsymbol \mu_k$ - cluster centroid $k$ ($\boldsymbol \mu_k \in \mathbb{R}^n$)
• $\mu_{c_i}$ - cluster centroid of example $\mathbf x_i$

e.g.

• $\mathbf x_i$ is assigned to $5$
• $c_i = 5$ and
• $\mu_{c_i} = \boldsymbol \mu_5$

So optimization objective (cost function, or sometimes called distortion):

• $J(c_1, \ ... \ , c_m, \boldsymbol \mu_1, \ ... \ , \boldsymbol \mu_k) = \cfrac{1}{m} \sum_i \left\| \mathbf x_i - \boldsymbol \mu_{c_i} \right\|^2$

we want to find $\min J(c_1, \ ... \ , c_m, \boldsymbol \mu_1, \ ... \ , \boldsymbol \mu_k)$ with respect to $c_1, \ ... \ , c_m, \boldsymbol \mu_1, \ ... \ , \boldsymbol \mu_k$

• cluster assignment - minimizes $J$
• with $c_1, \ ... \ , c_m$
• holding $\boldsymbol \mu_1, \ ... \ , \boldsymbol \mu_k$ fixed
• move centroids - minimizes $J$
• with $\boldsymbol \mu_1 \ , ... \ , \boldsymbol \mu_k$
• holding $c_1, \ ... \ , c_m$ fixed

### Seed Selection

Seed selection is the process of selecting the initial centroids

## Implementation Notes

### Random Initialization

How to initialize centroids $\mu = \Big( \boldsymbol \mu_1, \boldsymbol \mu_2, \, ... \, , \boldsymbol \mu_k \Big)$?

• should have $k < m$
• randomly pick $k$ training examples
• set $\boldsymbol \mu_1, \ ... \ , \boldsymbol \mu_k$ to these $k$ examples

Different clusters

• So $k$-means may converge to different clusters depending on how the centroids were initialized
• Particularly it may end up in a local optimum - and the split won't be the best
• what we can do is to try it several times and choose the best

Algorithm:

• repeat $n$ times (typically 50 - 1000)
• randomly initialize $k$ centroids
• run k-means, get $c_1, \ ... \ , c_m, \boldsymbol \mu_1, \ ... \ , \boldsymbol \mu_k$
• compute the cost function $J$
• pick clustering with lowest cost

If the number of clusters $k$ is 2-10 then the random initialization makes sense, otherwise - probably not

### No Data Assigned

If at iteration step we end up with a cluster with no assigned data points, we can:

• get rid of it - look for $k-1$ clusters at the next step (advised)
• randomly re-initialize that cluster centroid (if you really want $k$ clusters)

## Choosing the Number of Clusters

How to choose $k$?

• manually - by looking at the data (best)
• other methods: e.g. the Elbow Method

### Elbow Method

We can plot values of our distortion function for different $k$

• at first it goes down rapidly
• then goes down slowly
• this is called an "elbow"
• so in this case we choose $k = 3$ because of the elbow

### Domain Knowledge

But often it gives a smooth curve with no visible elbow

• In this case you need to use a metrics how well it performs for a particular purpose
• Use domain knowledge, if possible, to come up with good $k$

• Quite sensitive to initial seeds - so may need to choose them carefully
• For high dimensional data such as documents may be not practical
• centroids may contain lots of words - but we usually want to have sparse centroids
• doesn't perform well on data with outliers or with clusters of different sizes or shapes

## Seed Selection

Seed selection procedure is very important

• K-means is sensitive to the initial position (which is why in random initialization we run it many times)
• especially it's noticeable in data with high dimensionality

Can try do it smarter than random

### K-Means++

It's a smart way of doing seed selection

## Variants

### Weighted K-Means

Objective:

• $$J(\boldsymbol \mu_1, \dots, \boldsymbol \mu_K) = \cfrac{\sum_{i} w_i \min_k \| \mathbf x_i- \boldsymbol \mu_k\|^2}{\sum_{i} w_i},$$
• $\boldsymbol \mu_i$ is $i$ centroid
• $w_i$ is weight assigned to each $\mathbf x_i$

Solution:

• Expectation step:
• Find the nearest centroid for each data point:
• $$\forall \ 1 \leqslant k \leqslant K: \quad \mathcal{C}(k) \leftarrow \Big\{ i ~:~ k = \mathrm{arg}\min_k \| \mathbf x_i - \boldsymbol \mu_k \|^2 \Big\}$$
• Minimization step:
• Recompute the centroid as a the (weighted) mean of the associated data points:
• $$\forall \ 1 \leqslant k \leqslant K: \quad c_k \leftarrow \frac{\sum_{i \in \mathcal{C}(k)} w_i \cdot \mathbf x_i}{\sum_{i \in \mathcal{C}(k)} w_i}$$
• until $J$ converges

### K-Medoids

Instead of mean, we take the "medoid" of each cluster to represent its centroid

• works better for non-euclidean distances than k-means

### Bisecting K-Means

This is a variant of K-Means

Algorithm:

• repeat until have desired number of clusters
• choose a cluster to split (e.g. the largest one)
• find two subclusters using K-means with $k = 2$ and split
• may repeat this procedure several times and take the clusters with highest overall similarity

### Approximate K-Means

• Philbin, James, et al. "Object retrieval with large vocabularies and fast spatial matching." 2007. [1]

### Mini-Batch K-Means

Lloyd's classical algorithm is slow for large datasets (Sculley2010)

Notation:

• $f(C, \mathbf x)$ returns the nearest centroid for $\mathbf x$

Algorithm:

• given $k$, batch size $b$, max. number of iterations $t$ and dataset $X$
• initialize each $\boldsymbol \mu$ with randomly selected elements from $X$
• repeat $t$ times:
• $M \leftarrow b$ random examples from $X$
• for $\mathbf x \in M$:
• $d[\mathbf x] = f(C, \mathbf x)$ // cache the centroid nearest to $\mathbf x$
• for $\mathbf x \in M$:
• $\boldsymbol \mu \leftarrow d[\mathbf x]$
• $v[\boldsymbol \mu] = v[\boldsymbol \mu] + 1$ // counts per centroid
• $\eta = 1 / v[\boldsymbol \mu]$ // per-centroid learning rate
• $\boldsymbol \mu \leftarrow (1 - \eta) \cdot \boldsymbol \mu + \eta \cdot \mathbf x$ //gradient step

Can enforce sparsity by $L_1$ regularization: see Sculley2010

Implementation:

### Fuzzy C-Means

Modify the membership function s.t. it outputs the degree of association between item and cluster

• degree of membership to the cluster depends on the distance from the document to the cluster centroid

Reference:

• Bezdek, James C., Robert Ehrlich, and William Full. "FCM: The fuzzy c-means clustering algorithm." 1984. [2]

## Implementation

Usual version:

D = distmat(X, C) calculates the squared distance matrix $D$ between each $x_i \in X$ and each $\mathbf c_k \in C$

def distmat(X, C):
X2 = np.sum(X * X, axis=1, keepdims=True)
C2 = np.sum(C * C, axis=1, keepdims=True)

XC = np.dot(X, C.T)

D = X2 - 2 * XC + C2.T
return D


A = closest(D): returns the closest centroid matrix $A$: with $(A)_{ik} = 1$ if $w_i \in c_k$ and $(A)_{ik} = 0$ if $w_i \not \in c_k$

def closest(D):
D_min = D.min(axis=1, keepdims=True)
return (D == D_min).astype(int)



C = new_centers(X, A, w) calculates new centroids

def newcenters(X, A):
summed = np.dot(X, A.T)
counts = np.sum(A, 1)
return summed / counts


For weighted k-means it would be

def new_centers(X, A, w):
W = A * w.reshape(-1, 1)
weighted_sum = np.dot(W.T, X)
weights = np.sum(W, axis=0).reshape(-1, 1)
return weighted_sum / weights


Finally, the cost function for weights:

def J(X, D, w):
D_min = D.min(axis=1)
return (w * D_min).sum() / w.sum()


The algorithm:

def kmeans(X, k):
d, n = X.shape
M_idx = np.random.choice(np.arange(n), k, replace=False)
M = X[:, M_idx]

converged = False
while not converged:
D = kmu.distmat(M, X)
A = kmu.closest(D)
M_new = kmu.newcenters(X, A)

converged = np.abs(M_new - M).sum() <= 1e-8
M = M_new
return M


With weighted J:

while not converged:
D = distmat(X, M)
M = closest(D)
J_new = J(X, D, w)

M_new = new_centers(X, A, w)
converged = np.abs(J_new - J_old) <= 0.01

M = M_new
J_old = J_new


See ipython notebook for complete code:

## Sources

• Machine Learning (coursera)
• Python for Machine Learning (TUB)
• Machine Learning 1 (TUB)
• Steinbach, Michael, George Karypis, and Vipin Kumar. "A comparison of document clustering techniques." 2000.
• Aggarwal, Charu C., and ChengXiang Zhai. "A survey of text clustering algorithms." Mining Text Data. Springer US, 2012. [3]
• Oikonomakou, Nora, and Michalis Vazirgiannis. "A review of web document clustering approaches." 2010. [4]
• Sculley, David. "Web-scale k-means clustering." 2010. [5]