$K$-Means
This is the most popular clustering algorithm
Lloyd's Algorithm
Lloyd's algorithm is the most popular way of implementing $k$-means
Algorithm
- First we choose $k$ - the number of clusters we want to get
- Then we randomly initialize $k$ cluster centers (cluster centroids)
This is an iterative algorithm, and on each iteration it does two things:
- cluster assignment step
- move centroids step
Cluster Assignment Step:
- go through each example and choose the closest centroid
- and assign the example to it
Move Centroids Step:
- Calculate the average for each group
- and move the centroids there
Repeat these two steps until convergence
Pseudo Code
$k$-means($k$, $\{ \mathbf x_i \}$):
- randomly initialize $k$ cluster centroids $\boldsymbol \mu_1, \boldsymbol \mu_2, \, … \, , \boldsymbol \mu_k \in \mathbb{R}^n$
- repeat:
- cluster assignment step:
- for $i = 1$ to $m$:
- $c_i \leftarrow$ index of the centroid closest to $\mathbf x_i$, using the squared Euclidean distance $\text{dist} = \| \mathbf x_i - \boldsymbol \mu_j \|^2$
- move centroids step:
- for $j = 1$ to $k$:
- $\boldsymbol \mu_j \leftarrow$ average of all points assigned to cluster $j$
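In practice a library implementation is typically used directly. A minimal sketch with scikit-learn's KMeans (the data X here is just a random placeholder):
import numpy as np
from sklearn.cluster import KMeans

X = np.random.default_rng(0).normal(size=(300, 2))   # placeholder data

km = KMeans(n_clusters=3, init='random', n_init=10, random_state=0).fit(X)
c = km.labels_              # cluster index c_i for each example
mu = km.cluster_centers_    # centroids mu_1, ..., mu_k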
Optimization Objective
Let’s have the following notation
- $c_i \in \{ 1, 2, \ … \ , k \}$ - index of the cluster to which example $\mathbf x_i$ is assigned
- $\boldsymbol \mu_k$ - cluster centroid $k$ ($\boldsymbol \mu_k \in \mathbb{R}^n$)
- $\boldsymbol \mu_{c_i}$ - the centroid of the cluster to which example $\mathbf x_i$ is assigned
e.g.
- $\mathbf x_i$ is assigned to cluster $5$
- then $c_i = 5$ and
- $\boldsymbol \mu_{c_i} = \boldsymbol \mu_5$
So the optimization objective (cost function, sometimes called "distortion") is:
- $J(c_1, \ … \ , c_m, \boldsymbol \mu_1, \ … \ , \boldsymbol \mu_k) = \cfrac{1}{m} \sum_i \left\| \mathbf x_i - \boldsymbol \mu_{c_i} \right\|^2$
we want to find $\min J(c_1, \ … \ , c_m, \boldsymbol \mu_1, \ … \ , \boldsymbol \mu_k)$ with respect to $c_1, \ … \ , c_m, \boldsymbol \mu_1, \ … \ , \boldsymbol \mu_k$
- cluster assignment - minimizes $J$
- with $c_1, \ … \ , c_m$
- holding $\boldsymbol \mu_1, \ … \ , \boldsymbol \mu_k$ fixed
- move centroids - minimizes $J$
- with $\boldsymbol \mu_1 \ , … \ , \boldsymbol \mu_k$
- holding $c_1, \ … \ , c_m$ fixed
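A minimal sketch of computing this distortion in NumPy (the function name distortion and the argument names are mine):
import numpy as np

def distortion(X, centroids, c):
    # J = (1/m) * sum_i || x_i - mu_{c_i} ||^2
    diffs = X - centroids[c]            # x_i - mu_{c_i} for every example
    return (diffs ** 2).sum() / X.shape[0]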
Seed Selection
Seed selection is the process of selecting the initial centroids
Implementation Notes
Random Initialization
How to initialize centroids $\mu = \Big( \boldsymbol \mu_1, \boldsymbol \mu_2, \, … \, , \boldsymbol \mu_k \Big)$?
- should have $k < m$
- randomly pick $k$ training examples
- set $\boldsymbol \mu_1, \ … \ , \boldsymbol \mu_k$ to these $k$ examples
Different clusters
- So $k$-means may converge to different clusters depending on how the centroids were initialized
- In particular, it may end up in a local optimum, so the resulting split won't be the best
- what we can do is run $k$-means several times and choose the best result
Algorithm:
- repeat $n$ times (typically 50 - 1000)
- randomly initialize $k$ centroids
- run k-means, get $c_1, \ … \ , c_m, \boldsymbol \mu_1, \ … \ , \boldsymbol \mu_k$
- compute the cost function $J$
- pick clustering with lowest cost
Multiple random initializations help most when the number of clusters $k$ is small (roughly 2-10); for much larger $k$, the extra restarts usually change little
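A sketch of this restart loop with scikit-learn (n_init=1 so each run uses a single random initialization; inertia_ is the unnormalized distortion, i.e. $J$ without the $1/m$ factor; X is placeholder data):
import numpy as np
from sklearn.cluster import KMeans

X = np.random.default_rng(0).normal(size=(300, 2))   # placeholder data

best_J, best_model = np.inf, None
for seed in range(50):                                # typically 50-1000 restarts
    model = KMeans(n_clusters=3, init='random', n_init=1, random_state=seed).fit(X)
    if model.inertia_ < best_J:                       # keep the clustering with the lowest cost
        best_J, best_model = model.inertia_, model
Note that KMeans already does this internally via its n_init parameter.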
No Data Assigned
If at some iteration we end up with a cluster with no assigned data points, we can:
- get rid of it - look for $k-1$ clusters at the next step (advised)
- randomly re-initialize that cluster centroid (if you really want $k$ clusters)
Choosing the Number of Clusters
How to choose $k$?
- manually - by looking at the data (best)
- other methods: e.g. the Elbow Method
Elbow Method
We can plot values of our distortion function for different $k$
- at first it goes down rapidly
- then goes down slowly
- this is called an “elbow”
- we choose $k$ at the elbow: e.g. if the curve flattens after $k = 3$, we choose $k = 3$
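A sketch of producing such a plot (placeholder data; with scikit-learn, inertia_ plays the role of the distortion $J$):
import numpy as np
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans

X = np.random.default_rng(0).normal(size=(300, 2))   # placeholder data

ks = range(1, 11)
costs = [KMeans(n_clusters=k, n_init=10, random_state=0).fit(X).inertia_ for k in ks]
plt.plot(ks, costs, marker='o')                       # look for the "elbow" where the curve flattens
plt.xlabel('k')
plt.ylabel('distortion J')
plt.show()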
Domain Knowledge
But often it gives a smooth curve with no visible elbow
- In this case you need a metric for how well the clustering performs for your particular downstream purpose
- Use domain knowledge, if possible, to come up with good $k$
Disadvantages
- Quite sensitive to initial seeds - so may need to choose them carefully
- For high-dimensional data such as documents it may not be practical
- centroids may contain lots of words, but we usually want sparse centroids
- doesn’t perform well on data with outliers or with clusters of different sizes or shapes
Seed Selection
The seed selection procedure is very important
- K-means is sensitive to the initial position (which is why in random initialization we run it many times)
- this is especially noticeable for high-dimensional data
We can try to do it smarter than random selection
- e.g. sample and then select seeds using Hierarchical Clustering (like in Scatter/Gather)
- if we have some partial knowledge about labels, use it (so-called Semi-Supervised Clustering)
K-Means++
It’s a smart way of doing seed selection
- see http://en.wikipedia.org/wiki/K-means%2B%2B
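The idea: pick the first seed uniformly at random, then pick each subsequent seed with probability proportional to its squared distance from the nearest already-chosen seed. A minimal sketch (the function name kmeans_pp_init is mine):
import numpy as np

def kmeans_pp_init(X, k, rng=np.random.default_rng(0)):
    n = X.shape[0]
    centroids = [X[rng.integers(n)]]                  # first seed: uniform at random
    for _ in range(k - 1):
        C = np.array(centroids)
        # squared distance from each point to its nearest chosen seed
        d2 = ((X[:, None, :] - C[None, :, :]) ** 2).sum(-1).min(axis=1)
        centroids.append(X[rng.choice(n, p=d2 / d2.sum())])   # sample proportionally to D(x)^2
    return np.array(centroids)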
Variants
Weighted K-Means
Objective:
- $J(\boldsymbol \mu_1, \dots, \boldsymbol \mu_K) = \cfrac{\sum_{i} w_i \min_k \| \mathbf x_i - \boldsymbol \mu_k \|^2}{\sum_{i} w_i}$
- $\boldsymbol \mu_k$ is the $k$-th centroid
- $w_i$ is the weight assigned to each $\mathbf x_i$
Solution:
- Expectation step:
- Find the nearest centroid for each data point:
- $\forall \ 1 \leqslant k \leqslant K: \quad \mathcal{C}(k) \leftarrow \Big\{ i ~:~ k = \mathrm{arg}\min_j \| \mathbf x_i - \boldsymbol \mu_j \|^2 \Big\}$
- Minimization step:
- Recompute each centroid as the (weighted) mean of the associated data points:
- $\forall \ 1 \leqslant k \leqslant K: \quad \boldsymbol \mu_k \leftarrow \cfrac{\sum_{i \in \mathcal{C}(k)} w_i \cdot \mathbf x_i}{\sum_{i \in \mathcal{C}(k)} w_i}$
- until $J$ converges
K-Medoids
Instead of the mean, we take the "medoid" of each cluster (the member point with the smallest total distance to the other members) as its representative
- works better than plain $k$-means for non-Euclidean distances
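A minimal sketch of the medoid update that replaces the move-centroids step (Euclidean distance is used here only for illustration; any distance works):
import numpy as np

def medoid(points):
    # the medoid is the cluster member with the smallest total distance to the other members
    d = np.linalg.norm(points[:, None, :] - points[None, :, :], axis=-1)   # pairwise distances
    return points[d.sum(axis=1).argmin()]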
Bisecting K-Means
This is a variant of K-Means
- it’s a Hierarchical Clustering method, and it’s useful for Document Clustering
Algorithm:
- start with a single cluster
- repeat until we have the desired number of clusters
- choose a cluster to split (e.g. the largest one)
- find two subclusters using K-means with $k = 2$ and split
- the 2-way split may be repeated several times, keeping the split with the highest overall similarity (a sketch follows below)
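A sketch of the procedure, always splitting the largest cluster and using scikit-learn's KMeans for the 2-way split (recent scikit-learn versions also provide a BisectingKMeans estimator):
import numpy as np
from sklearn.cluster import KMeans

def bisecting_kmeans(X, n_clusters):
    clusters = [X]                                       # start with a single cluster
    while len(clusters) < n_clusters:
        i = int(np.argmax([len(c) for c in clusters]))   # choose the largest cluster to split
        to_split = clusters.pop(i)
        labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(to_split)
        clusters += [to_split[labels == 0], to_split[labels == 1]]
    return clusters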
Scatter/Gather
- a special version of k-means for Document Clustering
- uses Hierarchical Clustering on a sample to do seed selection
Approximate K-Means
- Philbin, James, et al. “Object retrieval with large vocabularies and fast spatial matching.” 2007. [http://research.microsoft.com/pubs/64602/philbin07.pdf]
Mini-Batch K-Means
Lloyd’s classical algorithm is slow for large datasets (Sculley2010)
- Use Mini-Batch Gradient Descent for optimizing K-Means
- this reduces the computation cost compared to full-batch $k$-means, while achieving a better solution than Stochastic Gradient Descent
Notation:
- $C$ is the current set of centroids, and $f(C, \mathbf x)$ returns the centroid in $C$ nearest to $\mathbf x$
Algorithm:
- given $k$, batch size $b$, max. number of iterations $t$ and dataset $X$
- initialize each centroid $\boldsymbol \mu \in C$ with a randomly selected example from $X$
- repeat $t$ times:
- $M \leftarrow b$ random examples from $X$
- for $\mathbf x \in M$:
- $d[\mathbf x] = f(C, \mathbf x)$ // cache the centroid nearest to $\mathbf x$
- for $\mathbf x \in M$:
- $\boldsymbol \mu \leftarrow d[\mathbf x]$ // retrieve the cached centroid for this $\mathbf x$
- $v[\boldsymbol \mu] = v[\boldsymbol \mu] + 1$ // counts per centroid
- $\eta = 1 / v[\boldsymbol \mu]$ // per-centroid learning rate
- $\boldsymbol \mu \leftarrow (1 - \eta) \cdot \boldsymbol \mu + \eta \cdot \mathbf x$ //gradient step
Can enforce sparsity by $L_1$ regularization: see Sculley2010
Implementation:
- MiniBatchKMeans in scikit-learn
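A minimal usage sketch (the data and parameter values are only illustrative):
import numpy as np
from sklearn.cluster import MiniBatchKMeans

X = np.random.default_rng(0).normal(size=(10000, 20))   # placeholder data

mbk = MiniBatchKMeans(n_clusters=10, batch_size=100, max_iter=100, random_state=0)
labels = mbk.fit_predict(X)           # cluster assignments
centroids = mbk.cluster_centers_      # learned centroids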
Fuzzy C-Means
Modify the membership function s.t. it outputs the degree of association between item and cluster
- the degree of membership in a cluster depends on the distance from the item to the cluster centroid
Reference:
- Bezdek, James C., Robert Ehrlich, and William Full. “FCM: The fuzzy c-means clustering algorithm.” 1984. [http://web-ext.u-aizu.ac.jp/course/bmclass/documents/FCM%20-%20The%20Fuzzy%20c-Means%20Clustering%20Algorithm.pdf]
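A minimal sketch of the standard FCM updates from Bezdek et al. (alternate between recomputing the centers and the memberships; fuzzifier $m > 1$; the function name and default values are mine):
import numpy as np

def fuzzy_cmeans(X, k, m=2.0, n_iter=100, rng=np.random.default_rng(0)):
    # U[i, j] is the degree of membership of x_i in cluster j
    U = rng.random((X.shape[0], k))
    U /= U.sum(axis=1, keepdims=True)
    for _ in range(n_iter):
        Um = U ** m
        C = Um.T @ X / Um.sum(axis=0)[:, None]        # weighted cluster centers
        d = np.linalg.norm(X[:, None, :] - C[None, :, :], axis=-1) + 1e-12
        U = d ** (-2.0 / (m - 1))                     # membership grows as distance shrinks
        U /= U.sum(axis=1, keepdims=True)
    return C, U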
Implementation
Usual version:
D = distmat(X, C)
calculates the squared distance matrix $D$ between each $\mathbf x_i \in X$ and each $\mathbf c_k \in C$
import numpy as np

def distmat(X, C):
    # X: (n, d) data matrix, C: (k, d) centroid matrix; returns the (n, k) matrix of squared distances
    X2 = np.sum(X * X, axis=1, keepdims=True)
    C2 = np.sum(C * C, axis=1, keepdims=True)
    XC = np.dot(X, C.T)
    D = X2 - 2 * XC + C2.T
    return D
A = closest(D)
returns the assignment matrix $A$ with $(A)_{ik} = 1$ if $\mathbf x_i$ is assigned to cluster $k$ and $(A)_{ik} = 0$ otherwise
def closest(D):
    # indicator matrix: 1 for the centroid at minimal distance from each point
    D_min = D.min(axis=1, keepdims=True)
    return (D == D_min).astype(int)
C = new_centers(X, A)
calculates the new centroids
def new_centers(X, A):
    # A: (n, k) assignment matrix; each new centroid is the mean of its assigned points
    summed = np.dot(A.T, X)
    counts = np.sum(A, axis=0).reshape(-1, 1)
    return summed / counts
For weighted $k$-means it would be:
def new_centers_weighted(X, A, w):
    # w[i] is the weight of example x_i; each centroid is the weighted mean of its points
    W = A * w.reshape(-1, 1)
    weighted_sum = np.dot(W.T, X)
    weights = np.sum(W, axis=0).reshape(-1, 1)
    return weighted_sum / weights
Finally, the weighted cost function:
def J(X, D, w):
D_min = D.min(axis=1)
return (w * D_min).sum() / w.sum()
The algorithm:
def kmeans(X, k):
    n, d = X.shape
    # initialize centroids with k randomly chosen training examples
    M_idx = np.random.choice(np.arange(n), k, replace=False)
    M = X[M_idx]
    converged = False
    while not converged:
        D = distmat(X, M)
        A = closest(D)
        M_new = new_centers(X, A)
        converged = np.abs(M_new - M).sum() <= 1e-8
        M = M_new
    return M
With the weighted cost $J$ the loop becomes:
J_old = np.inf
converged = False
while not converged:
    D = distmat(X, M)
    A = closest(D)
    J_new = J(X, D, w)
    M_new = new_centers_weighted(X, A, w)
    converged = np.abs(J_new - J_old) <= 0.01
    M = M_new
    J_old = J_new
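A quick usage sketch of the functions above on synthetic data (the data generation here is only illustrative):
X = np.vstack([np.random.default_rng(s).normal(loc=3 * s, size=(100, 2)) for s in range(3)])
M = kmeans(X, 3)             # learned centroids, shape (3, 2)
A = closest(distmat(X, M))   # final hard assignments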
See ipython notebook for complete code:
Sources
- Machine Learning (coursera)
- Python for Machine Learning (TUB)
- Machine Learning 1 (TUB)
- Steinbach, Michael, George Karypis, and Vipin Kumar. “A comparison of document clustering techniques.” 2000.
- Aggarwal, Charu C., and ChengXiang Zhai. “A survey of text clustering algorithms.” Mining Text Data. Springer US, 2012. [http://ir.nmu.org.ua/bitstream/handle/123456789/144935/d1784ebed3eab2708026b202b2b65309.pdf?sequence=1#page=90]
- Oikonomakou, Nora, and Michalis Vazirgiannis. “A review of web document clustering approaches.” 2010. [https://scholar.google.com/scholar?cluster=1261203777431390097&hl=ru&as_sdt=0,5]
- Sculley, David. “Web-scale k-means clustering.” 2010. [http://www.ra.ethz.ch/CDstore/www2010/www/p1177.pdf]