$K$-Means
This is the most popular clustering algorithm
Lloyd Algorithm
Lloyd algorithm is the most popular way of implementing k-means
Algorithm
- First we choose $k$ - the number of clusters we want to get
- Then we randomly initialize k cluster centers (cluster centroids)
This is an iterative algorithm, and on each iteration it does 2 things
- cluster assignment step
- move centroids step
Cluster Assignment Step:
- go through each example and choose the closest centroids
- and assign the example to it
Move Centroids Step:
- Calculate the average for each group
- and move the centroids there
Repeat this until converges
Pseudo Code
$k$-means($k$, $\{ \mathbf x_i \}$):
- randomly initialize $k$ cluster centroids $\boldsymbol \mu = \Big( \mu_1, \mu_2, \, ... \, , \mu_k \Big) \in \mathbb{R}^{k + 1}$
- repeat:
- cluster assignment step:
- for $i = 1$ to $m$:
- $c_i \leftarrow$ closest to $\mathbf x_i$ centroid using Euclidean Distance $\text{dist} = \| \mathbf x_i - \boldsymbol \mu_i \|^2$
- move centroids step:
- for $i = 1$ to $k$:
- $\boldsymbol \mu_k \leftarrow$ average of all points assigned to $c_k$
Optimization Objective
Let's have the following notation
- $c_i \in \{ 1, 2, \ ... \ , k \}$ - index of cluster to which example $\mathbf x_i$ is assigned
- $\boldsymbol \mu_k$ - cluster centroid $k$ ($\boldsymbol \mu_k \in \mathbb{R}^n$)
- $\mu_{c_i}$ - cluster centroid of example $\mathbf x_i$
e.g.
- $\mathbf x_i$ is assigned to $5$
- $c_i = 5$ and
- $\mu_{c_i} = \boldsymbol \mu_5$
So optimization objective (cost function, or sometimes called distortion):
- $J(c_1, \ ... \ , c_m, \boldsymbol \mu_1, \ ... \ , \boldsymbol \mu_k) = \cfrac{1}{m} \sum_i \left\| \mathbf x_i - \boldsymbol \mu_{c_i} \right\|^2$
we want to find $\min J(c_1, \ ... \ , c_m, \boldsymbol \mu_1, \ ... \ , \boldsymbol \mu_k)$ with respect to $c_1, \ ... \ , c_m, \boldsymbol \mu_1, \ ... \ , \boldsymbol \mu_k$
- cluster assignment - minimizes $J$
- with $c_1, \ ... \ , c_m$
- holding $\boldsymbol \mu_1, \ ... \ , \boldsymbol \mu_k$ fixed
- move centroids - minimizes $J$
- with $\boldsymbol \mu_1 \ , ... \ , \boldsymbol \mu_k$
- holding $c_1, \ ... \ , c_m$ fixed
Seed Selection
Seed selection is the process of selecting the initial centroids
Implementation Notes
Random Initialization
How to initialize centroids $\mu = \Big( \boldsymbol \mu_1, \boldsymbol \mu_2, \, ... \, , \boldsymbol \mu_k \Big)$?
- should have $k < m$
- randomly pick $k$ training examples
- set $\boldsymbol \mu_1, \ ... \ , \boldsymbol \mu_k$ to these $k$ examples
Different clusters
- So $k$-means may converge to different clusters depending on how the centroids were initialized
- Particularly it may end up in a local optimum - and the split won't be the best
- what we can do is to try it several times and choose the best
Algorithm:
- repeat $n$ times (typically 50 - 1000)
- randomly initialize $k$ centroids
- run k-means, get $c_1, \ ... \ , c_m, \boldsymbol \mu_1, \ ... \ , \boldsymbol \mu_k$
- compute the cost function $J$
- pick clustering with lowest cost
If the number of clusters $k$ is 2-10 then the random initialization makes sense, otherwise - probably not
No Data Assigned
If at iteration step we end up with a cluster with no assigned data points, we can:
- get rid of it - look for $k-1$ clusters at the next step (advised)
- randomly re-initialize that cluster centroid (if you really want $k$ clusters)
Choosing the Number of Clusters
How to choose $k$?
- manually - by looking at the data (best)
- other methods: e.g. the Elbow Method
Elbow Method
We can plot values of our distortion function for different $k$
-
- at first it goes down rapidly
- then goes down slowly
- this is called an "elbow"
- so in this case we choose $k = 3$ because of the elbow
Domain Knowledge
But often it gives a smooth curve with no visible elbow
- In this case you need to use a metrics how well it performs for a particular purpose
- Use domain knowledge, if possible, to come up with good $k$
Disadvantages
- Quite sensitive to initial seeds - so may need to choose them carefully
- For high dimensional data such as documents may be not practical
- centroids may contain lots of words - but we usually want to have sparse centroids
- doesn't perform well on data with outliers or with clusters of different sizes or shapes
Seed Selection
Seed selection procedure is very important
- K-means is sensitive to the initial position (which is why in random initialization we run it many times)
- especially it's noticeable in data with high dimensionality
Can try do it smarter than random
K-Means++
It's a smart way of doing seed selection
Variants
Objective:
- $$J(\boldsymbol \mu_1, \dots, \boldsymbol \mu_K) = \cfrac{\sum_{i} w_i \min_k \| \mathbf x_i- \boldsymbol \mu_k\|^2}{\sum_{i} w_i},$$
- $\boldsymbol \mu_i$ is $i$ centroid
- $w_i$ is weight assigned to each $\mathbf x_i$
Solution:
- Expectation step:
- Find the nearest centroid for each data point:
- $$\forall \ 1 \leqslant k \leqslant K: \quad \mathcal{C}(k) \leftarrow \Big\{ i ~:~ k = \mathrm{arg}\min_k \| \mathbf x_i - \boldsymbol \mu_k \|^2 \Big\}$$
- Minimization step:
- Recompute the centroid as a the (weighted) mean of the associated data points:
- $$\forall \ 1 \leqslant k \leqslant K: \quad c_k \leftarrow \frac{\sum_{i \in \mathcal{C}(k)} w_i \cdot \mathbf x_i}{\sum_{i \in \mathcal{C}(k)} w_i}$$
- until $J$ converges
Instead of mean, we take the "medoid" of each cluster to represent its centroid
- works better for non-euclidean distances than k-means
Bisecting K-Means
This is a variant of K-Means
Algorithm:
- start with a single cluster
- repeat until have desired number of clusters
- choose a cluster to split (e.g. the largest one)
- find two subclusters using K-means with $k = 2$ and split
- may repeat this procedure several times and take the clusters with highest overall similarity
Approximate K-Means
- Philbin, James, et al. "Object retrieval with large vocabularies and fast spatial matching." 2007. [1]
Mini-Batch K-Means
Lloyd's classical algorithm is slow for large datasets (Sculley2010)
Notation:
- $f(C, \mathbf x)$ returns the nearest centroid for $\mathbf x$
Algorithm:
- given $k$, batch size $b$, max. number of iterations $t$ and dataset $X$
- initialize each $\boldsymbol \mu$ with randomly selected elements from $X$
- repeat $t$ times:
- $M \leftarrow b$ random examples from $X$
- for $\mathbf x \in M$:
- $d[\mathbf x] = f(C, \mathbf x)$ // cache the centroid nearest to $\mathbf x$
- for $\mathbf x \in M$:
- $\boldsymbol \mu \leftarrow d[\mathbf x]$
- $v[\boldsymbol \mu] = v[\boldsymbol \mu] + 1$ // counts per centroid
- $\eta = 1 / v[\boldsymbol \mu]$ // per-centroid learning rate
- $\boldsymbol \mu \leftarrow (1 - \eta) \cdot \boldsymbol \mu + \eta \cdot \mathbf x$ //gradient step
Can enforce sparsity by $L_1$ regularization: see Sculley2010
Implementation:
Modify the membership function s.t. it outputs the degree of association between item and cluster
- degree of membership to the cluster depends on the distance from the document to the cluster centroid
Reference:
- Bezdek, James C., Robert Ehrlich, and William Full. "FCM: The fuzzy c-means clustering algorithm." 1984. [2]
Implementation
Usual version:
D = distmat(X, C)
calculates the squared distance matrix $D$ between each $x_i \in X$ and each $\mathbf c_k \in C$
def distmat(X, C):
X2 = np.sum(X * X, axis=1, keepdims=True)
C2 = np.sum(C * C, axis=1, keepdims=True)
XC = np.dot(X, C.T)
D = X2 - 2 * XC + C2.T
return D
A = closest(D)
: returns the closest centroid matrix $A$: with $(A)_{ik} = 1$ if $w_i \in c_k$ and $(A)_{ik} = 0$ if $w_i \not \in c_k$
def closest(D):
D_min = D.min(axis=1, keepdims=True)
return (D == D_min).astype(int)
C = new_centers(X, A, w)
calculates new centroids
def newcenters(X, A):
summed = np.dot(X, A.T)
counts = np.sum(A, 1)
return summed / counts
For weighted k-means it would be
def new_centers(X, A, w):
W = A * w.reshape(-1, 1)
weighted_sum = np.dot(W.T, X)
weights = np.sum(W, axis=0).reshape(-1, 1)
return weighted_sum / weights
Finally, the cost function for weights:
def J(X, D, w):
D_min = D.min(axis=1)
return (w * D_min).sum() / w.sum()
The algorithm:
def kmeans(X, k):
d, n = X.shape
M_idx = np.random.choice(np.arange(n), k, replace=False)
M = X[:, M_idx]
converged = False
while not converged:
D = kmu.distmat(M, X)
A = kmu.closest(D)
M_new = kmu.newcenters(X, A)
converged = np.abs(M_new - M).sum() <= 1e-8
M = M_new
return M
With weighted J
:
while not converged:
D = distmat(X, M)
M = closest(D)
J_new = J(X, D, w)
M_new = new_centers(X, A, w)
converged = np.abs(J_new - J_old) <= 0.01
M = M_new
J_old = J_new
See ipython notebook for complete code:
Sources
- Machine Learning (coursera)
- Python for Machine Learning (TUB)
- Machine Learning 1 (TUB)
- Steinbach, Michael, George Karypis, and Vipin Kumar. "A comparison of document clustering techniques." 2000.
- Aggarwal, Charu C., and ChengXiang Zhai. "A survey of text clustering algorithms." Mining Text Data. Springer US, 2012. [3]
- Oikonomakou, Nora, and Michalis Vazirgiannis. "A review of web document clustering approaches." 2010. [4]
- Sculley, David. "Web-scale k-means clustering." 2010. [5]