Non-Negative Matrix Factorization (NMF) is a Matrix Decomposition technique that is especially good for Cluster Analysis

See Lee2001

- Norm Minimization (NMF-ED)
- KL Divergence Minimization (NMF-KL)

Let $A$ be an $m \times n$ matrix

- we want to find a rank-$k$ approximation of $A$ (or, in other words, create $k$ clusters)

to do this, we want to find non-negative matrices $U$ and $V$ that minimize

- $J = \cfrac{1}{2} \| A - U V^T \|_F^2$ where the norm is the Frobenius Norm
- $U$ is $m \times k$ matrix
- $V$ is $n \times k$ matrix
- columns of $V$ are the basis

- as we minimize $J$ we get $UV^T \approx A$
- if $\mathbf a$ is a row of $A$, then $\mathbf a \approx \mathbf u V^T$, where $\mathbf u$ is the corresponding row of $U$
- so $\mathbf a$ is rewritten using the basis from the columns of $V$
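As a quick sanity check on the shapes involved, here is a minimal NumPy sketch; all matrices are random placeholders, not real data:

```python
import numpy as np

# A hypothetical small example: m = 6 rows, n = 4 columns, k = 2 clusters
m, n, k = 6, 4, 2
rng = np.random.default_rng(0)
U = rng.random((m, k))   # m x k, non-negative
V = rng.random((n, k))   # n x k, its columns are the basis

A_approx = U @ V.T       # m x n, the rank-k approximation of A
a = A_approx[0]          # a "row of A" under the approximation
u = U[0]                 # the corresponding row of U, so a == u @ V.T
```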

The update rules can be derived analytically

- for any matrix $Q$, $\| Q \|_F^2 = \text{tr}(QQ^T)$, where $\text{tr}(QQ^T)$ is the Trace of $QQ^T$ (see properties of Frobenius Norm)
- so $J = \frac{1}{2} \, \text{tr}\Big( (A - UV^T) \, (A - UV^T)^T \Big) = \frac{1}{2} \, \text{tr} (AA^T) - \text{tr}(A V U^T) + \frac{1}{2} \, \text{tr}(U V^T V U^T)$
- additionally, we have constraints $u_{ij} \geqslant 0$, $v_{ij} \geqslant 0$: use Lagrange Multipliers for this
- let $\boldsymbol \alpha$ be a matrix of $\alpha_{ij}$ of the same dimension as $U$ and $\boldsymbol \beta$ be a matrix of $\beta_{ij}$ of the same dimension as $V$
- note that $\text{tr}(\boldsymbol \alpha \, U^T) = \sum_{ij} \alpha_{ij} u_{ij}$ and $\text{tr}(\boldsymbol \beta \, V^T) = \sum_{ij} \beta_{ij} v_{ij}$ - so we'll use them as the Lagrangian terms for the constraints
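Both identities are easy to verify numerically; the sketch below uses random matrices purely as a sanity check:

```python
import numpy as np

rng = np.random.default_rng(1)
Q = rng.random((3, 5))

# Frobenius norm identity: ||Q||_F^2 == tr(Q Q^T)
lhs = np.linalg.norm(Q, 'fro') ** 2
rhs = np.trace(Q @ Q.T)

# trace form of the constraint term: tr(alpha U^T) == sum_ij alpha_ij * u_ij
alpha = rng.random((4, 2))
U = rng.random((4, 2))
trace_form = np.trace(alpha @ U.T)
sum_form = (alpha * U).sum()
```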

Let's solve it:

- $L = J + \text{tr}(\boldsymbol \alpha U^T) + \text{tr}(\boldsymbol \beta V^T)$
- $\cfrac{\partial L}{\partial U} = -AV + U V^T V + \boldsymbol \alpha = 0$
- $\cfrac{\partial L}{\partial V} = -A^T U + V U^T U + \boldsymbol \beta = 0$

The KKT conditions $\alpha_{ij} \, u_{ij} = 0$ and $\beta_{ij} \, v_{ij} = 0$ give:

- $(AV)_{ij} \cdot u_{ij} - (U V^T V)_{ij} \cdot u_{ij} = 0$
- $(A^T U)_{ij} \cdot v_{ij} - (V U^T U)_{ij} \cdot v_{ij} = 0$
- these are independent of $\boldsymbol \alpha$ and $\boldsymbol \beta$

So we have the following update rules:

- $u_{ij} = \cfrac{(AV)_{ij} \cdot u_{ij}}{(U V^T V)_{ij}}$
- $v_{ij} = \cfrac{(A^T U)_{ij} \cdot v_{ij}}{(V U^T U)_{ij}}$
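The update rules above can be sketched as a small routine; a minimal illustration, assuming random non-negative initialization and a small `eps` added to the denominators (an assumption here) to avoid division by zero, not a tuned implementation:

```python
import numpy as np

def nmf_ed(A, k, n_iter=200, eps=1e-9):
    '''Multiplicative updates minimizing 0.5 * ||A - U V^T||_F^2.

    A: (m x n) non-negative matrix; k: number of components/clusters.
    eps is a small constant that keeps the denominators strictly positive.
    '''
    m, n = A.shape
    rng = np.random.default_rng(0)
    U = rng.random((m, k))
    V = rng.random((n, k))
    for _ in range(n_iter):
        # elementwise: u_ij <- u_ij * (A V)_ij / (U V^T V)_ij
        U = U * (A @ V) / (U @ V.T @ V + eps)
        # elementwise: v_ij <- v_ij * (A^T U)_ij / (V U^T U)_ij
        V = V * (A.T @ U) / (V @ U.T @ U + eps)
    return U, V
```

Under these updates the reconstruction error $\| A - UV^T \|_F$ is non-increasing, which is the Lee-Seung convergence argument.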

Comparison with SVD:

- in SVD the basis is orthonormal; in NMF it is not
- components of NMF are always non-negative
- vectors in the basis of NMF directly correspond to clusters
- we can learn cluster membership by taking the largest component of the vector (in the new representation)

Why is SVD bad for LSA?

- if we want to apply Cluster Analysis to cluster documents in the semantic space produced by SVD, we need to use an external algorithm, e.g. K-Means
- negative values are hard to interpret
- the objective of SVD is to find an orthogonal basis, which doesn't always result in a good basis for the Semantic Space
- (source: Xu2003)

Can use NMF instead of SVD to reveal the latent structure and find the Semantic Space

- if $k$ is small (compared to original $n$), then NMF is bound to discover latent structure (i.e. "topics") in data
- non-negativity is good for interpretability:
- it ensures that documents can be interpreted as non-negative combination of the key concepts (topics/clusters)
- it can find clusters for both terms and documents: columns of $V$ are basis for docs, columns of $U$ are basis for terms

Assumptions:

- there are clusters of documents that correspond to coherent topics
- suppose we have a document-term matrix $D$
- and want to project documents (rows of $D$) to a $k$-dimensional semantic space
- i.e. we want to represent each document as a linear (non-negative) combination of $k$ topics

Result of NMF can be interpreted as clustering directly:

- decide on cluster membership by finding the basis topic with the greatest projection value
- can also be fuzzy: e.g. take all topics larger than some projection value
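Both membership schemes can be sketched as follows; the projection values are made up for illustration, and in practice they would come from the rows of the learned document factor:

```python
import numpy as np

# Illustrative projections: 3 documents onto k = 2 topics (hypothetical numbers)
doc_topic = np.array([[0.9, 0.1],
                      [0.2, 0.7],
                      [0.5, 0.6]])

# hard clustering: pick the topic with the greatest projection value
hard = doc_topic.argmax(axis=1)

# fuzzy clustering: keep every topic whose projection exceeds a threshold
threshold = 0.4
fuzzy = [set(np.nonzero(row > threshold)[0]) for row in doc_topic]
```

Note that the third document is assigned to topic 1 under hard clustering but belongs to both topics under the fuzzy scheme.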

Local NMF (LNMF) is a variant of NMF (Li2001)

- in usual NMF vectors are not orthogonal and may be more similar to each other than desired
- LNMF addresses this issue

LNMF puts 3 additional constraints on $U$ and $V$ that aim to find the local features in the original matrix:

- max sparsity in $V$: want to have as many 0's as possible
- expressiveness of $U$: retain only those components of $U$ that carry most information about the matrix
- max orthogonality of $U$

Results reported by Osinski2006:

- slow convergence
- not good for Information Retrieval clustering because of sparsity

NMF-ED = Euclidean distance minimization

- Algorithm from Lee1999
- note: the objective in the code below, $F \propto \sum_{ij} \big( V_{ij} \log (WH)_{ij} - (WH)_{ij} \big)$, is the divergence (KL) form that Lee1999 maximizes, so the implementation is effectively NMF-KL rather than NMF-ED

```python
import numpy as np


def seung_objective(V, W, H):
    '''calculates the non-negative matrix factorization objective

    Usage: F = seung_objective(V, W, H)
    Parameters:
        V: a (d x n)-array containing n observations in the columns
        W: (d x k)-array of non-negative basis images (components)
        H: (k x n)-array of weights, one column for each of the n observations
    Returns:
        F: a scalar objective
    '''
    d, n = V.shape
    WH = np.dot(W, H)
    F = (V * np.log(WH) - WH).sum() / (d * n)
    return F


def seung_updateW(V, W, H):
    '''performs the multiplicative non-negative matrix factorization update for W

    Usage: W = seung_updateW(V, W, H)
    Parameters: same as in seung_objective
    Returns:
        W: (d x k)-array of updated non-negative basis images (components)
    '''
    WH = np.dot(W, H)
    W_new = W * np.dot(V / WH, H.T)
    # normalize the columns of W so the scale goes into H
    W_new = W_new / np.sum(W_new, axis=0, keepdims=True)
    return W_new


def seung_updateH(V, W, H):
    '''performs the multiplicative non-negative matrix factorization update for H

    Usage: H = seung_updateH(V, W, H)
    Parameters: same as in seung_objective
    Returns:
        H: (k x n)-array of updated weights, one column for each of the n observations
    '''
    WH = np.dot(W, H)
    H_new = H * np.dot((V / WH).T, W).T
    return H_new


def seung_nmf(V, k, threshold=1e-5, maxiter=500):
    '''decomposes V into k components by non-negative matrix factorization

    Usage: W, H = seung_nmf(V, k)
    Parameters:
        V: a (d x n)-array containing n observations in the columns
        k: number of components to extract
        threshold: absolute change in the objective below which we stop
        maxiter: maximum number of iterations
    Returns:
        W: (d x k)-array of non-negative basis images (components)
        H: (k x n)-array of weights, one column for each of the n observations
    '''
    d, n = V.shape
    W = np.random.rand(d, k)
    H = np.random.rand(k, n)
    F = seung_objective(V, W, H)

    it_no = 0
    converged = False
    while (not converged) and it_no < maxiter:
        W_new = seung_updateW(V, W, H)
        H_new = seung_updateH(V, W_new, H)
        F_new = seung_objective(V, W_new, H_new)
        converged = np.abs(F_new - F) <= threshold
        W, H = W_new, H_new
        F = F_new  # keep the previous objective in sync (bug fix)
        it_no = it_no + 1

    return W, H
```

- Lee, Daniel D., and H. Sebastian Seung. "Learning the parts of objects by non-negative matrix factorization." 1999.
- Lee, Daniel D., and H. Sebastian Seung. "Algorithms for non-negative matrix factorization." 2001.
- Li, Stan Z., et al. "Learning spatially localized, parts-based representation." 2001.

- Aggarwal, Charu C., and ChengXiang Zhai. "A survey of text clustering algorithms." Mining Text Data. Springer US, 2012.
- Xu, Wei, Xin Liu, and Yihong Gong. "Document clustering based on non-negative matrix factorization." 2003.
- Osinski, Stanislaw. "Improving quality of search results clustering with approximate matrix factorisations." 2006.
- Python for Machine Learning (TUB)