Non-Negative Matrix Factorization

Non-Negative Matrix Factorization (NMF) is a Matrix Decomposition technique that is especially well suited for Cluster Analysis


Algorithms for Computing NMF

See Lee2001


Norm Minimization Algorithm

Let $A$ be an $m \times n$ matrix

  • we want to find a rank-$k$ approximation of $A$ (or, in other words, create $k$ clusters)


to do this, we want to find matrices $U$ and $V$ s.t.

  • $J = \cfrac{1}{2} \| A - U V^T \|_F^2$ is minimized, where the norm is the Frobenius Norm
  • $U$ is $m \times k$ matrix
  • $V$ is $n \times k$ matrix
  • columns of $V$ are the basis
  • as we minimize $J$ we get $UV^T \approx A$
  • if $\mathbf a$ is a row of $A$, then $\mathbf a \approx \mathbf u V^T$, where $\mathbf u$ is the corresponding row of $U$
  • so $\mathbf a$ is rewritten using the basis from the columns of $V$ (a small NumPy shape sketch follows the list)
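A minimal NumPy sketch of the shapes involved; the matrices and sizes here are random and purely illustrative (they are not an actual factorization):

import numpy as np

m, n, k = 6, 4, 2                 # e.g. 6 documents, 4 terms, 2 clusters
A = np.random.rand(m, n)          # non-negative data matrix
U = np.random.rand(m, k)          # row (document) representation
V = np.random.rand(n, k)          # columns of V form the basis

A_hat = U @ V.T                   # a rank-k matrix of the same shape as A; NMF finds U, V so that A_hat ≈ A
a_hat = U[0] @ V.T                # first row of A rewritten using the basis given by the columns of V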


The update rules for minimizing $J$ can be derived analytically:

  • for any matrix $Q$, $\| Q \|_F^2 = \text{tr}(QQ^T)$, where $\text{tr}(QQ^T)$ is the Trace of $QQ^T$ (see properties of Frobenius Norm)
  • so $J = \cfrac{1}{2} \, \text{tr}\Big( (A - UV^T) \, (A - UV^T)^T \Big) = \cfrac{1}{2} \, \text{tr} (AA^T) - \text{tr}(A V U^T) + \cfrac{1}{2} \, \text{tr}(U V^T V U^T)$
  • additionally, we have constraints $u_{ij} \geqslant 0$, $v_{ij} \geqslant 0$: use Lagrange Multipliers for this
  • let $\boldsymbol \alpha$ be a matrix of $\alpha_{ij}$ of the same dimension as $U$ and $\boldsymbol \beta$ be a matrix of $\beta_{ij}$ of the same dimension as $V$
  • note that $\text{tr}(\boldsymbol \alpha \, U^T) = \sum_{ij} \alpha_{ij} u_{ij}$ and $\text{tr}(\boldsymbol \beta \, V^T) = \sum_{ij} \beta_{ij} v_{ij}$ - so we'll use them as the Lagrangian terms for the constraints (a quick NumPy check of this identity follows below)
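A quick NumPy check of the trace identity used above; the matrices are random and purely illustrative:

import numpy as np

alpha = np.random.rand(5, 3)      # same shape as U
U = np.random.rand(5, 3)

# tr(alpha U^T) equals the element-wise sum of alpha_{ij} * u_{ij}
assert np.allclose(np.trace(alpha @ U.T), (alpha * U).sum())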


Let's solve it:

  • $L = J + \text{tr}(\boldsymbol \alpha U^T) + \text{tr}(\boldsymbol \beta V^T)$
  • $\cfrac{\partial L}{\partial U} = -AV + U V^T V + \boldsymbol \alpha = 0$
  • $\cfrac{\partial L}{\partial V} = -A^T U + V U^T U + \boldsymbol \beta = 0$ (the matrix-derivative identities used here are listed below)
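For reference, the standard matrix-derivative identities behind these two gradients:

  • $\cfrac{\partial}{\partial U} \, \text{tr}(A V U^T) = AV$ and $\cfrac{\partial}{\partial U} \, \cfrac{1}{2} \text{tr}(U V^T V U^T) = U V^T V$
  • $\cfrac{\partial}{\partial V} \, \text{tr}(A V U^T) = A^T U$ and $\cfrac{\partial}{\partial V} \, \cfrac{1}{2} \text{tr}(U V^T V U^T) = V U^T U$
  • $\cfrac{\partial}{\partial U} \, \text{tr}(\boldsymbol \alpha \, U^T) = \boldsymbol \alpha$, and similarly $\cfrac{\partial}{\partial V} \, \text{tr}(\boldsymbol \beta \, V^T) = \boldsymbol \beta$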


The KKT conditions $\alpha_{ij} \, u_{ij} = 0$ and $\beta_{ij} \, v_{ij} = 0$ give:

  • $(AV)_{ij} \cdot u_{ij} - (U V^T V)_{ij} \cdot u_{ij} = 0$
  • $(A^T U)_{ij} \cdot v_{ij} - (V U^T U)_{ij} \cdot v_{ij} = 0$
  • these follow by substituting $\boldsymbol \alpha = AV - U V^T V$ and $\boldsymbol \beta = A^T U - V U^T U$ from above, so they are independent of $\boldsymbol \alpha$ and $\boldsymbol \beta$


So we have the following multiplicative update rules (a NumPy sketch follows the list):

  • $u_{ij} = \cfrac{(AV)_{ij} \cdot u_{ij}}{(U V^T V)_{ij}}$
  • $v_{ij} = \cfrac{(A^T U)_{ij} \cdot v_{ij}}{(V U^T U)_{ij}}$
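A minimal NumPy sketch of these multiplicative updates; the function name and the small constant eps (added to the denominators to avoid division by zero) are illustrative choices, not part of the derivation above:

import numpy as np

def nmf_frobenius(A, k, n_iter=200, eps=1e-9):
    # multiplicative updates for minimizing 0.5 * ||A - U V^T||_F^2
    m, n = A.shape
    U = np.random.rand(m, k)
    V = np.random.rand(n, k)
    for _ in range(n_iter):
        U = U * (A @ V) / (U @ V.T @ V + eps)     # u_ij <- u_ij * (AV)_ij / (U V^T V)_ij
        V = V * (A.T @ U) / (V @ U.T @ U + eps)   # v_ij <- v_ij * (A^T U)_ij / (V U^T U)_ij
    return U, V

Usage: U, V = nmf_frobenius(A, k=2), after which U @ V.T approximates A.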


SVD vs NMF

  • SVD: basis is orthonormal, NMF is not
  • components of NMF are always non-negative
  • vectors in the basis of NMF directly correspond to clusters
  • we can learn cluster membership by taking the largest component of the vector (in the new representation)



Applications

Latent Semantic Analysis

Why is SVD bad for LSA?

  • if we want to apply Cluster Analysis to cluster documents in the semantic space produced by SVD, we need to use an external algorithm, e.g. K-Means
  • negative values are hard to interpret
  • the objective of SVD is to find an orthogonal basis, which doesn't always result in a good basis for the Semantic Space (see the illustration in Xu2003)


Can use NMF instead of SVD to reveal the latent structure and find the Semantic Space

  • if $k$ is small (compared to the original $n$), then NMF has to compress the data, which tends to reveal latent structure (i.e. "topics") in the data
  • non-negativity is good for interpretability:
  • it ensures that documents can be interpreted as a non-negative combination of the key concepts (topics/clusters)
  • it can find clusters for both terms and documents: columns of $V$ form the basis for documents, columns of $U$ form the basis for terms (see the sketch below)
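A minimal sketch of this interpretation, assuming $A$ is a document-term matrix factorized as $A \approx UV^T$ (e.g. with nmf_frobenius above); the names vocab and n_top are purely illustrative:

import numpy as np

def top_terms_per_topic(V, vocab, n_top=10):
    # V: (n x k) term weights per topic; vocab: list of the n terms
    # for each topic (column of V), return the terms with the largest weights
    return [[vocab[i] for i in np.argsort(V[:, j])[::-1][:n_top]]
            for j in range(V.shape[1])]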



Document Clustering

Assumptions:

  • there are clusters of documents that correspond to coherent topics
  • suppose we have a document-term matrix $D$
  • and want to project documents (rows of $D$) to a $k$-dimensional semantic space
  • i.e. we want to represent each document as a linear (non-negative) combination of $k$ topics


The result of NMF can be interpreted as a clustering directly:

  • decide on cluster membership by finding the basis topic with the greatest projection value
  • can also be fuzzy: e.g. take all topics whose projection value is larger than some threshold (see the sketch below)
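A minimal sketch of both options, assuming $U$ holds the documents' projections (one row per document, one column per topic); the threshold min_weight is an illustrative choice:

import numpy as np

def hard_clusters(U):
    # hard membership: the topic with the greatest projection value for each document
    return np.argmax(U, axis=1)

def fuzzy_clusters(U, min_weight=0.2):
    # fuzzy membership: all topics whose projection value exceeds the threshold
    return [np.where(row > min_weight)[0] for row in U]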


Local NMF

It's a variant of NMF (Li2001)

  • in the usual NMF the basis vectors are not orthogonal and may be more similar to each other than desired
  • LNMF addresses this issue


LNMF adds 3 additional constraints on $U$ and $V$ that aim to find the local features in the original matrix:

  • max sparsity in $V$: want to have as many 0's as possible
  • expressiveness of $U$: retain only those components of $U$ that carry most information about the matrix
  • max orthogonality of $U$


For reported results comparing these factorizations, see Osinski2006.


Implementation

NMF in Python/NumPy

The code below implements the multiplicative updates from Lee1999:

  • these updates maximize the divergence-style objective $F = \sum_{i\mu} \big( V_{i\mu} \log (WH)_{i\mu} - (WH)_{i\mu} \big)$ (computed by seung_objective below), not the Euclidean distance
  • after each update the columns of $W$ are rescaled to sum to one, as in Lee1999
import numpy as np

def seung_objective(V, W, H):
    ''' calculates the non-negative matrix factorization objective
        (the divergence-style objective from Lee1999)

    Usage:
        F = seung_objective(V, W, H)
    Parameters:
        V: a (d x n)-array containing n observations in the columns
        W: (d x k)-array of non-negative basis images (components)
        H: (k x n)-array of weights, one column for each of the n observations
    Returns:
        F: a scalar objective (the multiplicative updates increase it)
    '''
    d, n = V.shape
    WH = np.dot(W, H)

    # F = sum(V * log(WH) - WH), averaged over all d*n entries
    F = (V * np.log(WH) - WH).sum() / (d * n)
    return F


def seung_updateW(V, W, H):
    ''' performs the multiplicative non-negative matrix factorization update for W

    Usage:
        W = seung_updateW(V, W, H)
    Parameters:
        V: a (d x n)-array containing n observations in the columns
        W: (d x k)-array of non-negative basis images (components)
        H: (k x n)-array of weights, one column for each of the n observations
    Returns:
        W: (d x k)-array of updated non-negative basis images (components)
    '''

    WH = np.dot(W, H)
    # multiplicative update: W_ik <- W_ik * sum_j H_kj * V_ij / (WH)_ij
    W_new = W * np.dot(V / WH, H.T)
    # rescale the columns of W to sum to one
    W_new = W_new / np.sum(W_new, axis=0, keepdims=True)
    return W_new

def seung_updateH(V, W, H):
    ''' performs the multiplicative non-negative matrix factorization update for H

    Usage:
        H = seung_updateH(V, W, H)
    Parameters:
        V: a (d x n)-array containing n observations in the columns
        W: (d x k)-array of non-negative basis images (components)
        H: (k x n)-array of weights, one column for each of the n observations
    Returns:
        H: (k x n)-array of updated weights, one column for each of the n observations
    '''

    WH = np.dot(W, H)
    # multiplicative update: H_kj <- H_kj * sum_i W_ik * V_ij / (WH)_ij
    H_new = H * np.dot((V / WH).T, W).T
    return H_new


def seung_nmf(V, k, threshold=1e-5, maxiter=500):
    ''' decomposes V into k components by non-negative matrix factorization

    Usage:
        W, H = seung_nmf(V, k)
    Parameters:
        V: a (d x n)-array containing n observations in the columns
        k: number of components to extract
        threshold: convergence threshold on the change of the objective
        maxiter: maximum number of iterations
    Returns:
        W: (d x k)-array of non-negative basis images (components)
        H: (k x n)-array of weights, one column for each of the n observations
    '''

    d, n = V.shape
    W = np.random.rand(d, k)
    H = np.random.rand(k, n)
    F = seung_objective(V, W, H)

    it_no = 0
    converged = False

    while (not converged) and it_no <= maxiter:
        W_new = seung_updateW(V, W, H)
        H_new = seung_updateH(V, W_new, H)
        F_new = seung_objective(V, W_new, H_new)

        # converged once the objective stops changing
        converged = np.abs(F_new - F) <= threshold
        W, H, F = W_new, H_new, F_new
        it_no = it_no + 1

    return W, H
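A small usage sketch of the functions above, appended to the same script; the data here is random and purely illustrative:

if __name__ == '__main__':
    np.random.seed(0)
    V = np.random.rand(20, 30)    # 30 random observations of dimension 20
    W, H = seung_nmf(V, k=5)

    print('objective:', seung_objective(V, W, H))
    print('relative reconstruction error:', np.linalg.norm(V - W @ H) / np.linalg.norm(V))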


References

  • Lee, Daniel D., and H. Sebastian Seung. "Learning the parts of objects by non-negative matrix factorization." 1999. [1]
  • Lee, Daniel D., and H. Sebastian Seung. "Algorithms for non-negative matrix factorization." 2001. [2]
  • Li, Stan Z., et al. "Learning spatially localized, parts-based representation." 2001. [3]

Sources

  • Aggarwal, Charu C., and ChengXiang Zhai. "A survey of text clustering algorithms." Mining Text Data. Springer US, 2012. [4]
  • Xu, Wei, Xin Liu, and Yihong Gong. "Document clustering based on non-negative matrix factorization." 2003. [5]
  • Osinski, Stanislaw. "Improving quality of search results clustering with approximate matrix factorisations." 2006. [6]
  • Python for Machine Learning (TUB)