ML Wiki

Singular Value Decomposition

SVD is a decomposition of rectangular $m \times n$ matrix $A$ as

• $A = U \Sigma V^T$ where
• $U$ is an $m \times m$ orthogonal matrix with Eigenvectors of $A A^T$
• $\Sigma$ is an diagonal $m \times n$ matrix with Eigenvalues of both $A^T A$ and $A A^T$
• $V$ is an $n \times n$ orthogonal matrix with Eigenvalues of $A^T A$

Orthogonal Basis for the Four Fundamental Subspaces

But it's not only a decomposition, but a way of finding the bases for the Four Fundamental Subspaces of $A$: • Singular vectors $\mathbf v_1, \ ... \ , \mathbf v_r$ are in the row space of $A$
• applying $A$ to $\mathbf v_i$ gives $A \mathbf v_i = \sigma_i \mathbf u_i$
• $\mathbf u_1, \ ... \ , \mathbf u_r$ are in the column space of $A$
• Singular values $\sigma_1, \ ... \ , \sigma_r$ are all positive numbers
• so $V$ and $U$ diagonalize $A$:
• $A \mathbf v_i = \sigma_i \mathbf u_i$ $\Rightarrow$ $A V = \Sigma U$
• The singular values $\sigma_i$ in $\Sigma$ are arranged in monotonic non-increasing order

EVD vs SVD

Eigenvalue Decomposition

Problems with general Eigendecomposition $A = S \, \Lambda \, S^{-1}$:

• doesn't work with rectangular matrices
• eigenvalues in $S$ are usually not orthonormal (unless $A$ is symmetric)

Our goal:

• $A = U \Sigma V^T$
• we want to find the orthogonal basis in the Row Space $C(A^T)$ of $A$
• and we then map this basis to some orthogonal basis in the Column Space $C(A)$ of $A$
• these vectors are called singular vectors

Solution:

• choose basis from $AA^T$ and $A^T A$ - they are symmetric and have orthonormal basis

Spectral Theorem

SVD extends the Spectral Theorem

• it's EVD for all symmetric positive-definite matrices
• we extend EVD to all rectangular matrices $A$

Finding SVD

Goal:

• find orthonormal bases in the row space of $A$ as well as in the column space of $A$
• s.t. $A$ maps from row space basis to the column space basis
• and the matrix $A$ is diagonal w.r.t. this basis

Orthogonalization

Finding orthogonal basis for the rowspace $C(A^T)$

• let $r$ be the rank of $A$
• select orthonormal basis $\mathbf v_1, \ ... \ , \mathbf v_r$ in $\mathbb R^n$ s.t. it spans the Row Space of $A$
• e.g. using the Gram-Schmidt Process on the rows of $A^T$
• continue the process to find $\mathbf v_{r+1}, \ ... \ , \mathbf v_n$ in $\mathbb R^n$ s.t it spans the Nullspace of $A$

Then for $i = 1 .. r$ define $\mathbf u_i$ as $A \mathbf v_i$

• i.e. $A \mathbf v_i = \sigma_i \mathbf u_i$
• extend this to a basis in $\mathbb R^m$
• relative to these bases, $A$ will have diagonal representation

Here $\{ \ \mathbf v_i \ \}$ are orthogonal by construction

• but $\{ \ \mathbf u_i \ \}$ aren't necessarily orthogonal
• we want to find such $\{ \ \mathbf v_i \ \}$ that $\{ \ \mathbf u_i \ \}$ are also orthogonal

We can use EVD to find the right basis

• Let $\{ \ \mathbf v_i \ \}$ be eigenvectors of $A^T A$ with $\lambda_i$ being corresponding eigenvalues
• so $A^T A \mathbf v_i = \lambda_i \mathbf v_i$ and EVD is $A^T A = V \Lambda V^T$ (with $\mathbf v_i$ being the columns of $V$)

Will it give the right bases?

• the Inner Product $\langle A \mathbf v_i, A \mathbf v_j \rangle$ is $(A \mathbf v_i)^T (A \mathbf v_j) = \mathbf v_i^T A^T A \mathbf v_j = \mathbf v_i^T (A^T A \mathbf v_j) = \mathbf v_i^T \lambda_j \mathbf v_j = \lambda_j \mathbf v_i^T \mathbf v_j$
• if $i \ne j$, then $\mathbf v_i^T \mathbf v_j =0$
• so the image $\big\{ A \mathbf v_1, \ ... \ , A \mathbf v_n \big\}$ is also orthogonal

Finding the orthonormal $\{ \ \mathbf u_i \ \}$

• vectors $A \mathbf v_i$ are orthogonal, but not orthonormal
• $\| A \mathbf v_i \|^2 = \langle A \mathbf v_i, A \mathbf v_i \rangle = \mathbf v_i^T A^T A \mathbf v_i = \mathbf v_i^T \lambda_i \mathbf v_i = \lambda_i$
• let $\mathbf u_i = \cfrac{A \mathbf v_i}{\| A \mathbf v_i \|} = \cfrac{1}{\sqrt{\lambda_i}} A \mathbf v_i$ for $i = 1 .. r$
• if $r < m$, we extend this basis for $\mathbb R^m$

This completes the construction for the bases

• Let $\sigma_i = \sqrt{\lambda_i}$. Then $\mathbf u_i = \cfrac{1}{\sigma_i} A \mathbf v_i$
• or $A \mathbf v_i = \sigma_i \mathbf u_i$
• Put $\{ \mathbf v_1, \ ... \ , \mathbf v_r \}$ in columns of $V$ and $\{ \mathbf u_1, \ ... \ , \mathbf u_r \}$ in columns of $U$
• so we'll have $A V = U \Sigma$
• thus, SVD is $A = U \Sigma V^T$

Summary:

• $A$ is $m \times n$ real matrix
• express $A = U \Sigma V^T$
• $V$ is obtained from diagonal factorization $A^T A = V \Lambda V^T$
• $U$ is normalized image $\big\{ A \mathbf v_1, \ ... \ , A \mathbf v_n \big\}$
• non-zero entries $\sigma_i$ of $\Sigma$ are square roots of $\lambda_i$ from $\Lambda$: $\sigma_i = \sqrt{\lambda_i}$

This construction shows that SVD exists, but it doesn't mean that it's the most effective way of implementing it

• the computation of $A^T A$ can lead to loss of precision (because of the way numbers are stored in memory)
• there are direct methods of computing SVD on $A$, without having to compute $A^T A$

There's duality: we can do the save for $AA^T$:

• EVD is $AA^T = U \Lambda U^T$, $\mathbf u_i$ are columns of $U$
• let's apply $A^T$ to these $\mathbf u_i$
• The image of this tranformation is also orthogonal: $\langle A^T \mathbf u_i, A^T \mathbf u_j \rangle = \lambda_i$ if $i = j$ and $0$ otherwise
• we normalize $A^T \mathbf u_i$ by $\sigma_i = \sqrt{\lambda_i}$
• so it's completely the same, but coming from the column space side

$\Sigma$: Eigenvalues of $A^T A$ and $AA^T$

What is more, the eigenvalues of $A^T A$ and $AA^T$ are the same!

Let's first show that if $\lambda$ is eigenvalue for $A^T A$, then it's an eigenvalue for $AA^T$

• let $\lambda \ne 0$ be an eigenvalue of $A^T A$ with corresponding eigenvector $\mathbf v \ne \mathbf 0$
• then $A^T A \mathbf v = \lambda \mathbf v$. Multiply by $A$ on the left:
• $A A^T A \mathbf v = \lambda A \mathbf v$
• let $\mathbf u = A \mathbf v$, then $A A^T \mathbf u = \lambda \mathbf u$
• so $\lambda$ is an eigenvalue for $A A^T$ as well, with eigenvector $\mathbf u = A \mathbf v$

Now show that if $\lambda$ is eigenvalue for $AA^T$ then it's also eigenvalue for $A^T A$

• same idea as before
• let $\lambda \ne 0$ be an eigenvalue of $A A^T$ with corresponding eigenvector $\mathbf u \ne \mathbf 0$
• then $A A^T \mathbf u = \lambda \mathbf u$. Multiply by $A^T$ on the left
• $A^T A A^T \mathbf u = \lambda A^T \mathbf u$
• by letting $\mathbf v = A^T \mathbf u$ we have $A^T A \mathbf v = \lambda \mathbf v$
• so $\lambda$ is an eigenvalue for $A A^T$ as well, with eigenvector $\mathbf v = A^T \mathbf u$

$\square$

Calculating eigenvalues

• So, for example, if $A$ is $500 \times 2$, then $AA^T$ is $500 \times 500$ and $A^T A$ is $2 \times 2$
• we calculate eigenvalues for $A^T A$, (there are 2 of them)
• and we know that $AA^T$ has the same 2 eigenvalues - with the rest 498 being 0

Reconstructing EVD from SVD

We saw how to construct SVD using EVD, but we can also reconstruct EVD from SVD

• let $A = U \Sigma V^T$, then
• $A^T A = V \Sigma^T \Sigma V^T = V \Sigma^2 V^T$ is EVD of $A^T A$
• $A A^T = U \Sigma \Sigma^T U^T = U \Sigma^2 U^T$ is EVD of $A A^T$
• where $\Sigma^T \Sigma = \Sigma \Sigma^T = \Sigma^2 = \text{diag}(\sigma_1^2, \ ... \ , \sigma_r^2)$

If $A$ is square and symmetric, then $A = A^T$ and $A^T A = A A^T = A^2$

• and any eigenvector $\mathbf v$ of $A$ with eigenvalue $\lambda$ is eigenvector of $A^2$ with eigenvalue $\lambda^2$
• so $U = V$ and EVD = SVD when $A$ is positive semi-definite (no negative eigenvalues)

Geometric Interpretation

Let's understand how $A$ deforms the space

• consider a unit sphere in $\mathbb R^n$
• a vector $\mathbf x \in \mathbb R^n$ is represented as $\mathbf x = \sum x_i \mathbf v_i$
• because it's a sphere, $\sum x_i^2 = 1$
• then the image $A \mathbf x = \sum x_i A \mathbf v_i = \sum x_i A \mathbf v_i = \sum \sigma_i x_i \mathbf u_i$
• let $y_i = x_i \sigma_i$
• then $A \mathbf x = \sum y_i \mathbf u_i$
• $\sum\limits_{i = 1}^r \cfrac{y_i^2}{\sigma_i^2} = \sum\limits_{i = 1}^r x_i^2 \leqslant 1$
• if $A$ has full rank, then the sum is strictly $1$

So $A$ maps the unit sphere in $\mathbb R^n$ to some $r$-dimensional ellipsoid in $\mathbb R^m$ with axes in directions $\mathbf u_i$, each with magnitudes $\sigma_i$

• Linear transformation:
• So first it collapses $n - r$ dimensions of the domain
• then it distorts the remaining dimensions stretching and squeezing the $r$-dim unit sphere into an ellipsoid
• finally it embeds the ellipsoid into $\mathbb R^m$
• • From (Kalman96)
• $n = m = 3$, $r = 2$

Another way:

• • From (Strang93)

Representation

Partitioned Matrices

Let's have a look at $A = U \Sigma V^T$ for $m \times n$ matrix $A$:

• $A = \left[ \begin{array}{cccc|ccc} | & | & & | & | & & | \\ | & | & & | & | & & | \\ \mathbf u_1 & \mathbf u_2 & \cdots & \mathbf u_r & \mathbf u_{r+1} & \cdots & \mathbf u_m \\ | & | & & | & | & & | \\ | & | & & | & | & & | \\ \end{array} \right] \left[ \begin{array}{cccc|ccc} \sigma_1 & & & & \\ & \sigma_2 & & & & \\ & & \ddots & & & \\ & & & \sigma_r & & \\ \hline & & & & 0 & \\ & & & & & \ddots & \\ & & & & & & 0 \\ \end{array} \right] \begin{bmatrix} - & \mathbf v_1^T & - \\ & \vdots & \\ - & \mathbf v_1^T & - \\ \hline - & \mathbf v_{r+1}^T & - \\ & \vdots & \\ - & \mathbf v_n^T & - \\ \end{bmatrix}$
• Then using Matrix Multiplication for block-partitioned matrices, we see that
• $A = \begin{bmatrix} | & | & & | \\ \mathbf u_1 & \mathbf u_2 & \cdots & \mathbf u_r \\ | & | & & | \\ \end{bmatrix} \begin{bmatrix} \sigma_1 & & \\ & \sigma_2 & & \\ & & \ddots & \\ & & & \sigma_r \\ \end{bmatrix} \begin{bmatrix} - & \mathbf v_1^T & - \\ & \vdots & \\ - & \mathbf v_1^T & - \\ \end{bmatrix} + \begin{bmatrix} | & & | \\ \mathbf u_{r+1} & \cdots & \mathbf u_m \\ | & & | \\ \end{bmatrix} \begin{bmatrix} 0 & & \\ & \ddots & \\ & & 0 \\ \end{bmatrix} \begin{bmatrix} - & \mathbf v_{r+1}^T & - \\ & \vdots & \\ - & \mathbf v_n^T & - \\ \end{bmatrix}$
• so, $A = \begin{bmatrix} | & | & & | \\ \mathbf u_1 & \mathbf u_2 & \cdots & \mathbf u_r \\ | & | & & | \\ \end{bmatrix} \begin{bmatrix} \sigma_1 & & \\ & \sigma_2 & & \\ & & \ddots & \\ & & & \sigma_r \\ \end{bmatrix} \begin{bmatrix} - & \mathbf v_1^T & - \\ & \vdots & \\ - & \mathbf v_1^T & - \\ \end{bmatrix}$
• so only first $r$ $\mathbf v_i$'s and $\mathbf u_i$'s contribute something
• now $U$ and $V$ become rectangular and $\Sigma$ square:

SVD is $A = U \Sigma V^T$

• $U$ is $m \times r$ matrix s.t. $U^T U = I$
• $\Sigma$ is $r \times r$ diagonal matrix $\text{diag}(\sigma_1, \ ... \ , \sigma_r)$
• $V$ is $n \times r$ matrix s.t. $V^T V = I$

Outer Product Form

A matrix multiplication $AB$ can be expressed as a sum of outer products:

• let $A$ be $n \times k$ matrix and $B$ be $k \times m$ matrix
• then $AB = \sum\limits_{i=1}^k \mathbf a_i \mathbf b_i^T$
• where $\mathbf a_i$ are columns of $A$ and $\mathbf b_i$ are rows of $B$

Thus we can represent $A = U \Sigma V^T$ as sum of outer products:

• $A = \sum\limits_{i = 1}^r \sigma_i \mathbf u_i \mathbf v_i^T$

It gives another way of thinking about the Linear Tranformation $f(\mathbf x) = A \mathbf x$

• $A \mathbf x = (\sum \sigma_i \mathbf u_i \mathbf v_i^T) \mathbf x = \sum \sigma_i \mathbf u_i (\mathbf v_i^T \mathbf x) = \sum \sigma_i (\mathbf v_i^T \mathbf x) \mathbf u_i$
• so we express $A \mathbf x$ as a linear combination of $\{ \ \mathbf u_i \ \}$

Truncated SVD

Usual SVD:

• $A = U \Sigma V$
• $\sigma_i$ in $\text{diag}(\Sigma)$ are in non-increasing order
• so we can keep only first $k$ singular values of $\Sigma$ (and set the rest to 0) and get the best rank-$k$ approximation of $A$
• this is the best approximation in terms of Total Least Squares (see Reduced Rank Approximation)

In terms of sum of rank-1 matrices, we can approximate $A$ by

• $A_k = \sum_{i = 1}^k \sigma_i \mathbf u_i \mathbf v_i^T$

Properties & Questions

Column Space and Row Space

Given SVD $A V = U \Sigma$, why $U$ in is the column space of $A$ and $V$ is the row space?

• For all $i$: $A \mathbf v_i = \sigma_i \mathbf u_i$. Since there's a solution, then $\sigma_i \mathbf u_i \in C(A)$
• for all $i$: $A^T \mathbf u_i = \sigma_i \mathbf v_i$. Then $\sigma_i \mathbf u_i \in C(A^T)$ which is the row space of $A$

Applications

Dimensionality Reduction

• PCA is often implemented through SVD

Data Compression

It's like Discrete Fourier Transformation:

• in DFT we represent a data vector in orthogonal basis of sines and cosines
• often there are only a few principal frequencies that account for most variability in the data and the rest can be discarded
• SVD does the same, but it find the best orthogonal basis instead of using a predefined one
• so we can see SVD as adaptive generalization of DFT

Image Compression

• images can be represented as Matrices, so we can apply SVD and PCA to them
• • source: SVD at work  from 

Latent Semantic Analysis

• When used as a Dimensionality Reduction technique for Term-Document matrix
• it helps revealing some hidden semantic patterns

Linear Least Squares

As a technique for faster Normal Equation computation

Others

There are many other applications