== Power Iteration ==
Power Iteration is a [[Linear Algebra]] method for approximating the dominant [[Eigenvalues and Eigenvectors]] of a matrix
  
 
== Symmetric Matrices ==
Suppose $A$ is symmetric
* then the [[Eigendecomposition]] of $A$ is $A = Q \Lambda Q^T$
* and $A^k = Q \Lambda^k Q^T$
* let $\mathbf q_i$ be the columns of $Q$
  
Dominant eigenvalues and eigenvectors
* $\lambda_1$ is the dominant eigenvalue of $A$ if $|\lambda_1| > |\lambda_i|$ for all $i = 2, ...$
* The corresponding eigenvector $\mathbf q_1$ is also called dominant

=== The Power Method ===
Algorithm
* initial approximation - a random unit vector $x_0$
* $x_1 = A x_0$
* $x_2 = A A x_0 = A^2 x_0$
* $x_3 = A A A x_0 = A^3 x_0$
* ...
* repeat until convergence

For large $k$, we obtain a good approximation of the dominant eigenvector


Finding the eigenvalue:
* if $v$ is an eigenvector of $A$, then its eigenvalue is
* $\lambda = \cfrac{v^T A^T v}{v^T v} = \cfrac{(A v)^T v}{v^T v}$
* This is called the [[Rayleigh Quotient]] (checked numerically below)
* proof: suppose $\lambda$ is an eigenvalue of $A$ and $v$ is its eigenvector
* $\cfrac{(A v)^T v}{v^T v} = \cfrac{\lambda \, v^T v}{v^T v} = \lambda$
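
For a quick numerical check (my own example, not part of the original notes), we can evaluate the Rayleigh quotient for a known eigenvector of a small symmetric matrix:

<pre>
import numpy as np

# small symmetric matrix with eigenvalues 3 and 1
A = np.array([[2.0, 1.0],
              [1.0, 2.0]])
# unit eigenvector for the eigenvalue 3
v = np.array([1.0, 1.0]) / np.sqrt(2)

rayleigh = A.dot(v).dot(v) / v.dot(v)
print(rayleigh)  # 3.0 (up to floating point error)
</pre>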
  
  
 
Performance:
* computing $A \cdot A^{k-1}$ is costly
* instead, go in the other direction:
* use the recurrence $x_{k} = A x_{k-1}$

If the power method generates a good approximation of the eigenvector, the approximation of the eigenvalue is also good


=== Normalization ===
The values produced by the power method can grow quite large:
* it is better to scale them down
* we should rescale the values at each iteration
* for example, we can find the largest component of $x_k$ and divide by it
* more commonly, we just unit-normalize it: $x_k \leftarrow x_k / \| x_k \|$


=== Implementation ===
Python with NumPy:

<pre>
import numpy as np

def eigenvalue(A, v):
    # Rayleigh quotient for a unit vector v
    Av = A.dot(v)
    return v.dot(Av)

def power_iteration(A):
    n, d = A.shape

    # start with an arbitrary unit vector
    v = np.ones(d) / np.sqrt(d)
    ev = eigenvalue(A, v)

    while True:
        Av = A.dot(v)
        v_new = Av / np.linalg.norm(Av)

        ev_new = eigenvalue(A, v_new)
        if np.abs(ev - ev_new) < 0.01:
            break

        v = v_new
        ev = ev_new

    return ev_new, v_new
</pre>

=== Convergence of the Power Method ===
Theorem:
* if $A$ is diagonalizable and has a dominant eigenvalue,
* then the power iteration sequence $Ax, A^2 x, A^3 x, ...$ converges to the dominant eigenvector (up to scaling)

Proof:
* since $A$ is diagonalizable, it has $n$ linearly independent eigenvectors $v_1, ..., v_n$ and corresponding eigenvalues $\lambda_1, ..., \lambda_n$
* let $v_1$ and $\lambda_1$ be dominant
* because the eigenvectors are independent, they form a basis
** so any vector $x_0$ can be represented as $x_0 = c_1 v_1 + ... + c_n v_n$
** and $c_1$ must be non-zero - otherwise $x_0$ is not a good choice, and we need to choose another initial vector
* let's multiply both sides by $A$:
** $A x_0 = c_1 A v_1 + ... + c_n A v_n = c_1 \lambda_1 v_1 + ... + c_n \lambda_n v_n$
* repeat that $k$ times:
** $A^k x_0 = c_1 \lambda_1^k v_1 + ... + c_n \lambda_n^k v_n$
** or $A^k x_0 =  \lambda_1^k \left[ c_1 v_1 + c_2 \left(\cfrac{\lambda_2}{\lambda_1} \right)^k v_2 + ... + c_n \left(\cfrac{\lambda_n}{\lambda_1} \right)^k v_n \right]$
* since $\lambda_1$ is dominant, the ratios $\left(\cfrac{\lambda_i}{\lambda_1} \right)^k \to 0$ as $k \to \infty$ for all $i \geqslant 2$
* so $A^k x_0 \approx \lambda_1^k c_1 v_1$, and the approximation gets better as $k$ grows

$\square$


From the proof we can also see that if the ratio $|\lambda_2 / \lambda_1|$ is small, the method converges quickly, but if the ratio is close to 1, it takes many iterations (see the check below)

Examples:
* <code>A = [4 5; 6 5]</code> - needs only 6 steps, the ratio is 0.1
* <code>A = [-4 10; 7 5]</code> - needs 68 steps, the ratio is 0.9
  
  
 
== Finding Other Eigenvectors ==

=== Naive Method ===
We can just remove the dominant direction from the matrix and repeat (see the sketch below)

So:
* $A = Q \Lambda Q^T$, so $A = \sum_{i = 1}^n \lambda_i \mathbf q_i \mathbf q_i^T$
* use power iteration to find $\mathbf q_1$ and $\lambda_1$
* then let $A_2 \leftarrow A - \lambda_1 \mathbf q_1 \mathbf q_1^T$
* repeat power iteration on $A_2$ to find $\mathbf q_2$ and $\lambda_2$
* continue like this for $\lambda_3, \ ... \ , \lambda_n$
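
A minimal sketch of this deflation scheme (my own illustration; it assumes $A$ is symmetric and reuses the <code>power_iteration</code> function from the implementation above):

<pre>
import numpy as np

def deflation(A, k):
    # finds the k largest-magnitude eigenpairs of a symmetric A by
    # repeatedly removing the dominant direction
    A_i = A.copy().astype(float)
    values, vectors = [], []
    for _ in range(k):
        ev, v = power_iteration(A_i)
        values.append(ev)
        vectors.append(v)
        # remove the direction that was just found: A_{i+1} = A_i - ev * v v^T
        A_i = A_i - ev * np.outer(v, v)
    return values, vectors
</pre>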
  
  
=== Inverse Iteration ===
Inverse iteration is used to find the smallest (in magnitude) eigenvalue: it is the power method applied to $A^{-1}$, whose dominant eigenvalue is $1 / \lambda_{\min}$ (a rough sketch is given below)
* see http://ranger.uta.edu/~huber/cse4345/Notes/Eigenvalues.pdf
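
A rough sketch of inverse iteration (my own illustration, not taken from the linked notes); instead of forming $A^{-1}$ explicitly, every step solves a linear system:

<pre>
import numpy as np

def inverse_iteration(A, tol=1e-8, max_iter=1000):
    # power iteration applied to A^{-1}: converges to the eigenvector
    # of the eigenvalue of A that is smallest in magnitude
    n = A.shape[0]
    v = np.ones(n) / np.sqrt(n)
    ev = v.dot(A.dot(v))
    for _ in range(max_iter):
        w = np.linalg.solve(A, v)   # same as multiplying by A^{-1}, but cheaper and more stable
        v = w / np.linalg.norm(w)
        ev_new = v.dot(A.dot(v))    # Rayleigh quotient with the original A
        if abs(ev - ev_new) < tol:
            break
        ev = ev_new
    return ev_new, v
</pre>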
  
=== Shifts ===
We can focus on any other eigenvalue by shifting our matrix
* ''Shifting'' is the procedure of adding some value $s$ to the main diagonal: $A + s I$
* let $\mu$ be the smallest (in magnitude) eigenvalue of the shifted matrix $A + s I$ (found e.g. with inverse iteration)
** i.e. $(A + s I) \mathbf x = \mu \mathbf x$
** or $(A + s I) \mathbf x - \mu \mathbf x = (A + s I - \mu I) \mathbf x = 0$
** so the matrix $A + s I - \mu I$ is singular
* since $A + s I - \mu I$ is singular, $\lambda = \mu - s$ is an eigenvalue of $A$, and it's the eigenvalue closest to $-s$
* So shifting allows us to focus on the eigenvalues closest to any particular value
  
Eigenvectors are not affected by shifts:
* suppose $A \mathbf x = \lambda \mathbf x$
* let's shift by $m$: $(A + m I) \mathbf x = A \mathbf x + m \mathbf x = \lambda \mathbf x + m \mathbf x = (\lambda + m) \mathbf x$
* for the shifted $A$, the eigenvalue is $\lambda + m$, and the eigenvector stays the same (see the small check below)
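
A small check of this fact (my own example): shifting by $m I$ moves every eigenvalue by $m$ but leaves the eigenvectors in place:

<pre>
import numpy as np

A = np.array([[2.0, 1.0],
              [1.0, 2.0]])
m = 5.0

vals, vecs = np.linalg.eigh(A)                          # eigenvalues in ascending order
vals_shifted, vecs_shifted = np.linalg.eigh(A + m * np.eye(2))

print(vals, vals_shifted)                               # [1. 3.] and [6. 8.]
print(np.allclose(np.abs(vecs), np.abs(vecs_shifted)))  # True: same eigenvectors (up to sign)
</pre>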
  
=== Orthogonal Iteration ===
Idea:
* we know that the other eigenvectors are orthogonal to the dominant one
* so we can use the power method, and force the second vector to be orthogonal to the first one
* this way we guarantee that they converge to two different eigenvectors
* we can do this for many vectors, not just two
* this is called "Orthogonal Iteration"

Orthogonal Iteration
* Orthogonal Iteration is a block version of the Power Method, also sometimes called "Simultaneous (Power) Iteration"
* Instead of multiplying $A$ by just one vector, we multiply by multiple vectors $\{ q_1, ..., q_r \}$, which we put in a matrix $Q$.
* At each step we re-normalize the vectors, e.g. with a [[QR Decomposition]]
  
  
Algorithm:
* choose $Q_0$ such that $Q_0^T Q_0 = I$
* for $k = 1, 2, ...$:
** $Z_k = A Q_{k-1}$
** $Q_k R_k = Z_k$ (QR decomposition)


Let's look at the sequence $\{ R_k \}$:
* $R_k = Q_k^T Z_k = Q_k^T A Q_{k - 1}$
* if $\{ Q_k \}$ converges to some $Q$, then $Q^T A Q = R$ is upper triangular
* This is a [[Schur Decomposition]] of $A$
* Thus, the eigenvalues of $A$ are located on the main diagonal of $R$
* And the columns of $Q$ are the eigenvectors (for symmetric $A$; in general they are the Schur vectors)
  
  
=== Implementation ===
Python with NumPy:

<pre>
import numpy as np

def simultaneous_power_iteration(A, k):
    n, m = A.shape
    # start with k random orthonormal vectors
    Q = np.random.rand(n, k)
    Q, _ = np.linalg.qr(Q)
    Q_prev = Q

    for i in range(1000):
        Z = A.dot(Q)
        Q, R = np.linalg.qr(Z)

        # can use other stopping criteria as well
        err = ((Q - Q_prev) ** 2).sum()
        if i % 10 == 0:
            print(i, err)

        Q_prev = Q
        if err < 1e-3:
            break

    return np.diag(R), Q
</pre>
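
A short usage sketch (my own example), reusing the function above on a small symmetric matrix whose two largest eigenvalues are $2 + \sqrt{2} \approx 3.41$ and $2$:

<pre>
A = np.array([[2.0, 1.0, 0.0],
              [1.0, 2.0, 1.0],
              [0.0, 1.0, 2.0]])

vals, Q = simultaneous_power_iteration(A, 2)
print(vals)   # approximately [3.41, 2.]
</pre>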

== Algorithms based on Power Iteration ==

=== [[QR Algorithm]] ===
Simultaneous Iteration typically takes a while to converge
* Another, better way to do it is the QR algorithm
* It's based on the Simultaneous Iteration algorithm (a minimal sketch is given below)
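
For reference, a minimal sketch of the unshifted QR algorithm (my own illustration; practical versions first reduce $A$ to Hessenberg form and add shifts to speed up convergence):

<pre>
import numpy as np

def qr_algorithm(A, iters=500):
    # A_{k+1} = R_k Q_k, where A_k = Q_k R_k is a QR decomposition;
    # every A_k is similar to A, and for matrices with real eigenvalues
    # the sequence approaches an upper triangular (Schur) form
    Ak = np.array(A, dtype=float)
    Q_total = np.eye(Ak.shape[0])
    for _ in range(iters):
        Q, R = np.linalg.qr(Ak)
        Ak = R.dot(Q)
        Q_total = Q_total.dot(Q)   # accumulated orthogonal transformation
    return np.diag(Ak), Q_total    # eigenvalue estimates and accumulated Q
</pre>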

=== [[Lanczos Algorithm]] ===
The Lanczos algorithm is used for the tridiagonalization of a symmetric matrix $A$
* It's based on the power iteration algorithm
* And can be used to find eigenvalues and eigenvectors (see the sketch below)
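
A rough sketch of the Lanczos iteration for a symmetric $A$ (my own illustration; real implementations need re-orthogonalization, because the computed vectors quickly lose orthogonality in floating point):

<pre>
import numpy as np

def lanczos(A, m):
    # builds an orthonormal basis q_1, ..., q_m and a tridiagonal T with
    # Q^T A Q = T; the eigenvalues of T approximate (extreme) eigenvalues of A
    n = A.shape[0]
    Q = np.zeros((n, m))
    alpha = np.zeros(m)       # diagonal of T
    beta = np.zeros(m - 1)    # off-diagonal of T

    q = np.random.rand(n)
    Q[:, 0] = q / np.linalg.norm(q)

    for j in range(m):
        w = A.dot(Q[:, j])
        alpha[j] = Q[:, j].dot(w)
        w -= alpha[j] * Q[:, j]
        if j > 0:
            w -= beta[j - 1] * Q[:, j - 1]
        if j < m - 1:
            beta[j] = np.linalg.norm(w)
            Q[:, j + 1] = w / beta[j]

    T = np.diag(alpha) + np.diag(beta, 1) + np.diag(beta, -1)
    return T, Q
</pre>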

=== Applications ===
* [[Principal Component Analysis]]
* [[Stochastic Matrices]] and [[PageRank]]
  
  
 
== Sources ==
* [[Machine Learning 1 (TUB)]]
* Hopcroft, John, and Ravindran Kannan. "Foundations of Data Science." (2014) [http://research.microsoft.com/en-US/people/kannan/book-mar-30-2014.pdf]
* https://www2.math.ethz.ch/education/bachelor/seminars/fs2008/nas/grasic.pdf
* http://college.cengage.com/mathematics/larson/elementary_linear/5e/students/ch08-10/chap_10_3.pdf
* http://www.math.usm.edu/lambers/mat610/sum10/lecture14.pdf

[[Category:Linear Algebra]]
[[Category:Python]]
[[Category:Machine Learning]]
