* $n$ features, $\mathbf x_i = \big[x_{i1}, \ ... \ , x_{in} \big]^T \in \mathbb{R}^n$
* We can put all such $\mathbf x_i$ as rows of a matrix $X$ (sometimes called a ''design matrix'')
* <math>X = \begin{bmatrix}
- \ \mathbf x_1^T - \\
\vdots \\
- \ \mathbf x_m^T - \\
\end{bmatrix} = \begin{bmatrix}
x_{11} & \cdots & x_{1n} \\
& \ddots & \\
x_{m1} & \cdots & x_{mn} \\
\end{bmatrix}</math>
* the observed values: <math>\mathbf y = \begin{bmatrix}
y_1 \\ \vdots \\ y_m
\end{bmatrix} \in \mathbb{R}^{m}</math>
* Thus, we expressed our problem in the matrix form: $X \mathbf w = \mathbf y$
* Note that there's usually an additional feature $x_{i0} = 1$ - the intercept (bias) term,
** so <math>\mathbf x_i \in \mathbb{R}^{n+1}</math> and <math>X = \begin{bmatrix}
- \ \mathbf x_1^T - \\
- \ \mathbf x_2^T - \\
\vdots \\
- \ \mathbf x_m^T - \\
\end{bmatrix} = \begin{bmatrix}
x_{10} & x_{11} & \cdots & x_{1n} \\
x_{20} & x_{21} & \cdots & x_{2n} \\
& & \ddots & \\
x_{m0} & x_{m1} & \cdots & x_{mn} \\
\end{bmatrix} \in \mathbb R^{m \times (n + 1)}</math>
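As a rough illustration of this setup, here is a minimal NumPy sketch with made-up numbers (the variable names are only for illustration):

import numpy as np

# hypothetical data: m = 3 observations, n = 2 features
features = np.array([[2.0, 1.0],
                     [0.5, 3.0],
                     [1.5, 2.0]])
y = np.array([1.0, 2.0, 2.0])  # observed values, shape (m,)

# design matrix with the extra feature x_{i0} = 1 (the intercept column),
# shape (m, n + 1)
X = np.column_stack([np.ones(features.shape[0]), features])
print(X.shape)  # (3, 3)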
=== $\mathbb R^2$ Case ===
Suppose we have the following dataset:
* ${\cal D} = \big\{ (1,1), (2,2), (3,2) \big\}$
so we have this system:
* <math>\left\{\begin{array}{l}
x_0 + x_1 = 1\\
x_0 + 2 x_1 = 2\\
x_0 + 3 x_1 = 2\\
\end{array}\right.</math>
* the first column is always 1 because it multiplies our intercept term $x_0$, and $x_1$ is the slope
* the matrix form is <math>\begin{bmatrix}
1 & 1\\
1 & 2\\
1 & 3\\
\end{bmatrix} \begin{bmatrix}
x_0 \\ x_1
\end{bmatrix} =
\begin{bmatrix}
1 \\ 2 \\ 2
\end{bmatrix}</math>
* no single line goes through all three points at once
* so instead we solve the normal equation <math>A^T A \mathbf{\hat x} = A^T \mathbf b</math>
* <math>A^T A = \begin{bmatrix}
1 & 1 & 1 \\
1 & 2 & 3 \\
\end{bmatrix} \begin{bmatrix}
1 & 1\\
1 & 2\\
1 & 3\\
\end{bmatrix} = \begin{bmatrix}
3 & 6\\
6 & 14\\
\end{bmatrix}</math>
* this system is invertible, so we solve it and get $\hat x_0 = 2/3, \hat x_1 = 1/2$
* thus the best line is $y = x_0 + x_1 t = 2/3 + 1/2 t$
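Spelling out the solve of the $2 \times 2$ system above:
* <math>A^T \mathbf b = \begin{bmatrix} 1 & 1 & 1 \\ 1 & 2 & 3 \end{bmatrix} \begin{bmatrix} 1 \\ 2 \\ 2 \end{bmatrix} = \begin{bmatrix} 5 \\ 11 \end{bmatrix}, \qquad \begin{bmatrix} 3 & 6 \\ 6 & 14 \end{bmatrix} \begin{bmatrix} \hat x_0 \\ \hat x_1 \end{bmatrix} = \begin{bmatrix} 5 \\ 11 \end{bmatrix} \ \Rightarrow \ \hat x_1 = 1/2, \ \hat x_0 = 2/3</math>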
* we want to make the overall error as small as possible
* recall that $\mathbf e$ is our projection error - so we want to minimize it
* usually we minimize the square: $\min \| \mathbf e \|^2 = \min \big[ e_1^2 + e_2^2 + e_3^2 \big]$
* so we minimize this: $\| \mathbf e \|^2 = \| A \mathbf x - \mathbf b \|^2$
* we claim that the solution to $A^T A \mathbf{\hat x} = A^T \mathbf b$ minimizes $\| A \mathbf x - \mathbf b \|^2$
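One way to see why: expand the squared norm and set its gradient with respect to $\mathbf x$ to zero.
* <math>\| A \mathbf x - \mathbf b \|^2 = (A \mathbf x - \mathbf b)^T (A \mathbf x - \mathbf b) = \mathbf x^T A^T A \mathbf x - 2\, \mathbf b^T A \mathbf x + \mathbf b^T \mathbf b</math>
* <math>\nabla_{\mathbf x} \| A \mathbf x - \mathbf b \|^2 = 2 A^T A \mathbf x - 2 A^T \mathbf b = \mathbf 0 \ \Rightarrow \ A^T A \mathbf{\hat x} = A^T \mathbf b</math>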
Let's check if $\mathbf p \; \bot \; \mathbf e$
* <math>\mathbf{\hat x} = \begin{bmatrix} \hat x_0 \\ \hat x_1 \end{bmatrix} = \begin{bmatrix} 2/3 \\ 1/2 \end{bmatrix}</math>
* thus <math>\mathbf p = A \mathbf{ \hat x } = \begin{bmatrix} p_1 \\ p_2 \\ p_3 \end{bmatrix} = \begin{bmatrix} 7/6 \\ 5/3 \\ 13/6 \end{bmatrix}</math>
* $\mathbf p + \mathbf e = \mathbf b$, so $\mathbf e = \mathbf b - \mathbf p = \begin{bmatrix} 1 - 7/6 \\ 2 - 5/3 \\ 2 - 13/6 \end{bmatrix} = \begin{bmatrix} - 1/6 \\ 1/3 \\ -1/6 \end{bmatrix}$
* $\mathbf p \; \bot \; \mathbf e$ $\Rightarrow$ $\mathbf p^T \mathbf e = 0$.
* Check: <math>\begin{bmatrix} 7/6 & 5/3 & 13/6 \end{bmatrix} \begin{bmatrix} - 1/6 \\ 1/3 \\ -1/6 \end{bmatrix} = - 7/6 \cdot 1/6 + 5/3 \cdot 1/3 - 13/6 \cdot 1/6 = - \tfrac{7}{36} + \tfrac{20}{36} - \tfrac{13}{36} = 0</math>
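The error should be orthogonal not only to $\mathbf p$ but to the whole column space of $A$, so we can also check that $A^T \mathbf e = \mathbf 0$:
* <math>A^T \mathbf e = \begin{bmatrix} 1 & 1 & 1 \\ 1 & 2 & 3 \end{bmatrix} \begin{bmatrix} - 1/6 \\ 1/3 \\ -1/6 \end{bmatrix} = \begin{bmatrix} - \tfrac{1}{6} + \tfrac{1}{3} - \tfrac{1}{6} \\ - \tfrac{1}{6} + \tfrac{2}{3} - \tfrac{1}{2} \end{bmatrix} = \begin{bmatrix} 0 \\ 0 \end{bmatrix}</math>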
Let's apply SVD to $X$:
* $X = U \Sigma V^T$, with $\text{dim } X = \text{dim } \Sigma$
* <math>\begin{align}
X \mathbf w - \mathbf y & = U \Sigma V^T \mathbf w - \mathbf y \\
& = U \Sigma V^T \mathbf w - U U^T \mathbf y \\
& = U (\Sigma V^T \mathbf w - U^T \mathbf y) \\
\end{align}</math>
* here we used $\mathbf y = U U^T \mathbf y$, which holds because $U$ is square and orthogonal in the full SVD
* let $\mathbf v = V^T \mathbf w$ and $\mathbf z = U^T \mathbf y$
* then we have $U (\Sigma \mathbf v - \mathbf z)$
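A minimal NumPy sketch of solving the least-squares problem through the SVD (made-up data, assuming $X$ has full column rank; the thin SVD is used for convenience and gives the same weights):

import numpy as np

# hypothetical design matrix X (m x n) and targets y (m,)
X = np.array([[1.0, 1.0],
              [1.0, 2.0],
              [1.0, 3.0]])
y = np.array([1.0, 2.0, 2.0])

# thin SVD: U is m x n, s holds the singular values, Vt is n x n
U, s, Vt = np.linalg.svd(X, full_matrices=False)

z = U.T @ y   # z = U^T y
v = z / s     # v = Sigma^{-1} z
w = Vt.T @ v  # w = V v
print(w)      # same answer as the normal equation: [2/3, 1/2]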
$\require{cancel}$
Suppose we have
Thus we have a system
There's no solution to the system, so we try to fit the data as well as possible
So our problem is
In Linear algebra we typically use a different notation
Suppose we have a matrix $A$ with our observations
Normal Equation:
When does $A^T A$ have no inverse?
Consider this example:
$A^T A = \begin{bmatrix} 1 & 1 & 1 \\ 3 & 3 & 3 \end{bmatrix} \begin{bmatrix} 1 & 3 \\ 1 & 3 \\ 1 & 3 \end{bmatrix} = \begin{bmatrix} 3 & 9 \\ 9 & 27 \end{bmatrix}$
In this case $\text{rank}(A) = 1$ and $\text{rank}(A^T A) = 1$ so $A^T A$ is not invertible
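A quick NumPy check of this example (the matrix is the one above; `matrix_rank` is only used to confirm the rank deficiency):

import numpy as np

# the second column of A is 3 times the first, so the columns are dependent
A = np.array([[1, 3],
              [1, 3],
              [1, 3]])

print(A.T @ A)                         # [[ 3  9] [ 9 27]]
print(np.linalg.matrix_rank(A))        # 1
print(np.linalg.matrix_rank(A.T @ A))  # 1, so A^T A is singular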
When is it invertible?
$(A^T A)$ may not be invertible if
Is this indeed the best straight line through these points?
Let's check if $\mathbf p \; \bot \; \mathbf e$
We can also verify that $\mathbf e \; \bot \; C(A)$
import matplotlib.pylab as plt
import numpy as np


class Line:
    def __init__(self, slope, intercept):
        self.slope = slope
        self.intercept = intercept

    def calculate(self, x1):
        x2 = x1 * self.slope + self.intercept
        return x2


# the data from the example above: A is the design matrix, b the observations
A = np.array([[1, 1], [1, 2], [1, 3]])
b = np.array([1, 2, 2])

# solve the normal equation A^T A x = A^T b
x0, x1 = np.linalg.solve(A.T.dot(A), A.T.dot(b))
lsq = Line(x1, x0)

# figure: data points, fitted line, projections and error bars
plt.scatter(A[:, 1], b, marker='x', color='black')
points = np.array([0.5, 3.5])
plt.plot(points, lsq.calculate(points))
plt.scatter(A[:, 1], lsq.calculate(A[:, 1]), marker='o', color='red')
plt.vlines(A[:, 1], b, lsq.calculate(A[:, 1]))
plt.show()

# check that the projection p and the error e are orthogonal
x = np.array([[x0], [x1]])
p = A.dot(x).reshape(-1)
e = b - p
print(p.dot(e))  # should be (numerically) zero
Normal Equation:
How to speed up computation of $(X^T X)^{-1}$?
Let's apply SVD to $X$:
Orthogonal Matrices preserve the $L_2$-norm
So we reduced the OLS Regression problem to a diagonal form
Minimization $\| \Sigma \mathbf v - \mathbf z\|$:
Summary:
Compact solution:
We find $\mathbf w$ by calculating $\mathbf w = (X^T X + \lambda E^*)^{-1} \cdot X^T \cdot y$
This is called Ridge Regression
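A minimal NumPy sketch of this closed-form Ridge solution (made-up data; it takes $E^*$ to be the plain identity matrix, so the intercept is regularized too, which may differ from the convention intended by $E^*$ here):

import numpy as np

def ridge_weights(X, y, lam):
    # w = (X^T X + lam * E)^{-1} X^T y, with E taken to be the identity matrix
    n = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(n), X.T @ y)

# tiny illustrative example
X = np.array([[1.0, 1.0],
              [1.0, 2.0],
              [1.0, 3.0]])
y = np.array([1.0, 2.0, 2.0])
print(ridge_weights(X, y, lam=0.1))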
Implementation in Octave
pinv(X' * X) * X' * y
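For reference, a NumPy equivalent of this Octave one-liner (illustrative; `X` and `y` stand for the design matrix and targets used above):

import numpy as np

X = np.array([[1.0, 1.0],
              [1.0, 2.0],
              [1.0, 3.0]])
y = np.array([1.0, 2.0, 2.0])

# same computation as the Octave pinv(X' * X) * X' * y
w = np.linalg.pinv(X.T @ X) @ X.T @ y
print(w)

In practice, `np.linalg.lstsq(X, y, rcond=None)` solves the same least-squares problem without explicitly forming $X^T X$.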