* $n$ features, $\mathbf x_i = \big[x_{i1}, \ ... \ , x_{in} \big]^T \in \mathbb{R}^n$
* We can put all such $\mathbf x_i$ as rows of a matrix $X$ (sometimes called a ''design matrix'')
* <math>X = \begin{bmatrix}
- \ \mathbf x_1^T - \\
\vdots \\
- \ \mathbf x_m^T - \\
\end{bmatrix} = \begin{bmatrix}
x_{11} & \cdots & x_{1n} \\
& \ddots & \\
x_{m1} & \cdots & x_{mn} \\
\end{bmatrix}</math>
* the observed values: <math>\mathbf y = \begin{bmatrix}
y_1 \\ \vdots \\ y_m
\end{bmatrix} \in \mathbb{R}^{m}</math>
* Thus, we expressed our problem in the matrix form: $X \mathbf w = \mathbf y$
* Note that there's usually an additional feature $x_{i0} = 1$ - the intercept (bias) term,
** so $\mathbf x_i \in \mathbb{R}^{n+1}$ and <math>X = \begin{bmatrix}
- \ \mathbf x_1^T - \\
- \ \mathbf x_2^T - \\
\vdots \\
- \ \mathbf x_m^T - \\
\end{bmatrix} = \begin{bmatrix}
x_{10} & x_{11} & \cdots & x_{1n} \\
x_{20} & x_{21} & \cdots & x_{2n} \\
& & \ddots & \\
x_{m0} & x_{m1} & \cdots & x_{mn} \\
\end{bmatrix} \in \mathbb R^{m \times (n + 1)}</math>
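To make the construction concrete, here is a minimal NumPy sketch (the variable names and toy values are illustrative assumptions, not from the article) that stacks the $\mathbf x_i^T$ as rows and prepends the constant feature $x_{i0} = 1$:

<pre>
import numpy as np

# m = 3 observations with n = 2 features each (rows are the x_i^T)
X_raw = np.array([[2.0, 3.0],
                  [1.0, 5.0],
                  [4.0, 2.0]])

# prepend the constant feature x_{i0} = 1, giving an m x (n + 1) design matrix
X = np.hstack([np.ones((X_raw.shape[0], 1)), X_raw])
print(X.shape)  # (3, 3)
</pre>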
Suppose we have the following dataset:
* ${\cal D} = \{ (1,1), (2,2), (3,2) \}$
* the matrix form is <math>\begin{bmatrix}
1 & 1 \\
1 & 2 \\
1 & 3 \\
\end{bmatrix} \begin{bmatrix}
w_0 \\ w_1
\end{bmatrix} = \begin{bmatrix}
1 \\ 2 \\ 2
\end{bmatrix}</math>
* no single line goes through all three points at once
* so we solve the Normal Equation $X^T X \mathbf{\hat w} = X^T \mathbf y$
* <math>X^T X = \begin{bmatrix}
1 & 1 & 1 \\
1 & 2 & 3 \\
\end{bmatrix} \begin{bmatrix}
1 & 1 \\
1 & 2 \\
1 & 3 \\
\end{bmatrix} = \begin{bmatrix}
3 & 6 \\
6 & 14 \\
\end{bmatrix}</math> and <math>X^T \mathbf y = \begin{bmatrix}
1 & 1 & 1 \\
1 & 2 & 3 \\
\end{bmatrix} \begin{bmatrix}
1 \\ 2 \\ 2
\end{bmatrix} = \begin{bmatrix}
5 \\ 11
\end{bmatrix}</math>
* the matrix $X^T X$ is invertible, so we solve the system and get $\hat w_0 = 2/3, \hat w_1 = 1/2$ (subtracting twice the first equation from the second gives $2 \hat w_1 = 1$)
* thus the best line is $h(t) = \hat w_0 + \hat w_1 t = 2/3 + 1/2 \, t$
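As a quick numerical check (a sketch in NumPy, not part of the original article), solving the Normal Equation for this dataset reproduces the same values:

<pre>
import numpy as np

# design matrix with the constant feature, and the observed values
X = np.array([[1.0, 1.0],
              [1.0, 2.0],
              [1.0, 3.0]])
y = np.array([1.0, 2.0, 2.0])

# solve X^T X w = X^T y
w_hat = np.linalg.solve(X.T @ X, X.T @ y)
print(w_hat)  # [0.66666667 0.5], i.e. w0 = 2/3, w1 = 1/2
</pre>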
Alternatively, instead of solving the Normal Equation $X^T X \mathbf{\hat w} = X^T \mathbf y$ directly, we can minimize the sum of squared errors with Gradient Descent.
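A rough sketch of that alternative (the step size and iteration count are arbitrary assumptions, not values from the article): batch gradient descent on $\| X \mathbf w - \mathbf y \|^2$ reaches the same $\mathbf{\hat w}$ on this dataset:

<pre>
import numpy as np

X = np.array([[1.0, 1.0],
              [1.0, 2.0],
              [1.0, 3.0]])
y = np.array([1.0, 2.0, 2.0])

w = np.zeros(2)   # start from w = 0
lr = 0.05         # step size (arbitrary; small enough to converge here)
for _ in range(10000):
    grad = 2 * X.T @ (X @ w - y)  # gradient of ||Xw - y||^2
    w -= lr * grad

print(w)  # converges to [0.66666667 0.5]
</pre>

Unlike solving the Normal Equation, this route never forms or inverts $X^T X$, which is what makes it attractive when the number of features is large.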