# ML Wiki

Suppose we have a cost function $J$ and want to minimize it

• say it takes 2 parameters $\theta_0$ and $\theta_1$
• So we have $J(\theta_0, \theta_1)$ and want to find $\min_{\theta_0, \theta_1} J(\theta_0, \theta_1)$

Idea:

• start with some $(\theta_0, \theta_1)$ (say, $(0,0)$)
• keep changing $(\theta_0, \theta_1)$ to reduce $J(\theta_0, \theta_1)$ until we end up at a minimum

In pseudocode:

• repeat until convergence
• for $j = 0$ and $j = 1$:
• $\theta_j = \theta_j - \alpha \cfrac{\partial}{\partial \theta_j} J(\theta_0, \theta_1)$

$\alpha$ is the learning rate, a value that controls how big the steps we take are.
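Here is a minimal Python sketch of the same loop, assuming a hypothetical cost function `J` and a hypothetical `grad_J` that returns the vector of partial derivatives (both names are placeholders for illustration):

```python
import numpy as np

def gradient_descent(J, grad_J, theta_init, alpha=0.01, tol=1e-6, max_iters=10_000):
    """Repeat theta_j := theta_j - alpha * dJ/dtheta_j until J stops decreasing."""
    theta = np.asarray(theta_init, dtype=float)
    for _ in range(max_iters):
        theta_new = theta - alpha * grad_J(theta)   # every component uses the old theta
        if abs(J(theta) - J(theta_new)) < tol:      # "repeat until convergence"
            return theta_new
        theta = theta_new
    return theta
```

Because `theta - alpha * grad_J(theta)` computes the whole new vector from the old one, all parameters are updated at once, which is exactly the simultaneous update described next.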

### Simultaneous Update

Note that the update for $(\theta_0, \theta_1)$ has to be simultaneous. That is

• $\tau_0 = \theta_0 - \alpha \cfrac{\partial}{\partial \theta_0} J(\theta_0, \theta_1)$
• $\tau_1 = \theta_1 - \alpha \cfrac{\partial}{\partial \theta_1} J(\theta_0, \theta_1)$
• $\theta_0 = \tau_0$
• $\theta_1 = \tau_1$

As you can see, the old value of $\theta_0$ is used to calculate the new value of $\theta_1$, so we must not overwrite $\theta_0$ until both new values have been computed.

This is called a simultaneous update.
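A small Python sketch of one step with explicit temporaries, assuming two hypothetical functions `dJ_d0` and `dJ_d1` for the partial derivatives of $J$:

```python
def step_simultaneous(theta0, theta1, alpha, dJ_d0, dJ_d1):
    """One gradient descent step with a simultaneous update."""
    # Both derivatives are evaluated at the OLD (theta0, theta1)...
    tau0 = theta0 - alpha * dJ_d0(theta0, theta1)
    tau1 = theta1 - alpha * dJ_d1(theta0, theta1)
    # ...and only then are the parameters overwritten.
    return tau0, tau1

# A non-simultaneous (incorrect) version would overwrite theta0 first:
#   theta0 = theta0 - alpha * dJ_d0(theta0, theta1)
#   theta1 = theta1 - alpha * dJ_d1(theta0, theta1)  # <- already sees the new theta0
```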

## Intuition

Let's see how it works.

• assume there's only one parameter $\theta_1$, with $\theta_1 \in \mathbb{R}$

$\theta_1 = \theta_1 - \alpha \cfrac{d}{d \theta_1} J(\theta_1)$

let's have a look at the derivative term:

• $\beta = \alpha \cfrac{d}{d \theta_1} J(\theta_1)$

if the derivative is positive, then $\beta > 0$

• we're moving left:
$\theta_1 = \theta_1 - \beta < \theta_1$

if the derivative is negative, then $\beta < 0$

• we're moving right:
$\theta_1 = \theta_1 - \beta > \theta_1$
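As a concrete (made-up) example, take $J(\theta_1) = \theta_1^2$, so $\cfrac{d}{d \theta_1} J(\theta_1) = 2 \theta_1$, and let $\alpha = 0.1$:

• at $\theta_1 = 3$: $\beta = 0.1 \cdot 6 = 0.6$, so $\theta_1 \leftarrow 3 - 0.6 = 2.4$ (we moved left, towards the minimum at $0$)
• at $\theta_1 = -3$: $\beta = 0.1 \cdot (-6) = -0.6$, so $\theta_1 \leftarrow -3 - (-0.6) = -2.4$ (we moved right, towards the minimum at $0$)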

For two variables the cost function is a surface over the $(\theta_0, \theta_1)$ plane, and gradient descent walks downhill on that surface.

### Learning Rate

• when $\alpha$ is too small, we take very small steps and converge too slowly
• when $\alpha$ is too large, we take steps that are too big and may overshoot the minimum;
in this case gradient descent may not only fail to converge, it may even diverge! (see the sketch below)

Approaching the minimum

• As we approach a local minimum, gradient descent takes smaller and smaller steps, because the derivative itself gets smaller, even with a fixed $\alpha$
• If $\theta_1$ is already at the local minimum, then $\beta = 0$ and $\theta_1$ won't change
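Both effects are easy to see on the same made-up cost $J(\theta_1) = \theta_1^2$ (values chosen purely for illustration):

```python
def run(alpha, theta=3.0, iters=10):
    """Gradient descent on J(theta) = theta**2, whose derivative is 2 * theta."""
    for _ in range(iters):
        theta = theta - alpha * 2 * theta
    return theta

print(run(alpha=0.001))  # too small: after 10 steps theta has barely moved from 3.0
print(run(alpha=0.1))    # reasonable: theta shrinks steadily towards the minimum at 0
print(run(alpha=1.5))    # too large: |theta| doubles every step -- divergence
```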

### Convex Function

If the cost function $J$ is convex, every local minimum is the global minimum, so gradient descent cannot get stuck in a bad local minimum.

## Univariate Linear Regression

• Given an input data set $\{(x^{(i)}, y^{(i)})\}$ of size $m$
• We have our hypothesis $h_{\theta}(x) = \theta_0 + \theta_1 x$
• How do we choose $\theta_0$ and $\theta_1$ so that $h_{\theta}(x)$ fits the input data best?

We need to minimize the cost function:

• $J(\theta_0, \theta_1) = \cfrac{1}{2 m} \sum_{i = 1}^{m} (h_{\theta}(x^{(i)}) - y^{(i)} )^2$
• This is the squared error cost function
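A minimal NumPy sketch of this cost (the function and variable names are mine, for illustration):

```python
import numpy as np

def cost(theta0, theta1, x, y):
    """Squared error cost J(theta0, theta1) = 1/(2m) * sum((h(x_i) - y_i)**2)."""
    m = len(y)
    h = theta0 + theta1 * x            # hypothesis h(x) for every training example
    return np.sum((h - y) ** 2) / (2 * m)
```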

Let's simplify our expression:

$\cfrac{\partial}{\partial \theta_j} J(\theta_0, \theta_1) = \cfrac{\partial}{\partial \theta_j} \cfrac{1}{2m} \sum (h_{\theta}(x^{(i)}) - y^{(i)} )^2 = \cfrac{\partial}{\partial \theta_j} \cfrac{1}{2m} \sum (\theta_0 + \theta_1 x^{(i)} - y^{(i)} )^2$

Now we calculate the derivatives and have:

• for $\theta_0$:
$\cfrac{\partial}{\partial \theta_0} J(\theta_0, \theta_1) = \cfrac{1}{m} \sum (h_{\theta} (x^{(i)}) - y^{(i)})$
• for $\theta_1$:
$\cfrac{\partial}{\partial \theta_1} J(\theta_0, \theta_1) = \cfrac{1}{m} \sum (h_{\theta} (x^{(i)}) - y^{(i)}) \cdot x^{(i)}$
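For instance, for $\theta_1$ the chain rule gives (the factor $2$ from the square cancels the $\frac{1}{2}$ in $\frac{1}{2m}$):

$\cfrac{\partial}{\partial \theta_1} \cfrac{1}{2m} \sum (\theta_0 + \theta_1 x^{(i)} - y^{(i)})^2 = \cfrac{1}{2m} \sum 2 \, (\theta_0 + \theta_1 x^{(i)} - y^{(i)}) \cdot x^{(i)} = \cfrac{1}{m} \sum (h_{\theta}(x^{(i)}) - y^{(i)}) \cdot x^{(i)}$

For $\theta_0$ the inner derivative is $1$ instead of $x^{(i)}$, which gives the first formula.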

So for linear regression the algorithm is (a runnable sketch follows below):

• repeat until convergence
• $\theta_0 = \theta_0 - \alpha \cfrac{1}{m} \sum (h_{\theta} (x^{(i)}) - y^{(i)})$
• $\theta_1 = \theta_1 - \alpha \cfrac{1}{m} \sum (h_{\theta} (x^{(i)}) - y^{(i)}) \cdot x^{(i)}$
• (update simultaneously)

The squared error cost function is convex, so (with a suitable $\alpha$) gradient descent always converges to the global minimum.
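Putting the pieces together, a runnable sketch of univariate linear regression trained with gradient descent (the tiny synthetic data set at the bottom is only for illustration):

```python
import numpy as np

def fit(x, y, alpha=0.1, iters=1000):
    """Fit h(x) = theta0 + theta1 * x by gradient descent on the squared error cost."""
    m = len(y)
    theta0, theta1 = 0.0, 0.0                  # start at (0, 0)
    for _ in range(iters):
        err = (theta0 + theta1 * x) - y        # h(x^(i)) - y^(i) for all i
        grad0 = np.sum(err) / m                # dJ/dtheta0
        grad1 = np.sum(err * x) / m            # dJ/dtheta1
        # simultaneous update: both gradients were computed from the old thetas
        theta0, theta1 = theta0 - alpha * grad0, theta1 - alpha * grad1
    return theta0, theta1

x = np.array([0.0, 1.0, 2.0, 3.0])
y = 1.0 + 2.0 * x                              # data generated with theta0=1, theta1=2
print(fit(x, y))                               # approaches (1.0, 2.0)
```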

## Gradient Descent for Multivariate Linear Regression

For Multivariate Linear Regression we have $x^{(i)} \in \mathbb{R}^{n + 1}$ and $\theta \in \mathbb{R}^{n+1}$, where

• $n$ is the number of features
• $m$ is the number of training examples
• and $x_0^{(i)} = 1$ for all $i$ (the intercept term)

So our cost function takes the following form:

$J(\theta) = J(\theta_0, ... , \theta_n) = \cfrac{1}{2m} \sum_{i = 1}^{m} (h_{\theta} (x^{(i)}) - y^{(i)} )^2$
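With the convention $x_0^{(i)} = 1$, this cost is compact to compute in NumPy; the sketch below assumes `X` is the $m \times (n+1)$ design matrix whose first column is all ones (a layout chosen here for illustration):

```python
import numpy as np

def cost(theta, X, y):
    """J(theta) = 1/(2m) * sum((h_theta(x^(i)) - y^(i))**2), with h_theta(x) = X @ theta."""
    m = len(y)
    return np.sum((X @ theta - y) ** 2) / (2 * m)
```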

The algorithm:

• repeat
• simultaneously for all $j = 0, \dots, n$
• $\theta_j = \theta_j - \alpha \cfrac{\partial}{\partial \theta_j} J(\theta)$

or, having calculated the derivatives (a vectorized sketch follows the list):

• repeat
• simultaneously for all $j = 0, \dots, n$
• $\theta_j = \theta_j - \alpha \cfrac{1}{m} \sum (h_{\theta}(x^{(i)}) - y^{(i)} ) \cdot x_j^{(i)}$
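The same update, vectorized over all $j$ at once (again assuming the $m \times (n+1)$ design matrix `X` with a leading column of ones):

```python
import numpy as np

def gradient_descent(X, y, alpha=0.01, iters=1000):
    """Multivariate linear regression trained with (batch) gradient descent."""
    m, n_plus_1 = X.shape
    theta = np.zeros(n_plus_1)
    for _ in range(iters):
        err = X @ theta - y                        # h_theta(x^(i)) - y^(i), shape (m,)
        theta = theta - alpha * (X.T @ err) / m    # updates every theta_j simultaneously
    return theta
```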

### Feature Scaling

Use Feature Scaling to bring all features into a similar range; this helps gradient descent converge faster.
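One common way to do this is to subtract each feature's mean and divide by its standard deviation; a sketch (column 0 is skipped because it is the all-ones intercept column in the layout assumed above):

```python
import numpy as np

def scale_features(X):
    """Mean-normalize every feature column of the design matrix X."""
    X = X.astype(float).copy()
    mu = X[:, 1:].mean(axis=0)
    sigma = X[:, 1:].std(axis=0)
    X[:, 1:] = (X[:, 1:] - mu) / sigma
    return X, mu, sigma      # keep mu and sigma to scale new inputs the same way
```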

### Learning Rate

• How do we choose $\alpha$?
• If gradient descent works properly, the cost function $J$ should decrease after every iteration
• if $J$ decreases by less than some small $\epsilon$ (e.g. $10^{-3}$) in one iteration, declare convergence
• If $J$ is increasing instead, we need to make $\alpha$ smaller

Choosing $\alpha$:

• But don't choose $\alpha$ too small either, or it'll take too long to converge
• To choose $\alpha$, try values such as
$\dots, 0.001, 0.003, 0.01, 0.03, 0.1, \dots$ (i.e. increase it roughly 3-fold each time)
• and see which one gives the fastest stable decrease in $J$ (see the sketch below)
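A sketch of that search, reusing the hypothetical `gradient_descent` and `cost` functions from the multivariate sketches above; in practice you would also plot $J$ against the iteration number for each candidate:

```python
import numpy as np

# tiny made-up data set: m=4 examples, one feature plus the intercept column
X = np.array([[1.0, 0.0], [1.0, 1.0], [1.0, 2.0], [1.0, 3.0]])
y = np.array([1.0, 3.0, 5.0, 7.0])

for alpha in (0.001, 0.003, 0.01, 0.03, 0.1, 0.3, 1.0):
    theta = gradient_descent(X, y, alpha=alpha, iters=100)
    print(alpha, cost(theta, X, y))   # pick the largest alpha for which J still decreases
```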

## Applications

Apart from Linear Regression, Gradient Descent may also be used for other models with a differentiable cost function, such as Logistic Regression and Neural Networks.