Gradient Descent
Suppose we have a cost function $J$ and want to minimize it
- say it takes 2 parameters $\theta_0$ and $\theta_1$
- So we have $J(\theta_0, \theta_1)$ and want to find $\min_{\theta_0, \theta_1} J(\theta_0, \theta_1)$
Idea:
- start with some $(\theta_0, \theta_1)$ (say, $(0,0)$)
- keep changing $(\theta_0, \theta_1)$ to reduce $J(\theta_0, \theta_1)$ until we end up in minimum
In the pseudo code
- repeat until converges
- for $j = 0$ and $j = 1$
- $\theta_j = \theta_j - \alpha \cfrac{\partial}{\partial \theta_i} J(\theta_0, \theta_1)$
$\alpha$ is the learning rate, value that specifies how small are steps we take
Simultaneous Update
Note that the update for $(\theta_0, \theta_1)$ has to be simultaneous. That is
- $\tau_0 = \theta_0 - \alpha \cfrac{\partial}{\partial \theta_0} J(\theta_0, \theta_1)$
- $\tau_1 = \theta_1 - \alpha \cfrac{\partial}{\partial \theta_1} J(\theta_0, \theta_1)$
- $\theta_0 = \tau_0$
- $\theta_1 = \tau_1$
As you see, $\theta_0$ is used to calculate new value for $\theta_1$, so we cannot update it before we calculate new value for $\theta_0$.
This is called simultaneous update
Intuition
Let's see how it works
- assuming that there's only one variable $\theta_1$ and $\theta_1 \in \mathbb{R}$
$\theta_1 = \theta_1 - \alpha \cfrac{d}{d \theta_1} J(\theta_1)$
let's have a look at the partial derivative:
- $\beta = \alpha \cfrac{d}{d \theta_1} J(\theta_1)$
if the derivative is positive
-
- we're moving left:
- $\theta_1 = \theta_1 - \beta$
if the derivative is negative
-
- we're moving right:
- $\theta_1 = \theta_1 + \beta$
For two variables the cost function would look like that:
-
Learning Rate
- when $\alpha$ is too small - we're taking very small steps - too slow
- when $\alpha$ is too large - we're taking too big steps and may miss the minimum
- in this case not only may it fail to converge, but even diverge!
Approaching the minimum
- As we're approaching the local minimum, it takes smaller and smaller steps
- If $\theta_1$ is at the local minimum, then $\beta = 0$ and $\theta_1$ won't change
Convex Function
The cost function $J$ has to be convex if we don't want to end up in a local minimum.
- Given input data set $\{(x^{(i)}, y^{(i)}\}$ of size $m$
- We have our hypothesis $h(x) = \theta_0 + \theta_1 x$
- how to choose $\theta_0$ and $\theta_1$ so $h(x)$ is closest to the set of input data
We need to minimize the cost function:
- $J(\theta_1, \theta_2) = \cfrac{1}{2 m} \sum_{i = 1}{m} (h_0 x^{(i)} - y^{(i)} )^2$
- This is the squared error cost function
Let's simplify our expression:
- $\cfrac{\partial}{\partial \theta_j} J(\theta_0, \theta_1) = \cfrac{\partial}{\partial \theta_j} \cfrac{1}{2m} \sum (h_{\theta}(x^{i}) - y^{(i)} )^2 = \cfrac{\partial}{\partial \theta_j} \cfrac{1}{2m} \sum (\theta_0 + \theta_1 x^{i} - y^{(i)} )^2 $
Now we calculate the derivatives and have:
- $\cfrac{\partial}{\partial \theta_0} J(\theta_0, \theta_1) = \cfrac{1}{m} \sum (h_{\theta} (x^{(i)}) - y^{(i)})$
- $\cfrac{\partial}{\partial \theta_1} J(\theta_0, \theta_1) = \cfrac{1}{m} \sum (h_{\theta} (x^{(i)}) - y^{(i)}) \cdot x^{(i)}$
So for the regression the algorithm is
- repeat until converges
- $\theta_0 = \theta_0 - \alpha \cfrac{\partial}{\partial \theta_0} J(\theta_0, \theta_1) = \cfrac{1}{m} \sum (h_{\theta} (x^{(i)}) - y^{(i)})$
- $\theta_1 = \theta_1 - \alpha \cfrac{\partial}{\partial \theta_1} J(\theta_0, \theta_1) = \cfrac{1}{m} \sum (h_{\theta} (x^{(i)}) - y^{(i)}) \cdot x^{(i)}$
- (update simultaneously)
The square error cost function is convex, so we always converge to the global minimum.
For Multivariate Linear Regression we have $x^{(i)} \in \mathbb{R}^{n + 1} $and $\theta = \in \mathbb{R}^{n+1}$, where
- $n$ - is number of features
- $m$ - number of training examples
- and $x_0^{(i)} = 1$ for all $i$ (the slope)
So out cost function takes the following form:
- $J(\theta) = J(\theta_0, ... , \theta_n) = \cfrac{1}{2m} \sum_{i = 1}^{m} (h_{\theta} (x^{(i)}) - y^{(i)} )^2$
The algorithm:
- repeat
- simultaneously for all $i$
- $\theta_j = \theta_j - \alpha \cfrac{\partial}{\partial \theta_j} J(\theta)$
or, having calculated the derivatives:
- repeat
- simultaneously for all $i$
- $\theta_j = \theta_j - \alpha \cfrac{1}{m} \sum (h_{\theta}) (x^{(i)}) - y^{(i)} ) \cdot x_j^{(i)}$
Gradient Descent in Practice
Use Feature Scaling to help GD converge faster
Learning Rate
- How to choose $\alpha$?
- If GD works properly, cost function should decrease after each iteration
- if $J$ decreases by less than $\epsilon = 10^{-3}$ in one iteration - declare convergence
- If $J$ is increasing instead - need to make $\alpha$ smaller
Choosing $\alpha$:
- But don't choose $\alpha$ too small - it'll take too long to converge
- To choose $\alpha$ try
- $..., 0.001, 0.01, 0.1$, ... or increase it 3-fold
- And see what is acceptable
Applications
Apart from Linear Regression, Gradient Descent may also be used for
Sources