- Suppose we have a large number of features and a complex structure of data
- For Logistic Regression we would probably need to fit a very high-order polynomial

Say we have 100 features ($n = 100$) and we want to fit a multiplication of each pair of features

- i.e. we will have $x_1^2, x_1 x_2, ..., x_1 x_{100}, ... x_2^2, x_2 x_3, ..., $
- this gives us $\approx$ 5000 features (the number of pairs grows as $O(n^2)$)
- for combinations of triples we'll have $\approx$ 170 000 features!
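These counts are easy to verify; a quick sketch, assuming we count all monomials $x_i x_j$ with $i \le j$ (and similarly for triples):

```python
from math import comb

n = 100
# degree-2 monomials x_i * x_j with i <= j: C(n + 1, 2) of them
pairs = comb(n + 1, 2)
# degree-3 monomials x_i * x_j * x_k with i <= j <= k: C(n + 2, 3)
triples = comb(n + 2, 3)
print(pairs, triples)  # 5050 171700
```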

Next, suppose we have a computer vision problem: car detection

- we show it cars, then show it non-cars, and then test
- Say we have a 50 x 50 pixel image, 2500 pixels in total (7500 if RGB).
- If we want to fit polynomials, the number of features is too huge to do this!

So using Logistic Regression is certainly not a good way to handle lots of features, and this is where Neural Networks can help

This technique is based on how our brain works - it tries to mimic its behavior.

- A NN model is built from many *neurons* - cells in the brain
- the neuron, called an *activation unit*, takes features as input
- A network consists of many activation units

- the simplest activation unit is a Logistic Regression unit
- i.e. it is equivalent to a Logistic Regression model

it's called the *sigmoid (logistic) activation function*:

Bias unit

- $x_0$ is a bias unit, $x_0 = 1$ always
- $x_0$ may be omitted from a picture, but it's usually assumed in these cases

Weights

- arrows are "input wires"
- so this unit takes $x = [x_0, x_1, x_2, x_3]^T$,
- and the wires carry parameters $\theta = [\theta_0, \theta_1, \theta_2, \theta_3]^T$ - they are called *weights*

Result

- it applies the sigmoid function to the input, and as the result it returns
- $h_{\theta}(x) = g(\theta^T x) = \cfrac{1}{1 + e^{-\theta^T x}}$
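A single activation unit is therefore just Logistic Regression; a minimal NumPy sketch (the names `sigmoid` and `activation_unit` are mine):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def activation_unit(x, theta):
    # x = [x0, x1, x2, x3] with bias unit x0 = 1; theta holds the weights
    return sigmoid(theta @ x)

x = np.array([1.0, 0.5, -1.2, 2.0])    # x0 = 1 is the bias unit
theta = np.array([0.0, 1.0, 1.0, 1.0])
print(activation_unit(x, theta))       # sigmoid(1.3), approx. 0.786
```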

Let's have a look at an actual neural network

- A NN model is a network of many sigmoid activation units, organized in *layers* - where the next layer's input is the current layer's output

- the first layer is the *input layer*, called $x$ - it takes our feature vector $x = [x_1, ..., x_n]^T$
- the last layer (3rd on the picture) is the *output layer* - it gives us the final result
- all layers in between are called *hidden layers*
- (note that bias units $x_0$ and $a_0^{(2)}$ are omitted from the picture, but they are there)

We'll have the following notation:

- $a_i^{(j)}$ is the *activation of unit $i$ in layer $j$*
- $\theta^{(j)}$ - matrix of *weights* that controls the mapping from layer $j$ to layer $j + 1$ (i.e. $\theta^{(1)}$ holds the parameters for computing the 2nd layer, and so on)
- Neural Networks are parametrized by these $\theta$s

Mathematical representation of a neural network is (where $g$ is the sigmoid function)

- $a_1^{(2)} = g(\theta_{10}^{(1)} x_0 + \theta_{11}^{(1)} x_1 + \theta_{12}^{(1)} x_2 + \theta_{13}^{(1)} x_3)$
- $a_2^{(2)} = g(\theta_{20}^{(1)} x_0 + \theta_{21}^{(1)} x_1 + \theta_{22}^{(1)} x_2 + \theta_{23}^{(1)} x_3)$
- $a_3^{(2)} = g(\theta_{30}^{(1)} x_0 + \theta_{31}^{(1)} x_1 + \theta_{32}^{(1)} x_2 + \theta_{33}^{(1)} x_3)$
- and $h_{\theta}(x) = a_1^{(3)} = g(\theta_{10}^{(2)} a_0^{(2)} + \theta_{11}^{(2)} a_1^{(2)} + \theta_{12}^{(2)} a_2^{(2)} + \theta_{13}^{(2)} a_3^{(2)})$

so we have

- 3 input units $x_1, x_2, x_3$ (plus bias $x_0 = 1$)
- 3 hidden units in 1 hidden layer $a_1^{(2)}, a_2^{(2)}, a_3^{(2)}$ in layer 2 (plus bias $a_0^{(2)} = 1$)

In this example the dimension of $\theta^{(1)}$ is $\theta^{(1)} \in \mathbb{R}^{3 \times 4}$

General rule

- if a network has $s_j$ units in layer $j$ and $s_{j + 1}$ units in layer $j + 1$ , then
- $\theta^{(j)} \in \mathbb{R}^{s_{j + 1} \times (s_{j} + 1)}$ (i.e. it has dimension $s_{j + 1} \times (s_{j} + 1)$)
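A quick check of this rule for the example above (the layer sizes here are an assumption matching the picture):

```python
# layer sizes: s1 = 3 inputs, s2 = 3 hidden units, s3 = 1 output
sizes = [3, 3, 1]
# theta^(j) has dimension s_{j+1} x (s_j + 1)
shapes = [(sizes[j + 1], sizes[j] + 1) for j in range(len(sizes) - 1)]
print(shapes)  # [(3, 4), (1, 4)]
```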

For the first step we have

- $a_1^{(2)} = g(z_1^{(2)})$ where $z_1^{(2)} = \theta_{10}^{(1)} x_0 + \theta_{11}^{(1)} x_1 + \theta_{12}^{(1)} x_2 + \theta_{13}^{(1)} x_3$
- $a_2^{(2)} = g(z_2^{(2)})$, and
- $a_3^{(2)} = g(z_3^{(2)})$
- ($z_1^{(2)}, z_2^{(2)}, z_3^{(2)}$ are *linear combinations* of $x_1, x_2, x_3$)

So we have 3 vectors

- $\theta_1^{(1)} = [\theta_{10}^{(1)}, \theta_{11}^{(1)}, \theta_{12}^{(1)}, \theta_{13}^{(1)}]^T$,
- $x = [x_0, x_1, x_2, x_3]^T$,
- $z^{(2)} = [z_1^{(2)}, z_2^{(2)}, z_3^{(2)}]^T$

And can rewrite the first step in a vectorized form:

- instead of $z_1^{(2)} = \theta_{10}^{(1)} x_0 + \theta_{11}^{(1)} x_1 + \theta_{12}^{(1)} x_2 + \theta_{13}^{(1)} x_3$, we write $z_1^{(2)} = \theta_1^{(1)} \cdot x$

- 1st step
- $z^{(2)} = \theta^{(1)} \cdot x = \theta^{(1)} \cdot a^{(1)}$ and
- $a^{(2)} = g(z^{(2)}) \in \mathbb{R}^3$

- next step
- we add $a_0^{(2)} = 1$, so $a^{(2)} = [1, a_1^{(2)}, a_2^{(2)}, a_3^{(2)}]^T \in \mathbb{R}^{4}$
- and calculate $z^{(3)} = \theta^{(2)} \cdot a^{(2)}$

- finally
- $h_{\theta}(x) = a^{(3)} = g(z^{(3)})$

This process is called *forward propagation*
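The vectorized steps above can be sketched in NumPy as follows (`forward_propagation` and `thetas` are my names; `thetas` is a list of weight matrices, one per layer transition):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def forward_propagation(x, thetas):
    a = x                               # a^(1) = x
    for theta in thetas:
        a = np.concatenate(([1.0], a))  # add the bias unit a_0 = 1
        a = sigmoid(theta @ a)          # a^(l+1) = g(theta^(l) a^(l))
    return a                            # h_theta(x)

# all-zero weights, just to show the shapes; every unit then outputs g(0) = 0.5
thetas = [np.zeros((3, 4)), np.zeros((1, 4))]
print(forward_propagation(np.array([1.0, 2.0, 3.0]), thetas))  # [0.5]
```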

Let's have a look at the 2nd and 3rd layers of our NN

- $h_{\theta}(x) = g(\theta^{(2)} \cdot a^{(2)})$, and $a^{(2)}$ is given by the 2nd layer units
- so it's doing a logistic regression, but it uses $a^{(2)} = [a_0^{(2)} ... a_3^{(2)}]$ for features (instead of $x$s)
- and features $a^{(2)} = [a_0^{(2)} ... a_3^{(2)}]$ are themselves learned by the previous layer

- We can create NNs with as many layers as we want
- The way the neurons are connected is called the *architecture*

What should we do if we want to use it for multi-class classification?

- We can have multiple output units! (similar to One-vs-All Classification)

So we want

- $h_{\theta}(x) \approx \left[\begin{matrix} 1 \\ 0 \\ 0 \\ 0\end{matrix} \right]$ if an item belongs to 1st category
- $h_{\theta}(x) \approx \left[\begin{matrix} 0 \\ 1 \\ 0 \\ 0\end{matrix} \right]$ if it belongs to 2nd category
- and so on

For a training set $\{(x^{(i)}, y^{(i)})\}$

- we turn $y$ into one of $\left\{ \left[\begin{matrix} 1 \\ 0 \\ 0 \\ 0\end{matrix} \right], \left[\begin{matrix} 0 \\ 1 \\ 0 \\ 0\end{matrix} \right], \left[\begin{matrix} 0 \\ 0 \\ 1 \\ 0\end{matrix} \right], \left[\begin{matrix} 0 \\ 0 \\ 0 \\ 1\end{matrix} \right] \right\}$ - instead of $y \in \{1, 2, 3, 4\}$,
- so, when training, we would like to have $h_{\theta}(x^{(i)}) \approx y^{(i)} \in \mathbb{R}^4$
- then we select the class $k$ with the highest $h_{\theta}(x^{(i)})_k$, as in One-vs-All Classification
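A sketch of the label encoding and the final class pick (variable names are mine):

```python
import numpy as np

K = 4
y = np.array([2, 1, 4])  # original labels in {1, ..., K}
Y = np.eye(K)[y - 1]     # one-hot rows, e.g. label 2 -> [0, 1, 0, 0]

h = np.array([0.1, 0.7, 0.15, 0.05])  # network output h_theta(x)
predicted = np.argmax(h) + 1          # class with the highest output
print(Y[0], predicted)                # [0. 1. 0. 0.] 2
```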

Suppose we have $m$ training examples $\{(x^{(i)}, y^{(i)})\}$

- $L$ - total number of layers in network
- $S_l$ - number of units (without bias) in layer $l$
- $K = S_L$ - number of units in the output layer, i.e. the network classifies $K$ classes

output $h_{\theta}(x) \in \mathbb{R}^{K}$

For example,

- $L = 3$
- $S_1 = 3, S_2 = 4, K = S_3 = 3$

For Logistic Regression with Regularization we have the following cost function:

$$J(\theta) = -\cfrac{1}{m} \sum_{i = 1}^{m} \Big[ y^{(i)} \log h_{\theta}(x^{(i)}) + (1 - y^{(i)}) \log (1 - h_{\theta}(x^{(i)})) \Big] + \cfrac{\lambda}{2m} \sum_{j = 1}^{n} \theta_j^2$$

In neural networks

- $h_{\theta}(x) \in \mathbb{R}^K$
- let $h_{\theta}(x)_i$ be the $i$th output

So we have

$$J(\theta) = -\cfrac{1}{m} \sum_{i = 1}^{m} \sum_{k = 1}^{K} \Big[ y_k^{(i)} \log h_{\theta}(x^{(i)})_k + (1 - y_k^{(i)}) \log (1 - h_{\theta}(x^{(i)})_k ) \Big] + \cfrac{\lambda}{2m} \sum_{l = 1}^{L - 1} \sum_{i = 1}^{S_l} \sum_{j = 1}^{S_{l + 1}} (\theta_{ji}^{(l)})^2$$

here we also don't regularize bias inputs
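This cost can be computed with vectorized forward propagation; a minimal sketch, assuming one-hot labels `Y` and a list of weight matrices `thetas` (all names are mine):

```python
import numpy as np

def nn_cost(thetas, X, Y, lam):
    """X: m x n features, Y: m x K one-hot labels, thetas: weight matrices."""
    m = X.shape[0]
    A = X
    for theta in thetas:                      # forward propagation
        A = np.hstack([np.ones((m, 1)), A])   # add the bias column
        A = 1.0 / (1.0 + np.exp(-(A @ theta.T)))
    H = A                                     # m x K network outputs
    cost = -np.sum(Y * np.log(H) + (1 - Y) * np.log(1 - H)) / m
    # regularization: skip the bias column j = 0
    reg = lam / (2 * m) * sum(np.sum(t[:, 1:] ** 2) for t in thetas)
    return cost + reg
```

Sanity check: with all-zero weights every output is $g(0) = 0.5$, so the unregularized cost is $K \ln 2$ regardless of the data.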

- we need to find such $\theta$ that $J(\theta)$ is minimal
- for that we can use Gradient Descent or other advanced optimization techniques
- for GD we need to compute partial derivative $\cfrac{\partial}{\partial \theta_{ij}^{(l)}} J(\theta)$ with respect to each $\theta_{ij}^{(l)}$

*Back Propagation* is a technique for calculating partial derivatives in neural networks

suppose we have a training example $(x, y)$

- To compute cost $J(\theta)$ we use Forward Propagation (vectorized)

- $a^{(1)} = x$
- $z^{(2)} = \theta^{(1)} \cdot a^{(1)}$
- $a^{(2)} = g(z^{(2)})$ (plus adding $a_0^{(2)} = 1$)
- $z^{(3)} = \theta^{(2)} \cdot a^{(2)}$
- $a^{(3)} = g(z^{(3)})$ (plus adding $a_0^{(3)} = 1$)
- $z^{(4)} = \theta^{(3)} \cdot a^{(3)}$
- $a^{(4)} = g(z^{(4)}) = h_{\theta}(x)$ (the output layer - no bias unit is added here)

To compute derivatives we use Back Propagation

- let $\delta_j^{(l)}$ be the "error" of node $j$ in layer $l$ (for $a_j^{(l)}$)
- for each output unit we compute
- $\delta_j^{(L)} = a_j^{(L)} - y_j = h_{\theta}(x)_j - y_j$
- or, vectorized:

- $\delta^{(L)} = a^{(L)} - y$

next, we compute $\delta$ for all earlier layers

- $\delta^{(3)} = (\theta^{(3)})^T \cdot \delta^{(4)} .* g'(z^{(3)})$
- where
- $.*$ - element-wise product
- $g'(z^{(3)})$ - derivative of function $g$ at $z^{(3)}$
- $g'(z^{(3)}) = a^{(3)} .* (1 - a^{(3)})$

- $\delta^{(2)} = (\theta^{(2)})^T \cdot \delta^{(3)} .* g'(z^{(2)})$
- We don't compute anything for the first layer - it's the input layer, and there can be no errors

Our partial derivative is

- $\cfrac{\partial}{\partial \theta_{ij}^{(l)}} J(\theta) = a_j^{(l)} \cdot \delta_i^{(l + 1)}$
- (here we ignore regularization parameter $\lambda$ - i.e. we assume no regularization at the moment)

We have a training set $\{(x^{(i)}, y^{(i)})\}$ with $m$ training examples

Set $\Delta_{ij}^{(l)} \leftarrow 0$ for all $l$, $i$, $j$ (it's used to compute $\cfrac{\partial}{\partial \theta_{ij}^{(l)}} J(\theta)$)

For each $(x^{(i)}, y^{(i)})$:

- set $a^{(1)} = x^{(i)}$
- perform Forward Propagation to compute $a^{(l)}$ for $l = 2, 3, ..., L$
- perform Back Propagation
- Using $y^{(i)}$ compute $\delta^{(L)} = a^{(L)} - y^{(i)}$
- Then compute all $\delta^{(L - 1)}, \delta^{(L - 2)}, ..., \delta^{(2)}$
- Set $\Delta_{ij}^{(l)} \leftarrow \Delta_{ij}^{(l)} + a_j^{(l)} \delta_i^{(l + 1)} $
- or, vectorized:

- $\Delta^{(l)} \leftarrow \Delta^{(l)} + \delta^{(l + 1)} (a^{(l)})^T$

- Then we calculate
- $D_{ij}^{(l)} \leftarrow \cfrac{1}{m} \Delta_{ij}^{(l)} + \cfrac{\lambda}{m} \theta_{ij}^{(l)}$ if $j \ne 0$
- $D_{ij}^{(l)} \leftarrow \cfrac{1}{m} \Delta_{ij}^{(l)}$ if $j = 0$

- That value can be used for GD:

- $\cfrac{\partial}{\partial \theta_{ij}^{(l)}} J(\theta) = D_{ij}^{(l)}$

To sum up, for each training example we

- propagate forward using $x$
- we back-propagate using $y$
- add up all error units (for each $\theta$) into matrix $\Delta$
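The whole loop can be sketched like this (a minimal, per-example version; all names are mine, and the labels `Y` are assumed one-hot):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def backprop(thetas, X, Y, lam):
    """Gradients D for a list of weight matrices. X: m x n, Y: m x K one-hot."""
    m = X.shape[0]
    grads = [np.zeros_like(t) for t in thetas]  # the Delta accumulators
    for x, y in zip(X, Y):
        # forward propagation, keeping each layer's activation (with bias)
        activations = []
        a = x
        for theta in thetas:
            a = np.concatenate(([1.0], a))
            activations.append(a)
            a = sigmoid(theta @ a)
        delta = a - y                            # delta^(L) = a^(L) - y
        for l in range(len(thetas) - 1, -1, -1):
            # Delta^(l) += delta^(l+1) (a^(l))^T
            grads[l] += np.outer(delta, activations[l])
            if l > 0:
                a_l = activations[l]
                # delta^(l) = (theta^(l))^T delta^(l+1) .* a^(l) .* (1 - a^(l)),
                # dropping the bias component
                delta = (thetas[l].T @ delta)[1:] * (a_l[1:] * (1 - a_l[1:]))
    # D = (1/m) Delta, plus (lambda/m) theta for non-bias columns
    D = []
    for g, t in zip(grads, thetas):
        g = g / m
        g[:, 1:] += (lam / m) * t[:, 1:]
        D.append(g)
    return D
```

The returned matrices match the numerical gradient of the cost, which is exactly what gradient checking (below) verifies.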

Say we have a single training example $(x, y)$ and 1 output unit and we ignore regularization

$\text{cost}(x, y) = y \log h_{\theta}(x) + (1 - y) \log (1 - h_{\theta}(x))$

- it measures how well the network is doing on that training example - how close is it to $y$?
- $\delta_j^{(l)}$ = "error" of cost for $a_j^{(l)}$ (unit $j$ in layer $l$)

formally,

- $\delta_j^{(l)} = \cfrac{\partial}{\partial z_j^{(l)}} \text{cost}(x, y)$
- it's a partial derivative with respect to $z_j^{(l)}$
- they measure how much we would need to change the NN weights in order
- to affect intermediate values $z^{(2)}, z^{(3)}, ...,$
- and the final output $z^{(4)}$,
- and therefore, overall cost

Implementing back-propagation can be hard and bug-prone, but we can perform Numerical Gradient Checking to test our implementation

suppose $\theta \in \mathbb{R}$

- We can estimate the slope of $J$ at $\theta$ by calculating $J(\theta + \epsilon) - J(\theta - \epsilon)$ for small $\epsilon$
- $\cfrac{d}{d \theta} J(\theta) \approx \cfrac{J(\theta + \epsilon) - J(\theta - \epsilon)}{2\epsilon}$
- And that gives us a numerical estimate of the gradient at that point

when $\theta \in \mathbb{R}^n$

- for partial derivative with respect to each $\theta_i$ we calculate
- $\cfrac{\partial}{\partial \theta_1} J(\theta) \approx \cfrac{J(\theta_1 + \epsilon, \theta_2, ..., \theta_n) - J(\theta_1 - \epsilon, \theta_2, ..., \theta_n)}{2\epsilon}$
- $\cfrac{\partial}{\partial \theta_2} J(\theta) \approx \cfrac{J(\theta_1, \theta_2 + \epsilon, ..., \theta_n) - J(\theta_1, \theta_2 - \epsilon, ..., \theta_n)}{2\epsilon}$
- ...
- $\cfrac{\partial}{\partial \theta_n} J(\theta) \approx \cfrac{J(\theta_1, \theta_2, ..., \theta_n + \epsilon) - J(\theta_1, \theta_2, ..., \theta_n - \epsilon)}{2\epsilon}$

- i.e. we add and subtract only values for $\theta_i$ we calculate derivative for
- this gives us a way to numerically estimate all partial derivatives
- and then we check if this numerical estimate $\approx$ the derivative from back propagation
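A sketch of the numerical check (`numerical_gradient` is my name for it):

```python
import numpy as np

def numerical_gradient(J, theta, eps=1e-4):
    grad = np.zeros_like(theta)
    for i in range(theta.size):
        perturb = np.zeros_like(theta)
        perturb.flat[i] = eps
        # central difference: perturb only the i-th parameter
        grad.flat[i] = (J(theta + perturb) - J(theta - perturb)) / (2 * eps)
    return grad

# sanity check: for J(theta) = sum(theta^2) the gradient is 2 * theta
J = lambda t: np.sum(t ** 2)
print(numerical_gradient(J, np.array([1.0, 2.0])))  # approx. [2. 4.]
```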

To implement Back Propagation use the following approach:

- Implement back propagation to compute $D^{(1)}, D^{(2)}, ...$
- Implement numerical gradient check to compute approximations of partial derivatives
- Make sure they have similar values
- Turn off gradient checking, use only back propagation for learning (it's much more computationally efficient)

- We need to have initial values for $\theta$
- In Logistic Regression we used $\theta = [0, 0, ..., 0]^T$
- It won't work for NNs

Suppose we set all $\theta_{ij}^{(l)}$ to 0

- Then we'll have the same values $a_1^{(2)} = a_2^{(2)} = ... = a_{s_2}^{(2)}$, and the same holds for $\delta^{(2)}$

- (after each update, parameters corresponding to inputs are identical, and all hidden units will compute the same value)

- And therefore, all partial derivatives will also be equal
- This is called *the problem of symmetric weights*

We can break the symmetry with random initialization

- so, initialize each $\theta_{ij}^{(l)}$ with a random value from $[-\epsilon, \epsilon]$:
- $\theta_{ij}^{(l)} \leftarrow_{r} [-\epsilon; \epsilon]$
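A sketch of this initialization (the function name is mine; $\epsilon = 0.12$ is just a commonly used small default):

```python
import numpy as np

def random_init(s_in, s_out, eps=0.12):
    # theta mapping a layer with s_in units (plus bias) to s_out units
    return np.random.uniform(-eps, eps, size=(s_out, s_in + 1))

theta1 = random_init(3, 3)  # maps layer 1 (3 units) to layer 2 (3 units)
print(theta1.shape)         # (3, 4)
```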

- Randomly initialize weights $\theta$
- Implement forward propagation to get $h_{\theta}(x^{(i)})$ for any $x^{(i)}$
- Implement code to compute cost function $J(\theta)$
- Implement back propagation to compute partial derivatives $\cfrac{\partial}{\partial \theta_{ij}^{(l)}} J(\theta)$
- Use gradient checking to compare numerical estimations of partial derivatives vs values from back propagation
- Use Gradient Descent or another optimization technique to minimize $J(\theta)$

**NB**: $J(\theta)$ is non-convex, so optimization can get stuck in a local minimum - but usually it's not a problem

Things to implement:

- Cost Function (both forward propagation and back propagation)
- One-vs-All Prediction for NN
- Numerical Gradient Check