# ML Wiki

## Test Errors

How you can tell that a hypothesis overfits?

• plotting - not always good

We can split all the data into 2 subsets

• training set $\approx$ 70% of data, $m$ - number of examples in the training set
• testing set $\approx$ 30% of data, $m_{\text{test}}$ - number of examples in the testing set

it's better to choose examples for training/testing sets randomly

Error Metrics
Prediction Classification
Example Model Linear Regression Logistic Regression
Test Error $J_{\text{test}}(\theta) = \cfrac{1}{m_{\text{test}}} \sum \text{error} \big(h_{\theta}(x_{\text{test}}^{(i)}), y_{\text{test}}^{(i)} \big)$
$\text{error}(h_{\theta}(x), y)$ Average Square Error
$\text{error}(h_{\theta}(x), y) = \cfrac{1}{2} (h_{\theta}(x) - y)^2$
Misclassification Error
$\text{error}(h_{\theta}(x), y) = \left\{\begin{array}{l} 0 \text { if classification is correct} \\ 1 \text{ otherwise} \end{array}\right.$

## Cross-Validation

Generally cross-validation is used to find the best value of some parameter

• we still have training and test sets
• but additionally we have a cross-validation set to test the performance of our model depending on the parameter

### Motivation: Model Selection Problem

We have a following problem of Model Selection

Suppose we are about to create a model and not sure what degree of polynomial to choose

So the problem

• So we want to try 10 models
• let $d$ denote the degree of polynomial
• $d=1: h_{\theta} = \theta_0 + \theta_1 x$
• $d=2: h_{\theta} = \theta_0 + \theta_1 x + \theta_2 x^2$
• ...
• $d=10: h_{\theta} = \theta_0 + \theta_1 x + \theta_2 x^2 + ... + \theta_{10} x^{10}$

Results

• Each $d$ will give us a vector (or matrix) $\theta^{(d)}$
• Now we can try all $d$ models and see which gives the best (lowest) $J_{\text{test}}(\theta^{(i)})$
• Let's say we decided to choose 5th model

How well this model generalize?

• the test error is $J_{\text{test}}(\theta^{(5)})$
• but $J_{\text{test}}(\theta^{(5)})$ is a very optimistic estimate of the generalization error
• (because the lowest/best error was picked up - so the test error might be biased towards that)
• i.e. we the extra parameter $d$ was fit to test, and it's not fair to estimate our hypothesis on the test set, because we already used that set to get the best $d$

• instead of splitting the data set into 2 categories, we split into 3 sets:
• training set ($\approx$ 60%)
• $x^{(i)}, y^{(i)}$, total $m$ examples
• cross-validation set (or cv, $\approx$ 20%)
• $x_{\text{cv}}^{(i)}, y_{\text{cv}}^{(i)}$, total $m_{\text{cv}}$ examples
• test set ($\approx$ 20%)
• $x_{\text{test}}^{(i)}, y_{\text{test}}^{(i)}$, total $m_{\text{test}}$ examples

Now we can define

• Training error
$J(\theta) = J_{\text{train}}(\theta) = \cfrac{1}{2m} \sum \text{cost}(x^{(i)}, y^{(i)})$
• Cross-Validation error
$J_{\text{cv}}(\theta) = \cfrac{1}{2m_{\text{cv}}} \sum \text{cost}(x_{\text{cv}}^{(i)}, y_{\text{cv}}^{(i)})$
• Test Error
$J_{\text{test}}(\theta) = \cfrac{1}{2m_{\text{test}}} \sum \text{cost}(x_{\text{test}}^{(i)}, y_{\text{test}}^{(i)})$

So for Model Selection to fit $d$, we

• obtain $\theta^{(1)}, ..., \theta^{(d)}$ and select best (lowest) $J_{\text{cv}}(\theta^{(i)})$
• estimate generalization error for the test set $J_{\text{test}}(\theta^{(i)})$

### Tuning Learning Parameter

General algorithm

• Split data set into
• Learning set
• Validation set
• Test set
• use validation set for tuning control parameters
• use test set only for final evaluation

Let $\gamma$ be the control parameter

• for every possible value of $\gamma$
• build a classification model $C$ using the learning set
• use the validation set to estimate the expected error rate of $C$
• select the optimal value of the control parameter
• that is the value of $\gamma$ s.t. the error rate is minimal

### Regularization Parameter $\lambda$

Suppose now we're fitting a model with high-order polynomial

• $h_{\theta}(x) = \theta_0 + \theta_1 x + ... + \theta_4 x^4$
• to prevent overfitting we use regularization
• $J(\theta) = \cfrac{1}{m} \sum \text{cost}(h_{\theta}(x^{(i)}), y^{(i)}) + \cfrac{\lambda}{2m} \sum_{j=1}^{m} \theta_j^2$ • (a) If $\lambda$ if large (say $\lambda = 10000$), all $\theta$ are penalized and $\theta_1 \approx \theta_2 \approx ... \approx 0$, $h_{\theta}(x) \approx \theta_0$
• (b) if $\lambda$ is intermediate, we fit well
• (c) if $\lambda$ is small (close to 0) we fit too well, i.e. we overfit

How can we chose good $\lambda$? Let's define

• $J_{\text{train}}(\theta) = \cfrac{1}{m} \sum \text{cost}(h_{\theta}(x^{(i)}), y^{(i)})$ (same as $J(\theta)$, but without regularization)
• $J_{\text{cv}}(\theta)$ and $J_{\text{test}}(\theta)$ - same, but on cross-validation and test datasets respectively

Now we

• Choose a range of possible values for $\lambda$ (say 0, 0.01, 0.02, 0.04, ..., 10.24) - that gives us 12 models to checks
• For each $\lambda^{(i)}$,
• calculate $\theta^{(i)}$,
• calculate $J_{\text{cv}}(\theta^{(i)})$,
• and take $\lambda^{(i)}$ with lowest $J_{\text{cv}}(\theta^{(i)})$

After that we report the test error $J_{\text{test}}(\theta^{(i)})$

## K-Fold Cross-Validation

$K$-Fold Cross-Validation If we want to reduce variability in the data

• we can perform multiple rounds of cross-validation using different partitions
• and then average the results over all the rounds

We're given a dataset $S$ sampled from the population $D$

• we partition $S$ into $K$ equal disjoint subsets $(T_1, ..., T_K)$ (typically 5-10 subsets)
• then perform $K$ steps, and at step $k$ do:
• use $R_k = S - T_k$ as the training set
• build classifier $C_k$ using $R_k$
• use $T_k$ as the test test, compute error $\delta_k = \text{error}(C_k, T_k)$
• let $\delta^* = \cfrac{1}{K} \sum_{k = 1}^K \delta_k$
• this is the expected error rate • note that there's only two subsets at each iteration
• and they are used to estimate the true error

### Tuning Learning Parameter

Choosing best value for parameter $\gamma$ with $K$-Fold Cross-Validation

• you have to put some data aside before training your classifier
• i.e. split your data into learning set and test set
• for every possible value of $\gamma$
• use $K$-fold Cross-Validation to estimate the expected error rate
• use the learning set as the set $S$ (and do $K$-fold Cross-Validation only on it)
• select the optimal value of $\gamma$
• and then use the test set for the final evaluation

### Stratified K-Fold Cross-Validation

What if we want to preserve the class distribution over $K$ runs?

• then for each $T_i$ pick up the same proportion of labels as in the original dataset
• • it's very important if the test distribution is not uniform