Test Errors
How can you tell that a hypothesis overfits?
 plotting the hypothesis is not always a good option (e.g. it's hard to visualize when there are many features)
We can split all the data into 2 subsets
 training set - $\approx$ 70% of the data; $m$ is the number of examples in the training set
 testing set - $\approx$ 30% of the data; $m_{\text{test}}$ is the number of examples in the testing set
it's better to choose examples for training/testing sets randomly
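A minimal sketch of such a random split (assuming the data already sits in NumPy arrays `X` and `y`; the permutation-based shuffling and the exact 70/30 ratio are just one possible choice):

```python
import numpy as np

def train_test_split(X, y, train_ratio=0.7, seed=0):
    """Randomly split (X, y) into a training set and a testing set."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(y))      # shuffle the example indices
    m = int(train_ratio * len(y))      # size of the training set
    train_idx, test_idx = idx[:m], idx[m:]
    return X[train_idx], y[train_idx], X[test_idx], y[test_idx]

# example usage with toy data
X = np.arange(20).reshape(10, 2)
y = np.arange(10)
X_train, y_train, X_test, y_test = train_test_split(X, y)
```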
Error Metrics

For both prediction and classification the test error has the same form:
$J_{\text{test}}(\theta) = \cfrac{1}{m_{\text{test}}} \sum_{i=1}^{m_{\text{test}}} \text{error} \big(h_{\theta}(x_{\text{test}}^{(i)}), y_{\text{test}}^{(i)} \big)$
only the per-example error $\text{error}(h_{\theta}(x), y)$ differs:
 Prediction (example model: Linear Regression) - Average Squared Error: $\text{error}(h_{\theta}(x), y) = \cfrac{1}{2} \big(h_{\theta}(x) - y\big)^2$
 Classification (example model: Logistic Regression) - Misclassification Error: $\text{error}(h_{\theta}(x), y) = \left\{\begin{array}{l} 0 \text{ if the classification is correct} \\ 1 \text{ otherwise} \end{array}\right.$
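A small sketch of both error metrics (the function names are mine; `h` is assumed to return real-valued predictions in the regression case and class labels in the classification case):

```python
import numpy as np

def test_error_regression(h, X_test, y_test):
    """Average squared error: (1 / m_test) * sum of (1/2) * (h(x) - y)^2."""
    predictions = h(X_test)
    return np.mean(0.5 * (predictions - y_test) ** 2)

def test_error_classification(h, X_test, y_test):
    """Misclassification error: fraction of examples labeled incorrectly."""
    predictions = h(X_test)
    return np.mean(predictions != y_test)
```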

Cross-Validation
Generally, cross-validation is used to find the best value of some parameter
 we still have training and test sets
 but additionally we have a cross-validation set to test the performance of our model depending on the parameter
We have the following problem of Model Selection
 Suppose we are about to create a model and are not sure what degree of polynomial to choose
So the problem is:
 we want to try 10 models
 let $d$ denote the degree of polynomial
 $d=1: h_{\theta}(x) = \theta_0 + \theta_1 x$
 $d=2: h_{\theta}(x) = \theta_0 + \theta_1 x + \theta_2 x^2$
 ...
 $d=10: h_{\theta}(x) = \theta_0 + \theta_1 x + \theta_2 x^2 + ... + \theta_{10} x^{10}$
Results
 Each $d$ will give us a vector (or matrix) $\theta^{(d)}$
 Now we can try all 10 models and see which gives the best (lowest) $J_{\text{test}}(\theta^{(d)})$
 Let's say we decided to choose the 5th model
How well does this model generalize?
 the test error is $J_{\text{test}}(\theta^{(5)})$
 but $J_{\text{test}}(\theta^{(5)})$ is a very optimistic estimate of the generalization error
 (because we picked the lowest/best error, the test error might be biased towards it)
 i.e. the extra parameter $d$ was fit to the test set, so it's not fair to evaluate our hypothesis on the test set - we already used that set to pick the best $d$
To address that problem
 instead of splitting the data set into 2 subsets, we split it into 3 sets:
 training set ($\approx$ 60%)
 $x^{(i)}, y^{(i)}$, total $m$ examples
 cross-validation set (or cv, $\approx$ 20%)
 $x_{\text{cv}}^{(i)}, y_{\text{cv}}^{(i)}$, total $m_{\text{cv}}$ examples
 test set ($\approx$ 20%)
 $x_{\text{test}}^{(i)}, y_{\text{test}}^{(i)}$, total $m_{\text{test}}$ examples
Now we can define
 $J(\theta) = J_{\text{train}}(\theta) = \cfrac{1}{2m} \sum \text{cost}\big(h_{\theta}(x^{(i)}), y^{(i)}\big)$
 $J_{\text{cv}}(\theta) = \cfrac{1}{2m_{\text{cv}}} \sum \text{cost}\big(h_{\theta}(x_{\text{cv}}^{(i)}), y_{\text{cv}}^{(i)}\big)$
 $J_{\text{test}}(\theta) = \cfrac{1}{2m_{\text{test}}} \sum \text{cost}\big(h_{\theta}(x_{\text{test}}^{(i)}), y_{\text{test}}^{(i)}\big)$
So for Model Selection, to fit $d$ we:
 obtain $\theta^{(1)}, ..., \theta^{(10)}$ and select the $d$ with the best (lowest) $J_{\text{cv}}(\theta^{(d)})$
 estimate the generalization error on the test set as $J_{\text{test}}(\theta^{(d)})$ for the chosen $d$
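A minimal sketch of this degree-selection procedure. It uses `np.polyfit` as a stand-in for however $\theta^{(d)}$ is actually obtained, and the toy data plus the 60/20/20 slicing are assumptions, not part of the notes:

```python
import numpy as np

def squared_error(theta, X, y):
    """J(theta) = (1 / 2m) * sum of (h_theta(x) - y)^2 for a polynomial model."""
    predictions = np.polyval(theta, X)
    return np.mean((predictions - y) ** 2) / 2

# toy 1-D data, already split 60/20/20 into train / cv / test
rng = np.random.default_rng(0)
X = rng.uniform(-1, 1, 100)
y = X ** 3 + 0.1 * rng.normal(size=100)
X_tr, y_tr = X[:60], y[:60]
X_cv, y_cv = X[60:80], y[60:80]
X_te, y_te = X[80:], y[80:]

# fit theta^(d) for d = 1..10 on the training set, pick d by the cv error
thetas = {d: np.polyfit(X_tr, y_tr, d) for d in range(1, 11)}
best_d = min(thetas, key=lambda d: squared_error(thetas[d], X_cv, y_cv))

# report the generalization error only once, on the test set
print(best_d, squared_error(thetas[best_d], X_te, y_te))
```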
Tuning Learning Parameter
General algorithm
 Split data set into
 Learning set
 Validation set
 Test set
 use validation set for tuning control parameters
 use test set only for final evaluation
Let $\gamma$ be the control parameter
 for every possible value of $\gamma$
 build a classification model $C$ using the learning set
 use the validation set to estimate the expected error rate of $C$
 select the optimal value of the control parameter
 that is the value of $\gamma$ s.t. the error rate is minimal
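A sketch of this tuning loop, using the number of neighbours $k$ of a tiny k-NN classifier as the control parameter $\gamma$ (the classifier choice and the toy data are my assumptions; the notes don't fix either):

```python
import numpy as np

def knn_error(X_train, y_train, X_eval, y_eval, k):
    """Error rate of a k-nearest-neighbours classifier built on (X_train, y_train)."""
    dists = np.linalg.norm(X_eval[:, None, :] - X_train[None, :, :], axis=2)
    nearest = np.argsort(dists, axis=1)[:, :k]
    preds = (y_train[nearest].mean(axis=1) > 0.5).astype(int)   # majority vote
    return np.mean(preds != y_eval)

# toy two-class data split into learning / validation / test sets
rng = np.random.default_rng(0)
X = rng.normal(size=(300, 2))
y = (X[:, 0] + X[:, 1] > 0).astype(int)
X_learn, y_learn = X[:180], y[:180]
X_val, y_val = X[180:240], y[180:240]
X_test, y_test = X[240:], y[240:]

# for every candidate gamma (= k), estimate the error on the validation set
candidates = [1, 3, 5, 7, 9, 11]
val_errors = {k: knn_error(X_learn, y_learn, X_val, y_val, k) for k in candidates}
best_k = min(val_errors, key=val_errors.get)

# the test set is used only for the final evaluation
print(best_k, knn_error(X_learn, y_learn, X_test, y_test, best_k))
```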
Suppose now we're fitting a model with a high-order polynomial
 $h_{\theta}(x) = \theta_0 + \theta_1 x + ... + \theta_4 x^4$
 to prevent overfitting we use regularization
 $J(\theta) = \cfrac{1}{m} \sum \text{cost}(h_{\theta}(x^{(i)}), y^{(i)}) + \cfrac{\lambda}{2m} \sum_{j=1}^{n} \theta_j^2$
 (a) If $\lambda$ is large (say $\lambda = 10000$), all $\theta_j$ are heavily penalized and $\theta_1 \approx \theta_2 \approx ... \approx 0$, so $h_{\theta}(x) \approx \theta_0$
 (b) if $\lambda$ is intermediate, we fit well
 (c) if $\lambda$ is small (close to 0) we fit too well, i.e. we overfit
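As a concrete illustration, here is a minimal sketch for regularized linear regression fitted with the regularized normal equation; the normal-equation approach and the toy data are my assumptions, the notes don't prescribe how $\theta$ is obtained:

```python
import numpy as np

def fit_regularized(X, y, lam):
    """Regularized normal equation: theta = (X^T X + lam * L)^-1 X^T y,
    where L is the identity with a 0 in the top-left (theta_0 is not penalized)."""
    L = np.eye(X.shape[1])
    L[0, 0] = 0.0
    return np.linalg.solve(X.T @ X + lam * L, X.T @ y)

# toy data: 4th-degree polynomial features of a 1-D input
rng = np.random.default_rng(0)
x = rng.uniform(-1, 1, 30)
y = np.sin(2 * x) + 0.1 * rng.normal(size=30)
X = np.column_stack([x ** p for p in range(5)])   # columns: 1, x, x^2, x^3, x^4

for lam in (0.0, 1.0, 10000.0):
    theta = fit_regularized(X, y, lam)
    print(lam, np.round(theta, 3))   # a large lambda drives theta_1..theta_4 towards 0
```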
How can we choose a good $\lambda$? Let's define:
 $J_{\text{train}}(\theta) = \cfrac{1}{m} \sum \text{cost}(h_{\theta}(x^{(i)}), y^{(i)})$ (same as $J(\theta)$, but without regularization)
 $J_{\text{cv}}(\theta)$ and $J_{\text{test}}(\theta)$ - the same, but on the cross-validation and test datasets respectively
Now we
 Choose a range of possible values for $\lambda$ (say $0, 0.01, 0.02, 0.04, ..., 10.24$) - that gives us 12 models to check
 For each $\lambda^{(i)}$,
 calculate $\theta^{(i)}$,
 calculate $J_{\text{cv}}(\theta^{(i)})$,
 and take the $\lambda^{(i)}$ with the lowest $J_{\text{cv}}(\theta^{(i)})$
After that we report the test error $J_{\text{test}}(\theta^{(i)})$ for the chosen model
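A sketch of that selection loop, again using the regularized normal equation as a stand-in for however $\theta^{(i)}$ is actually computed; the 60/20/20 split and the toy data are assumptions:

```python
import numpy as np

def fit_regularized(X, y, lam):
    """theta = (X^T X + lam * L)^-1 X^T y, with theta_0 not penalized."""
    L = np.eye(X.shape[1])
    L[0, 0] = 0.0
    return np.linalg.solve(X.T @ X + lam * L, X.T @ y)

def j_unregularized(theta, X, y):
    """J without the regularization term: (1 / 2m) * sum of (h(x) - y)^2."""
    return np.mean((X @ theta - y) ** 2) / 2

# toy polynomial data split 60/20/20 into train / cv / test
rng = np.random.default_rng(1)
x = rng.uniform(-1, 1, 100)
y = np.sin(2 * x) + 0.1 * rng.normal(size=100)
X = np.column_stack([x ** p for p in range(5)])
X_tr, y_tr = X[:60], y[:60]
X_cv, y_cv = X[60:80], y[60:80]
X_te, y_te = X[80:], y[80:]

lambdas = [0.0] + [0.01 * 2 ** k for k in range(11)]   # 0, 0.01, 0.02, ..., 10.24
thetas = {lam: fit_regularized(X_tr, y_tr, lam) for lam in lambdas}
best_lam = min(lambdas, key=lambda lam: j_unregularized(thetas[lam], X_cv, y_cv))
print(best_lam, j_unregularized(thetas[best_lam], X_te, y_te))   # report the test error
```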
K-Fold Cross-Validation
If we want to reduce the variability of the error estimate
 we can perform multiple rounds of crossvalidation using different partitions
 and then average the results over all the rounds
We're given a dataset $S$ sampled from the population $D$
 we partition $S$ into $K$ equal disjoint subsets $(T_1, ..., T_K)$ (typically 5-10 subsets)
 then perform $K$ steps, and at step $k$ do:
 use $R_k = S \setminus T_k$ as the training set
 build classifier $C_k$ using $R_k$
 use $T_k$ as the test set, compute error $\delta_k = \text{error}(C_k, T_k)$
 let $\delta^* = \cfrac{1}{K} \sum_{k = 1}^K \delta_k$
 this is the expected error rate
 note that there are only two subsets at each iteration
 and they are used to estimate the true error
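A minimal sketch of the procedure; `build_classifier` here is a toy majority-class learner just to make the example self-contained, any real classifier would plug into the same loop:

```python
import numpy as np

def build_classifier(X, y):
    """Toy 'classifier': always predict the majority class of the training set."""
    majority = np.bincount(y).argmax()
    return lambda X_new: np.full(len(X_new), majority)

def k_fold_error(X, y, K=10, seed=0):
    """Estimate the expected error rate with K-fold cross-validation."""
    idx = np.random.default_rng(seed).permutation(len(y))
    folds = np.array_split(idx, K)                             # T_1, ..., T_K
    errors = []
    for k in range(K):
        test_idx = folds[k]                                    # T_k
        train_idx = np.concatenate(folds[:k] + folds[k + 1:])  # R_k = S \ T_k
        clf = build_classifier(X[train_idx], y[train_idx])     # C_k
        errors.append(np.mean(clf(X[test_idx]) != y[test_idx]))  # delta_k
    return np.mean(errors)                                     # delta* = average error rate

# toy usage
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 2))
y = (X[:, 0] > 0).astype(int)
print(k_fold_error(X, y, K=5))
```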
Tuning Learning Parameter
Choosing the best value for parameter $\gamma$ with $K$-Fold Cross-Validation
 you have to put some data aside before training your classifier
 i.e. split your data into learning set and test set
 for every possible value of $\gamma$
 use $K$-fold Cross-Validation to estimate the expected error rate
 use the learning set as the set $S$ (and do $K$-fold Cross-Validation only on it)
 select the optimal value of $\gamma$
 and then use the test set for the final evaluation
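The same idea as before, but with the $K$-fold estimate computed on the learning set; as above, the small k-NN classifier with $\gamma$ = number of neighbours is my choice of example, not the notes':

```python
import numpy as np

def knn_error(X_train, y_train, X_eval, y_eval, k):
    """Error rate of a k-nearest-neighbours classifier built on (X_train, y_train)."""
    dists = np.linalg.norm(X_eval[:, None, :] - X_train[None, :, :], axis=2)
    nearest = np.argsort(dists, axis=1)[:, :k]
    preds = (y_train[nearest].mean(axis=1) > 0.5).astype(int)
    return np.mean(preds != y_eval)

def k_fold_error(X, y, gamma, K=5, seed=0):
    """K-fold estimate of the error rate for one value of the control parameter."""
    folds = np.array_split(np.random.default_rng(seed).permutation(len(y)), K)
    errs = []
    for k in range(K):
        te = folds[k]
        tr = np.concatenate(folds[:k] + folds[k + 1:])
        errs.append(knn_error(X[tr], y[tr], X[te], y[te], gamma))
    return np.mean(errs)

# put the test set aside first; run K-fold only on the learning set
rng = np.random.default_rng(1)
X = rng.normal(size=(300, 2))
y = (X[:, 0] + X[:, 1] > 0).astype(int)
X_learn, y_learn, X_test, y_test = X[:240], y[:240], X[240:], y[240:]

cv_errors = {g: k_fold_error(X_learn, y_learn, g) for g in [1, 3, 5, 7, 9, 11]}
best_gamma = min(cv_errors, key=cv_errors.get)
print(best_gamma, knn_error(X_learn, y_learn, X_test, y_test, best_gamma))  # final evaluation
```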
Stratified K-Fold Cross-Validation
What if we want to preserve the class distribution over $K$ runs?
 then for each $T_i$ we pick the same proportion of labels as in the original dataset

 this is very important if the class distribution is not uniform
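One way to sketch stratified fold construction: split the indices of each class separately and then merge (`stratified_folds` is my name, not a standard one):

```python
import numpy as np

def stratified_folds(y, K=5, seed=0):
    """Build K folds such that each fold keeps roughly the original label proportions."""
    rng = np.random.default_rng(seed)
    folds = [[] for _ in range(K)]
    for label in np.unique(y):
        idx = rng.permutation(np.flatnonzero(y == label))    # indices of this class, shuffled
        for k, chunk in enumerate(np.array_split(idx, K)):   # spread the class over all folds
            folds[k].extend(chunk)
    return [np.array(fold) for fold in folds]

# toy imbalanced labels: 90% of class 0, 10% of class 1
y = np.array([0] * 90 + [1] * 10)
for fold in stratified_folds(y, K=5):
    print(len(fold), np.mean(y[fold]))   # each fold keeps roughly 10% of class 1
```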