Overfitting
"Overfitting" (or "high variance") occurs when we have too many features: the learned hypothesis may
- fit the training set very well (with cost function $J(\theta) \approx 0$),
- but fail to generalize to new examples (make poor predictions on unseen data)
Generalization Error
Cross-Validation
The best way to see whether you overfit:
- split the data into training and test sets
- train the model on the training set
- evaluate the model on the training set
- evaluate the model on the test set
- generalization error: the difference between the two errors; it measures the ability to generalize
It’s clear that a model overfits when we plot both errors:
- we have low error on the training data, but high error on the test data
- we may perform a Machine Learning Diagnosis to see that
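A minimal sketch of this split/train/evaluate procedure, assuming scikit-learn; the synthetic dataset and the linear model here are placeholder assumptions, not from the source:

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(100, 1))            # synthetic feature
y = np.sin(X).ravel() + rng.normal(0, 0.2, 100)  # noisy target

# split the data into training and test sets
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0)

# train the model on the training set
model = LinearRegression().fit(X_train, y_train)

# evaluate on both sets
train_error = mean_squared_error(y_train, model.predict(X_train))
test_error = mean_squared_error(y_test, model.predict(X_test))

# generalization error: the gap between test and training error
print(train_error, test_error, test_error - train_error)
```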
High Variance vs High Bias
Generalization error can be decomposed into bias and variance
- bias: the tendency to consistently learn the same wrong thing
- variance: the tendency to learn random things irrespective of the input data
Dart-throwing illustration (from Domingos): bias is how far the darts land from the bull's-eye on average; variance is how scattered the throws are.
Underfitting
- high bias, low variance
- you’re always missing in the same way
example:
- always predict the same value
- completely insensitive to the data
- the variance is very low (essentially 0), but the bias is high: the prediction is simply wrong
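To make this concrete, here is a toy sketch (the constant predictor and the data are my assumptions, not from the course): trained on many resampled datasets, the model barely varies, yet it is systematically wrong:

```python
import numpy as np

rng = np.random.default_rng(1)

def train_constant_model(y_train):
    # "learns" only the mean of the targets, ignoring the features entirely
    return y_train.mean()

# predictions from models trained on many resampled training sets
preds = []
for _ in range(1000):
    x_train = rng.uniform(0, 1, 50)
    y_train = 3.0 * x_train + rng.normal(0, 0.1, 50)  # true relation: y = 3x
    preds.append(train_constant_model(y_train))

preds = np.array(preds)
print(preds.var())   # variance across datasets: tiny (close to 0)
print(preds.mean())  # ~1.5 regardless of x, so at x = 1 (true y = 3) it is badly biased
```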
Examples
Multivariate Linear Regression
Suppose we have a set of data
- We can fit the following Multivariate Linear Regression models:
- linear: $\theta_0 + \theta_1 x$, likely to underfit (high bias)
- quadratic: $\theta_0 + \theta_1 x + \theta_2 x^2$, likely to fit just right
- extreme: $\theta_0 + \theta_1 x + \theta_2 x^2 + \theta_3 x^3 + \theta_4 x^4$, likely to overfit (high variance)
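A sketch of fitting all three hypotheses with numpy on synthetic (assumed) data; `np.polyfit` does the least-squares fit:

```python
import numpy as np

rng = np.random.default_rng(2)
x = np.sort(rng.uniform(0, 4, 20))
y = 1.0 + 2.0 * x - 0.5 * x**2 + rng.normal(0, 0.3, 20)  # roughly quadratic

for degree in (1, 2, 4):
    theta = np.polyfit(x, y, degree)  # least-squares polynomial fit
    train_mse = np.mean((np.polyval(theta, x) - y) ** 2)
    print(degree, train_mse)
# training error always falls as the degree grows, but the degree-4
# model is starting to fit the noise: on held-out data it would do worse
```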
Logistic Regression
The same applies to Logistic Regression
- Suppose we have the following set of examples
- We may underfit with just a line
- $g(\theta_0 + \theta_1 x_1 + \theta_2 x_2)$
- We may fit just right, even if it misses some positive examples
- $g(\theta_0 + \theta_1 x_1 + \theta_2 x_2 + \theta_3 x_1^2 + \theta_4 x_2^2 + \theta_5 x_1 x_2)$
- Or we may overfit using a high-order polynomial model
- $g(\theta_0 + \theta_1 x_1 + \theta_2 x_2 + \theta_3 x_1^2 + \theta_4 x_2^2 + \theta_5 x_1 x_2 + \theta_6 x_1^2 x_2 + \theta_7 x_1 x_2^2 + \theta_8 x_1^2 x_2^2 + \theta_9 x_1^3 + …)$
The problem with it:
- such an overly high-order polynomial can fit almost anything, so it overfits, which results in high variance (a sketch follows)
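A classification sketch of the same progression, assuming scikit-learn: `PolynomialFeatures` generates the $x_1^i x_2^j$ terms that go into $g(\cdot)$ (the circular dataset is a synthetic stand-in):

```python
from sklearn.datasets import make_circles
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

# a dataset that no straight decision boundary can separate
X, y = make_circles(n_samples=200, noise=0.2, random_state=0)

for degree in (1, 2, 6):
    clf = make_pipeline(PolynomialFeatures(degree),
                        LogisticRegression(C=1e6, max_iter=5000))
    clf.fit(X, y)
    # training accuracy: degree 1 underfits, high degrees chase the noise
    print(degree, clf.score(X, y))
```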
Diagnosing
How to Diagnose the Problem
To identify overfitting, we can use Machine Learning Diagnosis:
- plotting the hypothesis against the data: doesn't work with many features
How to Address the Problem
- reducing the number of features
- manually select features to keep
- a Model Selection algorithm (chooses good features by itself)
- but it may turn out that some of the features we wanted to throw away are actually significant
- Principal Component Analysis
- Regularization
- keep all the features, but reduce the magnitude of the parameters $\theta_j$ (see the sketch after this list)
- Cross-Validation
- test your hypotheses on a cross-validation set
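A sketch of the regularization option above, assuming scikit-learn: Ridge keeps all the polynomial features but adds an L2 penalty on $\theta$, shrinking the coefficients (data and `alpha` are illustrative assumptions):

```python
import numpy as np
from sklearn.linear_model import LinearRegression, Ridge
from sklearn.preprocessing import PolynomialFeatures

rng = np.random.default_rng(3)
x = rng.uniform(0, 4, size=(20, 1))
y = 1.0 + 2.0 * x.ravel() + rng.normal(0, 0.3, 20)

# many polynomial features: plenty of room to overfit
X_poly = PolynomialFeatures(degree=6, include_bias=False).fit_transform(x)

plain = LinearRegression().fit(X_poly, y)
regularized = Ridge(alpha=1.0).fit(X_poly, y)  # L2 penalty on the parameters

# regularization keeps all features but shrinks the coefficients
print(np.abs(plain.coef_).max())
print(np.abs(regularized.coef_).max())
```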
Sources
- Machine Learning (coursera)
- Data Mining (UFRT)
- Introduction to Data Science (coursera)
- Domingos, Pedro. "A Few Useful Things to Know About Machine Learning." http://homes.cs.washington.edu/~pedrod/papers/cacm12.pdf