True Error of Model


What do we do when we want to know how accurately the model will perform in practice?

Given:

Problem

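the sample error of $C$ calculated on sample $S$ is the fraction of instances in $S$ that $C$ misclassifies:
$\text{error}(C, S) = \frac{1}{|S|} \sum_{(x,y) \in S} \delta(C(x) \ne y)$
- $\delta(\cdot)$ is 1 if its argument holds and 0 otherwise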

But usually we have training and testing sets (see Cross-Validation)

i.e. we have some data set $S$ (drawn from the population with distribution $P$)
learning (training) set $R \subset S$,
test set $T \subset S$,
$R$ and $T$ are disjoint: $R \cap T = \varnothing$
so the sample error is computed against $T$: $\text{error}(C, T)$
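
The split and the error computation above can be sketched as follows. This is a minimal illustration, assuming the classifier is a plain Python callable and the data are `(x, y)` pairs; the names `sample_error` and `split_learning_test`, the toy data, and the fixed-threshold classifier are all hypothetical and not taken from any library.

```python
import random

def sample_error(classifier, sample):
    """Fraction of (x, y) pairs in `sample` that `classifier` misclassifies."""
    mistakes = sum(1 for x, y in sample if classifier(x) != y)
    return mistakes / len(sample)

def split_learning_test(data, test_fraction=0.3, seed=0):
    """Shuffle `data` and split it into two disjoint sets R and T."""
    shuffled = list(data)
    random.Random(seed).shuffle(shuffled)
    n_test = int(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]  # (R, T)

# Toy data set S: the label is True when x is positive.
S = [(x, x > 0) for x in range(-10, 10)]
R, T = split_learning_test(S)

# Hypothetical classifier with a slightly wrong threshold (no fitting shown).
C = lambda x: x > 2

# error(C, T): computed only on the held-out set T, never on R.
print(sample_error(C, T))
```

Because $R$ and $T$ come from slicing one shuffled list, they are disjoint by construction.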

the true error of $C$ w.r.t. the distribution $P$ on the population $D$ is the probability of misclassifying an instance drawn from $D$ at random
$\text{error}(C, D) = \sum_{(x,y) \in D} P(x, y) \cdot \delta(C(x) \ne y)$
- $P(x, y)$ is the probability of drawing a pair $(x,y) \in D$
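
To make the formula concrete, here is a small sketch under a strong assumption: the population is finite and $P$ is fully known, which is almost never the case in practice. The distribution `P`, the classifier `C`, and the helper names `true_error` and `draw_sample` are made up for illustration. The sketch computes the true error exactly and compares it with the sample error on a random draw from the same distribution.

```python
import random

def true_error(classifier, distribution):
    """Sum P(x, y) over the pairs (x, y) that `classifier` gets wrong."""
    return sum(p for (x, y), p in distribution.items() if classifier(x) != y)

def draw_sample(distribution, n, seed=0):
    """Draw n pairs (x, y) independently according to `distribution`."""
    pairs = list(distribution)
    weights = [distribution[pair] for pair in pairs]
    return random.Random(seed).choices(pairs, weights=weights, k=n)

# Hypothetical population D with known probabilities P(x, y).
P = {(0, 0): 0.4, (0, 1): 0.1, (1, 1): 0.3, (1, 0): 0.2}

# Hypothetical classifier: predict y = x.
C = lambda x: x

# Misclassified pairs are (0, 1) and (1, 0), so the true error is 0.1 + 0.2.
exact = true_error(C, P)

# Sample error on a large random sample, used as an estimate of the true error.
sample = draw_sample(P, n=20000)
estimate = sum(1 for x, y in sample if C(x) != y) / len(sample)

print(exact, estimate)
```

With a large sample the two numbers should be close, which is why the sample error on held-out data is used as an estimate when $P$ is unknown.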

Estimate of $\text{error}(C, D)$

More accurate estimates:
