Comparing Learning Algorithms
Comparing Classifiers
There are many ways to compare two classifiers, for example:
- Contingency Tables
- Cost Matrix
- best accuracy or $F_1$-score (see Evaluation of Binary Classifiers)
- best True Error of Model
- best Cumulative Gain Chart
- other things
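For illustration, here is a tiny sketch (with hypothetical counts, not data from this article) that computes accuracy and $F_1$-score from a 2×2 contingency table:

```python
# Minimal sketch: accuracy and F1 from a 2x2 contingency table.
# The counts below are made up purely for illustration.
tp, fp, fn, tn = 40, 10, 5, 45

accuracy = (tp + tn) / (tp + fp + fn + tn)
precision = tp / (tp + fp)
recall = tp / (tp + fn)
f1 = 2 * precision * recall / (precision + recall)

print(f"accuracy = {accuracy:.3f}, F1 = {f1:.3f}")
```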
Comparing Algorithms
What if we want to compare
- two learning algorithms $A_1$ and $A_2$
- not just two specific classifiers $C_1$ and $C_2$
We want to measure the expected difference in error:
- for a training set $R \subset D$, where $D$ is the population, we want to estimate
- $E \Big[ \text{error} \big(A_1(R), D \big) - \text{error} \big(A_2(R), D \big) \Big]$
- with $\text{error} \big(A_1(R), D \big)$ being the true error
How to estimate it?
- average the results over many different training sets
- ideally, these training sets should be independent of each other
- in practice, we usually use $K$-Fold Cross-Validation
$K$-Fold Cross-Validation
Estimation:
- given a data set $S$
- for $k = 1..K$,
- let $T_k$ be the $k$-th fold, used as the test set,
- and $R_k = S - T_k$ be the training set
- train $C_1 = A_1(R_k)$ and $C_2 = A_2(R_k)$
- let $\delta_k = \text{error}(C_1, T_k) - \text{error}(C_2, T_k)$
- let $\delta^* = \cfrac{1}{K} \sum_{k = 1}^K \delta_k$
- this is the estimate of the expected error $E \Big[ \text{error} \big(A_1(R), D \big) - \text{error} \big(A_2(R), D \big) \Big]$
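A minimal sketch of this procedure, assuming scikit-learn estimators and NumPy arrays; the two learner factories and the synthetic dataset are placeholders for illustration, not part of the original method:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold
from sklearn.tree import DecisionTreeClassifier

def cv_error_differences(A1, A2, X, y, K=10, seed=0):
    """Per-fold differences delta_k = error(A1(R_k), T_k) - error(A2(R_k), T_k)."""
    deltas = []
    for train_idx, test_idx in KFold(n_splits=K, shuffle=True, random_state=seed).split(X):
        # R_k = S - T_k is the training part, T_k the held-out fold
        C1 = A1().fit(X[train_idx], y[train_idx])
        C2 = A2().fit(X[train_idx], y[train_idx])
        err1 = np.mean(C1.predict(X[test_idx]) != y[test_idx])
        err2 = np.mean(C2.predict(X[test_idx]) != y[test_idx])
        deltas.append(err1 - err2)
    return np.array(deltas)

# synthetic data just to make the sketch runnable
X, y = make_classification(n_samples=500, random_state=0)
deltas = cv_error_differences(lambda: LogisticRegression(max_iter=1000),
                              lambda: DecisionTreeClassifier(random_state=0),
                              X, y)
delta_star = deltas.mean()  # estimate of E[error(A1(R), D) - error(A2(R), D)]
```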
$K$-Fold CV Paired $t$-Test
Let’s conduct a statistical test of significance:
- assume (under the null hypothesis) that $A_1$ and $A_2$ have equal expected error rates
- $t = \cfrac{\delta^*}{\sigma}$ follows the Student distribution with $K-1$ degrees of freedom
- $\delta^* = \cfrac{1}{K} \sum_{k = 1}^K \delta_k$ is the estimate of the expected difference in error
- $\sigma = \sqrt{ \cfrac{1}{K \cdot (K - 1)} \sum_{k = 1}^K (\delta_k - \delta^*)^2 }$
Test:
- the null hypothesis $H_0$: $A_1$ and $A_2$ have equal expected error rates
- $H_0$ is accepted with $(1 - \alpha) \cdot 100\%$ confidence
- if $|t| < t_{1 - \alpha / 2, K - 1}$, where $t_{1 - \alpha / 2, K - 1}$ is the $1 - \alpha / 2$ percentile of the Student distribution with $K - 1$ degrees of freedom
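Putting the pieces together, a minimal sketch of the whole test (assuming NumPy and SciPy; `paired_cv_t_test` is a hypothetical helper name, and `deltas` comes from a procedure like the one above):

```python
import numpy as np
from scipy import stats

def paired_cv_t_test(deltas, alpha=0.05):
    """K-fold CV paired t-test on the per-fold error differences delta_k."""
    K = len(deltas)
    delta_star = np.mean(deltas)
    # sigma is the standard error of delta_star, as defined above
    sigma = np.sqrt(np.sum((deltas - delta_star) ** 2) / (K * (K - 1)))
    t = delta_star / sigma
    t_crit = stats.t.ppf(1 - alpha / 2, df=K - 1)  # t_{1-alpha/2, K-1}
    reject = abs(t) >= t_crit                      # reject H0 if |t| is too large
    return t, t_crit, reject
```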
Examples
Example 1
| $k$ | 1 | 2 | 3 | 4 | 5 |
|---|---|---|---|---|---|
| $\text{error}(A_1, T_k)$ | 0.12 | 0.15 | 0.14 | 0.16 | 0.11 |
| $\text{error}(A_2, T_k)$ | 0.13 | 0.16 | 0.13 | 0.14 | 0.17 |
| $\delta_k$ | -0.01 | -0.01 | 0.01 | 0.02 | -0.06 |
| $(\delta_k - \delta^*)^2$ with $\delta^* = -0.01$ | 0.0000 | 0.0000 | 0.0004 | 0.0009 | 0.0025 |

With $\sigma = \sqrt{0.0038 / (5 \cdot 4)} \approx 0.0138$, this gives $\cfrac{\delta^*}{\sigma} \approx -0.73$. So,
- $\left| \cfrac{\delta^*}{\sigma} \right| = 0.73 < 2.78 = t_{0.975, 4}$
- $\Rightarrow$ the hypothesis $H_0$ is accepted with 95% confidence
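These numbers can be verified with a short NumPy snippet (a sketch, using the error rates from the table above):

```python
import numpy as np

err1 = np.array([0.12, 0.15, 0.14, 0.16, 0.11])
err2 = np.array([0.13, 0.16, 0.13, 0.14, 0.17])
deltas = err1 - err2                 # [-0.01, -0.01, 0.01, 0.02, -0.06]
delta_star = deltas.mean()           # -0.01
sigma = np.sqrt(np.sum((deltas - delta_star) ** 2) / (5 * 4))
print(abs(delta_star / sigma))       # ~0.73, below the critical value 2.78
```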
Example 2
- given two algorithms $A_1$ and $A_2$
- 10 rounds are performed
Obtained error rates:

| $k$ | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 |
|---|---|---|---|---|---|---|---|---|---|---|
| $A_1$ | 0.305 | 0.322 | 0.207 | 0.206 | 0.31 | 0.41 | 0.277 | 0.26 | 0.215 | 0.26 |
| $A_2$ | 0.224 | 0.145 | 0.224 | 0.196 | 0.207 | 0.204 | 0.221 | 0.194 | 0.162 | 0.35 |

So we calculate and obtain the following table:
| $k$ | $A_1$ | $A_2$ | $\delta_k$ | $(\delta_k - \delta^*)^2$ |
|---|---|---|---|---|
| 1 | 0.305 | 0.224 | 0.081 | 0.00027225 |
| 2 | 0.322 | 0.145 | 0.177 | 0.01265625 |
| 3 | 0.207 | 0.224 | -0.017 | 0.00664225 |
| 4 | 0.206 | 0.196 | 0.01 | 0.00297025 |
| 5 | 0.31 | 0.207 | 0.103 | 0.00148225 |
| 6 | 0.41 | 0.204 | 0.206 | 0.02002225 |
| 7 | 0.277 | 0.221 | 0.056 | 0.00007225 |
| 8 | 0.26 | 0.194 | 0.066 | 0.00000225 |
| 9 | 0.215 | 0.162 | 0.053 | 0.00013225 |
| 10 | 0.26 | 0.35 | -0.09 | 0.02387025 |

From the table:
- $\delta^* = \cfrac{1}{10} \sum \delta_k = 0.0645$
- $\sigma^2 = \cfrac{1}{10 \cdot 9} \sum (\delta_k - \delta^*)^2 = 0.0007569167$, so $\sigma \approx 0.0275$
- $t = \cfrac{\delta^*}{\sigma} \approx 2.34$
- we have 9 degrees of freedom
- for significance level $\alpha = 0.05$ $\Rightarrow$ we need the $1 - \alpha/2 = 0.975$ percentile
- look up the critical value: $t_{0.975, 9} = 2.26$
- $|t| = 2.34 > 2.26 = t_{0.975, 9}$
- $\Rightarrow$ the hypothesis that $A_1$ and $A_2$ have the same error rate is rejected
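The same computation for Example 2 as a runnable sketch (`scipy.stats.t.ppf` provides the $t_{0.975, 9}$ critical value):

```python
import numpy as np
from scipy import stats

err1 = np.array([0.305, 0.322, 0.207, 0.206, 0.31, 0.41, 0.277, 0.26, 0.215, 0.26])
err2 = np.array([0.224, 0.145, 0.224, 0.196, 0.207, 0.204, 0.221, 0.194, 0.162, 0.35])
deltas = err1 - err2
delta_star = deltas.mean()                              # 0.0645
sigma = np.sqrt(np.sum((deltas - delta_star) ** 2) / (10 * 9))
t = delta_star / sigma                                  # ~2.34
t_crit = stats.t.ppf(0.975, df=9)                       # ~2.26
print(abs(t) >= t_crit)                                 # True: reject H0
```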