Evaluation of Binary Classifiers
Evaluation is important: we need a way to measure how well a classifier actually performs
Baseline
For evaluating a classifier we first need to set a baseline:
 base rate
 accuracy of a trivial classifier
 the one that always predicts the majority class
 random rate
 accuracy of a random guess
 we need some domain knowledge to choose the distribution the random guesses are drawn from
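A minimal sketch of both baselines, assuming the labels are 0/1 lists (the function names are mine, not from any library):

```python
from collections import Counter
import random

def base_rate(y):
    """Accuracy of a trivial classifier that always predicts the majority class."""
    majority_count = Counter(y).most_common(1)[0][1]
    return majority_count / len(y)

def random_rate(y, p_positive=0.5, seed=0):
    """Accuracy of guessing 1 with probability p_positive (domain knowledge
    would tell us which p_positive is a sensible choice)."""
    rng = random.Random(seed)
    guesses = [1 if rng.random() < p_positive else 0 for _ in y]
    return sum(g == t for g, t in zip(guesses, y)) / len(y)

y = [0] * 99 + [1]          # heavily skewed labels
print(base_rate(y))         # 0.99
```

On skewed labels like these, the base rate is already very high, which is exactly why accuracy alone can be misleading.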
Skewed Classes
 Suppose we have a binary classifier, e.g. cancer prediction.
 We built some classification model $h_{\theta}(x)$
 if we have $h_{\theta}(x) = 1$, we predict cancer, and if $h_{\theta}(x) = 0$, we predict no cancer.
 Then we find out that our classifier has 1% error on the test set: 99% of cases were diagnosed correctly
 But now suppose only 0.5% of patients have cancer
 This is a skewed class - one class is only a tiny portion of the other
 We would predict better by always returning 0 (by using the trivial classifier)
 (we'll have 0.5% error which is better than 1%)
 $\Rightarrow$ We need different evaluation metrics, not just error rate
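A quick sanity check of the numbers above (the total of 10,000 patients is an arbitrary illustration, not from the source):

```python
# 0.5% of 10,000 patients have cancer (hypothetical patient count)
n = 10_000
n_positive = 50                   # 0.5% prevalence

# our model has a 1% error rate
model_error = 0.01

# the trivial classifier h(x) = 0 is wrong only on the positives
trivial_error = n_positive / n

print(model_error, trivial_error)  # 0.01 0.005
```

The trivial classifier's 0.5% error beats the model's 1%, even though it detects nobody.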
Confusion Matrix
The confusion matrix is a $2 \times 2$ contingency table
 We divide our correct predictions and mispredictions into the cells of this matrix
Diagnostic Testing Measures [1]

| | Actual class $y$: Positive | Actual class $y$: Negative | |
|---|---|---|---|
| $h_{\theta}(x)$: Test outcome positive | True positive ($\text{TP}$) | False positive ($\text{FP}$, Type I error) | Precision = $\cfrac{\# \text{TP}}{\# \text{TP} + \# \text{FP}}$ |
| $h_{\theta}(x)$: Test outcome negative | False negative ($\text{FN}$, Type II error) | True negative ($\text{TN}$) | Negative predictive value = $\cfrac{\# \text{TN}}{\# \text{FN} + \# \text{TN}}$ |
| | Sensitivity = $\cfrac{\# \text{TP}}{\# \text{TP} + \# \text{FN}}$ | Specificity = $\cfrac{\# \text{TN}}{\# \text{FP} + \# \text{TN}}$ | Accuracy = $\cfrac{\# \text{TP} + \# \text{TN}}{\# \text{TOTAL}}$ |

Main values of this matrix:
 True Positive - we predicted "+" and the true class is "+"
 True Negative - we predicted "-" and the true class is "-"
 False Positive - we predicted "+" and the true class is "-" (Type I error)
 False Negative - we predicted "-" and the true class is "+" (Type II error)
 (see also Statistical Tests of Significance#Type I and Type II Errors)
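These four counts can be computed directly from paired label lists; `confusion_counts` is a hypothetical helper, not a library function:

```python
def confusion_counts(y_true, y_pred):
    """Count TP, FP, FN, TN for binary labels (1 = positive, 0 = negative)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    return tp, fp, fn, tn

y_true = [1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 1, 0]
print(confusion_counts(y_true, y_pred))  # (2, 1, 1, 2)
```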
The following measures can be calculated:
 Accuracy
 Misclassification Error (or Error Rate)
 Positive predictive value (or precision)
 $P = \cfrac{\text{TP}}{\text{TP} + \text{FP}}$
 Negative predictive value
 True Positive Rate (also Sensitivity or Recall)
 Fraction of positive examples correctly classified
 $\text{tpr} = \cfrac{\text{TP}}{\text{TP} + \text{FN}}$
 False Positive Rate (also Fall-Out)
 Fraction of negative examples incorrectly classified
 $\text{fpr} = \cfrac{\text{FP}}{\text{FP} + \text{TN}}$
 Specificity
 Support - fraction of positively classified examples
 $\text{sup} = \cfrac{\text{TP} + \text{FP}}{N} = \cfrac{\text{predicted pos}}{\text{total}}$
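All of these measures follow mechanically from the four confusion-matrix counts. A sketch (the dictionary keys are my own naming):

```python
def diagnostic_measures(tp, fp, fn, tn):
    """Derive the measures above from the four confusion-matrix counts."""
    total = tp + fp + fn + tn
    return {
        "accuracy":    (tp + tn) / total,
        "error_rate":  (fp + fn) / total,      # = 1 - accuracy
        "precision":   tp / (tp + fp),         # positive predictive value
        "npv":         tn / (fn + tn),         # negative predictive value
        "tpr":         tp / (tp + fn),         # sensitivity / recall
        "fpr":         fp / (fp + tn),         # fall-out
        "specificity": tn / (fp + tn),         # = 1 - fpr
        "support":     (tp + fp) / total,      # fraction predicted positive
    }

m = diagnostic_measures(tp=2, fp=1, fn=1, tn=2)
print(round(m["accuracy"], 2), round(m["precision"], 2))  # 0.67 0.67
```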
Accuracy and Error Rate
In practice, these are the most widely used metrics
 Accuracy: $\text{acc} = \cfrac{TP + TN}{N}$
 fraction of correctly classified examples
 Error Rate: $\text{error} = \cfrac{FN + FP}{N} = 1 - \text{acc}$
 Fraction of misclassified examples
Precision
For all inputs where we predicted $h_{\theta}(x) = 1$, what fraction actually has $y = 1$?
$P = \text{Precision} = \cfrac{\text{# TP}}{\text{# predicted positives}} = \cfrac{\text{# TP}}{\text{# TP} + \text{# FP}}$
 Out of all the people we thought had cancer, how many actually had it?
 High precision is good
 we don't tell many people that they have cancer when they actually don't
Recall
For all input data that actually have $y = 1$, what fraction did we correctly detect as $h_{\theta}(x) = 1$?
$R = \text{Recall} = \cfrac{\text{# TP}}{\text{# actual positives}} = \cfrac{\text{# TP}}{\text{# TP + # FN}}$
 Out of all the people who actually have cancer, how many did we identify?
 The higher the better:
 We don't fail to spot many people that actually have cancer
 For a classifier that always returns zero (i.e. $h_{\theta}(x) = 0$) the Recall would be zero
 So recall is a more useful evaluation metric for skewed classes
 and it gives us more confidence that the classifier actually detects the positive class
The F Measure is a combination of Precision and Recall: $F_1 = 2 \cdot \cfrac{P \cdot R}{P + R}$
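The three quantities can be sketched together; the counts below are the bowel-cancer numbers from the example that follows:

```python
def precision_recall_f1(tp, fp, fn):
    """Precision, recall, and their harmonic mean (the F1 measure)."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

# note: for the trivial classifier h(x) = 0 we get tp = 0, so recall = 0
p, r, f1 = precision_recall_f1(tp=20, fp=180, fn=10)
print(round(p, 2), round(r, 2), round(f1, 2))  # 0.1 0.67 0.17
```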
Example
Diagnostic Testing Wikipedia Example [2]

| Fecal Occult Blood Screen Test Outcome | Patients with bowel cancer (as confirmed on endoscopy): Positive | Patients with bowel cancer: Negative | |
|---|---|---|---|
| Test Outcome Positive | True Positive (TP) = 20 | False Positive (FP) = 180 | Positive predictive value = TP / (TP + FP) = 20 / (20 + 180) = 10% |
| Test Outcome Negative | False Negative (FN) = 10 | True Negative (TN) = 1820 | Negative predictive value = TN / (FN + TN) = 1820 / (10 + 1820) ≈ 99.5% |
| | Sensitivity = TP / (TP + FN) = 20 / (20 + 10) ≈ 67% | Specificity = TN / (FP + TN) = 1820 / (180 + 1820) = 91% | |
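The numbers in the table can be verified directly:

```python
# counts from the bowel-cancer screening example
TP, FP, FN, TN = 20, 180, 10, 1820

ppv = TP / (TP + FP)            # positive predictive value
npv = TN / (FN + TN)            # negative predictive value
sensitivity = TP / (TP + FN)
specificity = TN / (FP + TN)

print(f"{ppv:.1%} {npv:.1%} {sensitivity:.0%} {specificity:.0%}")
# 10.0% 99.5% 67% 91%
```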

Visual Analysis
Visual ways of evaluating the performance of a classifier
Non-Binary Classifiers
When we have multiclass classifiers we can use:
 Contingency Table
 just show misclassified examples side-by-side
 Cost Matrix
 we define the cost for each misclassification
 and calculate the total cost
 some measures (e.g. precision and recall) can be extended to multiclass classifiers by averaging per-class values
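The cost-matrix idea can be sketched for a 3-class problem; the cost values below are purely hypothetical:

```python
# cost[i][j] is the (hypothetical) cost of predicting class j
# when the true class is i; correct predictions cost 0
cost = [
    [0, 1, 2],
    [5, 0, 1],
    [10, 5, 0],
]

y_true = [0, 1, 2, 2, 1]
y_pred = [0, 2, 2, 0, 1]

# total cost of all misclassifications
total_cost = sum(cost[t][p] for t, p in zip(y_true, y_pred))
print(total_cost)  # 11
```

An asymmetric cost matrix like this one encodes that some mistakes (here, predicting class 0 for a true class 2) are much worse than others.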
See Also
Sources