Evaluation of Binary Classifiers

Evaluation is important: we need a way to measure how well a classifier actually performs.


Baseline

To evaluate a classifier we first need a baseline to compare against (a minimal sketch follows this list)

  • base rate
    • accuracy of the trivial classifier
    • the one that always predicts the majority class
  • random rate
    • accuracy of random guessing
    • we may need some domain knowledge to choose the distribution to draw the random guesses from
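For example, the base rate is just the frequency of the majority class. A minimal sketch in plain Python (the function name and the data are only illustrative):

```python
from collections import Counter

def base_rate(y):
    """Accuracy of the trivial classifier that always predicts the majority class."""
    counts = Counter(y)
    return max(counts.values()) / len(y)

# 90 negative and 10 positive labels: the majority-class classifier is right 90% of the time
y = [0] * 90 + [1] * 10
print(base_rate(y))  # 0.9
```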


Skewed Classes

  • Suppose we have a binary classifier, e.g. cancer prediction.
    • We built some classification model $h_{\theta}(x)$
    • if we have $h_{\theta}(x) = 1$, we predict cancer, and if $h_{\theta}(x) = 0$, we predict no cancer.
  • Now suppose we find that our classifier makes 1% errors on the test set, i.e. 99% of patients are diagnosed correctly
    • So the error rate is 1%
  • But suppose only 0.5% of patients actually have cancer
    • This is a skewed class: the positive class is only a tiny fraction of the data
  • Then we would score better by always returning 0, i.e. by using the trivial classifier (see the sketch after this list)
    • it would have only 0.5% error, which looks better than 1%
  • $\Rightarrow$ We need different evaluation metrics, not just error rate
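A small illustration of this "accuracy paradox", assuming a hypothetical test set of 1000 patients with 0.5% cancer prevalence:

```python
n = 1000
n_pos = 5                 # 0.5% of patients actually have cancer

model_errors = 10         # our model h_theta(x) misclassifies 1% of the test set
trivial_errors = n_pos    # h(x) = 0 is wrong only on the actual positives

print(model_errors / n)   # 0.01  -> 1% error rate
print(trivial_errors / n) # 0.005 -> 0.5% error rate: lower, yet the classifier is useless
```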


Confusion Matrix

The confusion matrix is a $2 \times 2$ Contingency Table

  • We split our predictions into its four cells according to the predicted class and the actual class (a counting sketch follows the table below)


Diagnostic Testing Measures [1]

| $h_{\theta}(x)$ | Actual class $y$: Positive | Actual class $y$: Negative | |
|---|---|---|---|
| Test outcome positive | True positive ($\text{TP}$) | False positive ($\text{FP}$, Type I error) | Precision = $\cfrac{\# \text{TP}}{\# \text{TP} + \# \text{FP}}$ |
| Test outcome negative | False negative ($\text{FN}$, Type II error) | True negative ($\text{TN}$) | Negative predictive value = $\cfrac{\# \text{TN}}{\# \text{FN} + \# \text{TN}}$ |
| | Sensitivity = $\cfrac{\# \text{TP}}{\# \text{TP} + \# \text{FN}}$ | Specificity = $\cfrac{\# \text{TN}}{\# \text{FP} + \# \text{TN}}$ | Accuracy = $\cfrac{\# \text{TP} + \# \text{TN}}{\# \text{TOTAL}}$ |
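Counting the four cells from predictions is straightforward. A minimal sketch in plain Python (the labels and predictions are made up):

```python
def confusion_counts(y_true, y_pred):
    """Return (TP, FP, FN, TN) for a binary problem where 1 is the positive class."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    return tp, fp, fn, tn

y_true = [1, 0, 1, 1, 0, 0, 0, 1]
y_pred = [1, 0, 0, 1, 0, 1, 0, 1]
print(confusion_counts(y_true, y_pred))  # (3, 1, 1, 3)
```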


Main values of this matrix:

  • True Positive - we predicted "+" and the true class is "+"
  • True Negative - we predicted "-" and the true class is "-"
  • False Positive - we predicted "+" and the true class is "-" (Type I error)
  • False Negative - we predicted "-" and the true class is "+" (Type II error)
  • (see also Statistical Tests of Significance#Type I and Type II Errors)


The following measures can be calculated from these counts (computed in the sketch after this list):

  • Accuracy
  • Misclassification Error (or Error Rate)
  • Positive predictive value (or precision)
    • $P = \cfrac{\text{TP}}{\text{TP} + \text{FP}}$
  • Negative predictive value
  • True Positive Rate (also Sensitivity or Recall)
    • Fraction of positive examples correctly classified
    • $\text{tpr} = \cfrac{\text{TP}}{\text{TP} + \text{FN}}$
  • False Positive Rate (also Fall-Out)
    • Fraction of negative examples incorrectly classified
    • $\text{fpr} = \cfrac{\text{FP}}{\text{FP} + \text{TN}}$
  • Specificity
  • Support - fraction of positively classified examples
    • $\text{sup} = \cfrac{\text{TP} + \text{FP}}{N} = \cfrac{\text{predicted pos}}{\text{total}}$
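All of these follow directly from the four counts. A sketch, with hypothetical counts plugged in at the end:

```python
def classification_measures(tp, fp, fn, tn):
    n = tp + fp + fn + tn
    return {
        "accuracy":    (tp + tn) / n,
        "error_rate":  (fn + fp) / n,
        "precision":   tp / (tp + fp),   # positive predictive value
        "npv":         tn / (fn + tn),   # negative predictive value
        "tpr":         tp / (tp + fn),   # sensitivity / recall
        "fpr":         fp / (fp + tn),   # fall-out
        "specificity": tn / (fp + tn),
        "support":     (tp + fp) / n,    # fraction predicted positive
    }

print(classification_measures(tp=40, fp=10, fn=5, tn=45))
```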


Accuracy and Error Rate

In practice, these are the most widely used metrics

  • Accuracy: $\text{acc} = \cfrac{\text{TP} + \text{TN}}{N}$
    • fraction of correctly classified examples
  • Error Rate: $\text{error} = \cfrac{\text{FN} + \text{FP}}{N} = 1 - \text{acc}$
    • fraction of misclassified examples


Precision

For all input data for which we predicted $h_{\theta}(x) = 1$, what fraction actually have $y = 1$?

$P = \text{Precision} = \cfrac{\text{# TP}}{\text{# predicted positives}} = \cfrac{\text{# TP}}{\text{# TP} + \text{# FP}}$

  • Out of all the people we thought had cancer, how many actually had it?
  • High precision is good:
  • it means we don't tell many people that they have cancer when they actually don't


Recall

For all input data that actually have $y = 1$, what fraction did we correctly detect as $h_{\theta}(x) = 1$?

$R = \text{Recall} = \cfrac{\text{# TP}}{\text{# actual positives}} = \cfrac{\text{# TP}}{\text{# TP + # FN}}$

  • Out of all the people that actually have cancer, how many did we identify?
  • The higher the better:
  • it means we don't fail to spot many people that actually have cancer


  • For a classifier that always returns zero (i.e. $h_{\theta}(x) = 0$) the recall would be zero
  • This makes recall a more useful evaluation metric for skewed classes: the trivial classifier cannot score well on it
  • So a classifier with high recall gives us much more confidence that it really detects the positive class


The F Measure is a combination of Precision and Recall
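A short sketch that puts precision, recall, and the F1 score (the harmonic mean of the two, one common F measure) side by side; the counts are made up:

```python
def precision_recall_f1(tp, fp, fn):
    precision = tp / (tp + fp)   # of all predicted positives, how many are real positives
    recall    = tp / (tp + fn)   # of all actual positives, how many did we find
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

print(precision_recall_f1(tp=30, fp=20, fn=10))  # (0.6, 0.75, 0.666...)
```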


Example

Diagnostic Testing Wikipedia Example [2]: a Fecal Occult Blood screen test for bowel cancer (as confirmed on endoscopy)

| Fecal Occult Blood Screen | Cancer: Positive | Cancer: Negative | |
|---|---|---|---|
| Test outcome positive | True Positive (TP) = 20 | False Positive (FP) = 180 | Positive predictive value = TP / (TP + FP) = 20 / (20 + 180) = 10% |
| Test outcome negative | False Negative (FN) = 10 | True Negative (TN) = 1820 | Negative predictive value = TN / (FN + TN) = 1820 / (10 + 1820) ≈ 99.5% |
| | Sensitivity = TP / (TP + FN) = 20 / (20 + 10) ≈ 67% | Specificity = TN / (FP + TN) = 1820 / (180 + 1820) = 91% | |
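The values in the table can be reproduced directly from the four counts:

```python
tp, fp, fn, tn = 20, 180, 10, 1820

print(tp / (tp + fp))  # positive predictive value: 0.10
print(tn / (fn + tn))  # negative predictive value: ~0.995
print(tp / (tp + fn))  # sensitivity: ~0.67
print(tn / (fp + tn))  # specificity: 0.91
```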



Visual Analysis

Visual ways of evaluating the performance of a classifier


Non-Binary Classifiers

When we have multi-class classifiers we can use:


See Also


Sources
