ML Wiki

Evaluation of Binary Classifiers

Evaluation is important:

Baseline

To evaluate a classifier we first need a baseline to compare it against

• Base rate
• accuracy of a trivial classifier
• i.e. the one that always predicts the majority class
• Random rate
• accuracy of a random guess
• choosing the distribution to guess from requires some domain knowledge
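Both baselines can be computed directly from the labels. The sketch below is only an illustration: the function names `base_rate` and `random_rate` and the toy label list are assumptions made here, and the random rate assumes guesses are drawn independently of the true label, by default from the empirical class distribution.

```python
from collections import Counter


def base_rate(labels):
    """Accuracy of the trivial classifier that always predicts the majority class."""
    counts = Counter(labels)
    return max(counts.values()) / len(labels)


def random_rate(labels, guess_probs=None):
    """Expected accuracy of guessing a class at random.

    guess_probs maps class -> probability of guessing that class.
    If omitted, we assume guesses follow the empirical class distribution.
    """
    n = len(labels)
    class_probs = {c: k / n for c, k in Counter(labels).items()}
    if guess_probs is None:
        guess_probs = class_probs
    # A guess is correct when the guessed class equals the true class.
    return sum(p * guess_probs.get(c, 0.0) for c, p in class_probs.items())


# Hypothetical toy data: 90 negatives, 10 positives.
labels = [0] * 90 + [1] * 10

majority_baseline = base_rate(labels)                     # always predict 0
random_baseline = random_rate(labels)                     # guess by class frequency
uniform_baseline = random_rate(labels, {0: 0.5, 1: 0.5})  # coin flip

print(majority_baseline, random_baseline, uniform_baseline)
```

A classifier is only interesting if it beats both numbers. Which guessing distribution is the right one to compare against depends on the problem.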

Skewed Classes

• Suppose we have a binary classifier, e.g. for cancer prediction.
• We build some classification model $h_{\theta}(x)$
• If $h_{\theta}(x) = 1$, we predict cancer, and if $h_{\theta}(x) = 0$, we predict no cancer.
• On the test set the classifier makes 1% errors, i.e. 99% of patients are diagnosed correctly
• So the error rate is 1%
• But now suppose only 0.5% of patients have cancer
• This is a skewed class: the positive class is a tiny fraction of the data compared to the negative class
• The trivial classifier that always returns 0 would then have a lower error rate
• (it is wrong only on the 0.5% of patients with cancer, which is better than 1%)
• $\Rightarrow$ We need different evaluation metrics, not just the error rate
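The arithmetic behind this argument can be checked with a few lines. The population size of 10,000 is an assumption chosen only to make the counts round; the 0.5% cancer rate and the 1% model error rate are the figures from the example.

```python
n_patients = 10_000        # assumed test-set size, for illustration only
cancer_rate = 0.005        # 0.5% of patients have cancer
model_error_rate = 0.01    # the trained classifier errs on 1% of patients

n_positive = round(n_patients * cancer_rate)   # patients with cancer
n_negative = n_patients - n_positive           # patients without cancer

# The trivial classifier always predicts 0 ("no cancer"),
# so it is wrong on every positive patient and right on every negative one.
trivial_errors = n_positive
trivial_error_rate = trivial_errors / n_patients

# It also detects none of the cancer patients.
trivial_detected = 0

print(f"model error rate:   {model_error_rate:.3%}")
print(f"trivial error rate: {trivial_error_rate:.3%}")
print(f"cancer cases found by trivial classifier: {trivial_detected} of {n_positive}")
```

The trivial classifier wins on error rate while finding no cancer cases at all, which is exactly why error rate alone is misleading here.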

Confusion Matrix

A confusion matrix is a $2 \times 2$ Contingency Table

• We divide our predictions and mis-predictions into this matrix

| Test outcome $h_{\theta}(x)$ | Actual class $y$: positive | Actual class $y$: negative | |
|---|---|---|---|
| Test outcome positive | True positive ($\text{TP}$) | False positive ($\text{FP}$, Type I error) | Precision = $\cfrac{\# \text{TP}}{\# \text{TP} + \# \text{FP}}$ |
| Test outcome negative | False negative ($\text{FN}$, Type II error) | True negative ($\text{TN}$) | Negative predictive value = $\cfrac{\# \text{TN}}{\# \text{FN} + \# \text{TN}}$ |
| | Sensitivity = $\cfrac{\# \text{TP}}{\# \text{TP} + \# \text{FN}}$ | Specificity = $\cfrac{\# \text{TN}}{\# \text{FP} + \# \text{TN}}$ | Accuracy = $\cfrac{\# \text{TP} + \# \text{TN}}{\# \text{TOTAL}}$ |

Main values of this matrix:

• True Positive - we predicted "+" and the true class is "+"
• True Negative - we predicted "-" and the true class is "-"
• False Positive - we predicted "+" and the true class is "-" (Type I error)
• False Negative - we predicted "-" and the true class is "+" (Type II error)
• (see also Statistical Tests of Significance#Type I and Type II Errors)
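The four cells can be counted by walking over pairs of true and predicted labels. This is a minimal sketch, assuming labels are encoded as 1 for "+" and 0 for "-"; the function name `confusion_counts` and the toy data are made up for illustration.

```python
def confusion_counts(y_true, y_pred):
    """Return (tp, fp, fn, tn) for binary labels encoded as 1 ("+") and 0 ("-")."""
    tp = fp = fn = tn = 0
    for actual, predicted in zip(y_true, y_pred):
        if predicted == 1 and actual == 1:
            tp += 1   # predicted "+", true class "+"
        elif predicted == 1 and actual == 0:
            fp += 1   # predicted "+", true class "-" (Type I error)
        elif predicted == 0 and actual == 1:
            fn += 1   # predicted "-", true class "+" (Type II error)
        else:
            tn += 1   # predicted "-", true class "-"
    return tp, fp, fn, tn


# Hypothetical toy labels.
y_true = [1, 1, 0, 0, 1, 0, 0, 0]
y_pred = [1, 0, 0, 1, 1, 0, 0, 0]

tp, fp, fn, tn = confusion_counts(y_true, y_pred)
print(tp, fp, fn, tn)
```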

The following measures can be calculated:

• Accuracy
• Misclassification Error (or Error Rate)
• Positive predictive value (or precision)
• $P = \cfrac{\text{TP}}{\text{TP} + \text{FP}}$
• Negative predictive value
• True Positive Rate (also Sensitivity or Recall)
• Fraction of positive examples correctly classified
• $\text{tpr} = \cfrac{\text{TP}}{\text{TP} + \text{FN}}$
• False Positive Rate (also Fall-Out)
• Fraction of negative examples incorrectly classified
• $\text{fpr} = \cfrac{\text{FP}}{\text{FP} + \text{TN}}$
• Specificity
• Support - fraction of positively classified examples
• $\text{sup} = \cfrac{\text{TP} + \text{FP}}{N} = \cfrac{\text{predicted pos}}{\text{total}}$
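All of the measures listed above are simple ratios of the four confusion-matrix counts. The sketch below collects them in one hypothetical helper, `binary_metrics`. Returning NaN when a denominator is zero is a choice made here, not something the definitions prescribe.

```python
def binary_metrics(tp, fp, fn, tn):
    """Compute the measures listed above from confusion-matrix counts."""
    n = tp + fp + fn + tn

    def ratio(num, den):
        # Assumption: report NaN when a measure is undefined (zero denominator).
        return num / den if den else float("nan")

    return {
        "accuracy": ratio(tp + tn, n),
        "error_rate": ratio(fp + fn, n),
        "precision": ratio(tp, tp + fp),    # positive predictive value
        "npv": ratio(tn, fn + tn),          # negative predictive value
        "tpr": ratio(tp, tp + fn),          # sensitivity, recall
        "fpr": ratio(fp, fp + tn),          # fall-out
        "specificity": ratio(tn, fp + tn),  # equals 1 - fpr
        "support": ratio(tp + fp, n),       # fraction predicted positive
    }


# Hypothetical counts: TP = 2, FP = 1, FN = 1, TN = 4.
metrics = binary_metrics(tp=2, fp=1, fn=1, tn=4)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```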

Accuracy and Error Rate

In practice, these are the most widely used metrics

• Accuracy: $\text{acc} = \cfrac{\text{TP} + \text{TN}}{N}$
• Fraction of correctly classified examples
• Error Rate: $\text{error} = \cfrac{\text{FN} + \text{FP}}{N} = 1 - \text{acc}$
• Fraction of misclassified examples

Precision

Of all the inputs for which we predicted $h_{\theta}(x) = 1$, what fraction actually has $y = 1$?

$P = \text{Precision} = \cfrac{\# \text{TP}}{\# \text{predicted positives}} = \cfrac{\# \text{TP}}{\# \text{TP} + \# \text{FP}}$

• Out of all the people we thought have cancer, how many actually had it?
• High precision is good
• we don't tell many people that they have cancer when they actually don't

Recall

Of all the inputs that actually have $y = 1$, what fraction did we correctly detect as $h_{\theta}(x) = 1$?

$R = \text{Recall} = \cfrac{\# \text{TP}}{\# \text{actual positives}} = \cfrac{\# \text{TP}}{\# \text{TP} + \# \text{FN}}$

• Out of all the people that actually do have cancer, how many did we identify?
• The higher the better:
• We don't fail to spot many people that actually have cancer

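• For a classifier that always returns zero (i.e. $h_{\theta}(x) = 0$) the Recall is zero
• So with skewed classes Recall is a more useful evaluation metric than the error rate
• And a high Recall makes us much more confident that the classifier really detects the positive class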
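To see how recall exposes the always-zero classifier, the sketch below evaluates it on a made-up skewed test set (5 positives among 1000 examples; the numbers are assumptions for illustration).

```python
# Hypothetical skewed test set: 5 positives, 995 negatives.
y_true = [1] * 5 + [0] * 995
# Trivial classifier: always predicts 0.
y_pred = [0] * len(y_true)

pairs = list(zip(y_true, y_pred))
tp = sum(1 for actual, predicted in pairs if actual == 1 and predicted == 1)
fn = sum(1 for actual, predicted in pairs if actual == 1 and predicted == 0)
fp = sum(1 for actual, predicted in pairs if actual == 0 and predicted == 1)
correct = sum(1 for actual, predicted in pairs if actual == predicted)

accuracy = correct / len(y_true)   # looks excellent
recall = tp / (tp + fn)            # reveals that no positive is found

print(f"accuracy = {accuracy:.3f}, recall = {recall:.3f}")

# Precision is TP / (TP + FP); with no positive predictions it is 0 / 0,
# i.e. undefined, so it is not computed here.
predicted_positives = tp + fp
```

Accuracy is high only because negatives dominate; recall is zero because not a single positive example is detected.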

The F Measure is a combination of Precision and Recall

Example

| Fecal occult blood screen test outcome | Patients with bowel cancer (as confirmed on endoscopy): positive | Patients with bowel cancer (as confirmed on endoscopy): negative | |
|---|---|---|---|
| Test outcome positive | True Positive (TP) = 20 | False Positive (FP) = 180 | Positive predictive value = TP / (TP + FP) = 20 / (20 + 180) = 10% |
| Test outcome negative | False Negative (FN) = 10 | True Negative (TN) = 1820 | Negative predictive value = TN / (FN + TN) = 1820 / (10 + 1820) ≈ 99.5% |
| | Sensitivity = TP / (TP + FN) = 20 / (20 + 10) ≈ 67% | Specificity = TN / (FP + TN) = 1820 / (180 + 1820) = 91% | |
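The figures in this example can be reproduced from the four counts. The snippet below is a small check using only the numbers given above. The accuracy line is an extra value derived from the same counts and is not part of the original table.

```python
# Counts from the fecal occult blood screening example.
tp, fp, fn, tn = 20, 180, 10, 1820
total = tp + fp + fn + tn

ppv = tp / (tp + fp)           # positive predictive value (precision)
npv = tn / (fn + tn)           # negative predictive value
sensitivity = tp / (tp + fn)   # recall
specificity = tn / (fp + tn)
accuracy = (tp + tn) / total   # derived here, not given in the table

print(f"PPV         = {ppv:.1%}")
print(f"NPV         = {npv:.1%}")
print(f"Sensitivity = {sensitivity:.1%}")
print(f"Specificity = {specificity:.1%}")
print(f"Accuracy    = {accuracy:.1%}")
```

Specificity is fairly high, yet only one in ten positive test results corresponds to an actual cancer case, because true cases are rare in this sample.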

Visual Analysis

Visual ways of evaluating the performance of a classifier

Not Binary Classifiers

When we have multi-class classifiers we can use: