Evaluation of Binary Classifiers
Evaluation is important:
- models have to predict classes of new, unlabeled data
- sometimes it’s an integral part of the training process, e.g. pruning in Decision Tree (Data Mining) (see Cross Validation)
- it’s also needed when we want to compare two or more different models (see Meta Learning)
Baseline
To evaluate a classifier, we first need to set some baseline:
- base rate
- the accuracy of a trivial classifier
- the one that always predicts the majority class
- random rate
- the accuracy of random guessing
- we need some domain knowledge to assign a Random Distribution
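A minimal Python sketch of these two baselines on a made-up label vector (the labels, class balance and the choice of guessing with the observed class frequencies are illustrative assumptions, not from the source):

```python
from collections import Counter

# Made-up labels: 5% positives, 95% negatives
y = [0] * 950 + [1] * 50

# Base rate: accuracy of the trivial classifier (always predict the majority class)
majority_class, majority_count = Counter(y).most_common(1)[0]
base_rate = majority_count / len(y)

# Random rate: expected accuracy when guessing classes with their observed
# frequencies (one possible choice of "random distribution")
p_pos = sum(y) / len(y)
random_rate = p_pos ** 2 + (1 - p_pos) ** 2

print(f"majority class: {majority_class}")
print(f"base rate:      {base_rate:.3f}")    # 0.950
print(f"random rate:    {random_rate:.3f}")  # 0.905
```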
Skewed Classes
- Suppose we have a binary classifier, e.g. cancer prediction.
- We built some classification model $h_{\theta}(x)$
- if we have $h_{\theta}(x) = 1$, we predict cancer, and if $h_{\theta}(x) = 0$, we predict no cancer.
- Then we find out that our classifier makes 1% errors on the test set, i.e. 99% of patients were diagnosed correctly
- So the "error rate" is 1%
- But now suppose only 0.5% of the patients actually have cancer
- This is a "skewed class": the positive class is only a tiny portion of the other class
- We would predict better by always returning 0 (by using the trivial classifier)
- (we’d have 0.5% error, which is better than 1%)
- $\Rightarrow$ We need different evaluation metrics, not just error rate
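A small sketch of the same arithmetic in Python, using the 0.5% prevalence and 1% error rate from the text (the patient count is made up):

```python
# Skewed-class effect: 0.5% of patients have cancer, our model has a 1% error rate.
n = 200_000
n_cancer = int(0.005 * n)                 # 1,000 actual positives

# Trivial classifier: always predict "no cancer" (0)
trivial_errors = n_cancer                 # it misses every positive
trivial_error_rate = trivial_errors / n   # 0.005 -> 0.5%

model_error_rate = 0.01                   # the 1% error rate from the text

print(f"model error rate:   {model_error_rate:.1%}")    # 1.0%
print(f"trivial error rate: {trivial_error_rate:.1%}")  # 0.5%
# The trivial classifier "wins" on error rate while being useless,
# which is why we need metrics like Precision and Recall.
```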
Confusion Matrix
A confusion matrix is a $2 \times 2$ Contingency Table
- We divide our correct predictions and mis-predictions between the cells of this matrix
Diagnostic Testing Measures [http://en.wikipedia.org/wiki/Template:DiagnosticTesting_Diagram]:

| $h_{\theta}(x)$ Test outcome | Actual class $y$: Positive | Actual class $y$: Negative | |
|---|---|---|---|
| Test outcome positive | True Positive ($\text{TP}$) | False Positive ($\text{FP}$, Type I error) | Precision $= \cfrac{\text{TP}}{\text{TP} + \text{FP}}$ |
| Test outcome negative | False Negative ($\text{FN}$, Type II error) | True Negative ($\text{TN}$) | Negative predictive value $= \cfrac{\text{TN}}{\text{FN} + \text{TN}}$ |
| | Sensitivity $= \cfrac{\text{TP}}{\text{TP} + \text{FN}}$ | Specificity $= \cfrac{\text{TN}}{\text{FP} + \text{TN}}$ | Accuracy $= \cfrac{\text{TP} + \text{TN}}{N}$ |
Main values of this matrix:
- True Positive - we predicted "+" and the true class is "+"
- True Negative - we predicted "-" and the true class is "-"
- False Positive - we predicted "+" and the true class is "-" (Type I error)
- False Negative - we predicted "-" and the true class is "+" (Type II error)
- (see also Statistical Tests of Significance#Type I and Type II Errors)
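A minimal sketch of how these four counts can be tallied for a binary problem (the label and prediction vectors are made-up illustration data):

```python
# Count the four confusion-matrix cells (1 = positive, 0 = negative)
y_true = [1, 0, 0, 1, 0, 1, 0, 0, 1, 0]
y_pred = [1, 0, 1, 0, 0, 1, 0, 0, 1, 1]

tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)

print(f"TP={tp} FP={fp}")   # TP=3 FP=2
print(f"FN={fn} TN={tn}")   # FN=1 TN=4
```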
The following measures can be calculated:
- Accuracy
- Misclassification Error (or Error Rate)
- Positive predictive value (or precision)
- $P = \cfrac{\text{TP}}{\text{TP} + \text{FP}}$
- Negative predictive value
- True Positive Rate (also Sensitivity or Recall)
- Fraction of positive examples correctly classified
- $\text{tpr} = \cfrac{\text{TP}}{\text{TP} + \text{FN}}$
- False Positive Rate (also Fall-Out)
- Fraction of negative examples incorrectly classified
- $\text{fpr} = \cfrac{\text{FP}}{\text{FP} + \text{TN}}$
- Specificity
- Support - fraction of examples classified as positive
- $\text{sup} = \cfrac{\text{TP} + \text{FP}}{N} = \cfrac{\text{predicted pos}}{\text{total}}$
Accuracy and Error Rate
In practice, these are the most widely used metrics
- Accuracy: $\text{acc} = \cfrac{\text{TP} + \text{TN}}{N}$
- fraction of correctly classified examples
- Error Rate: $\text{error} = \cfrac{\text{FN} + \text{FP}}{N} = 1 - \text{acc}$
- fraction of misclassified examples
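A small sketch computing the measures above from confusion-matrix counts (the counts are the ones from the toy vectors in the previous sketch):

```python
# Measures derived from the four confusion-matrix counts
tp, fp, fn, tn = 3, 2, 1, 4
n = tp + fp + fn + tn

accuracy    = (tp + tn) / n     # 0.7
error_rate  = (fn + fp) / n     # 0.3   (== 1 - accuracy)
tpr         = tp / (tp + fn)    # 0.75  (sensitivity / recall)
fpr         = fp / (fp + tn)    # 0.333...
specificity = tn / (fp + tn)    # 0.666... (== 1 - fpr)
support     = (tp + fp) / n     # 0.5   (fraction classified as positive)

print(accuracy, error_rate, tpr, fpr, specificity, support)
```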
Precision
For all input data for which we predicted $h_{\theta}(x) = 1$, what fraction actually has $y = 1$?
$P = \text{Precision} = \cfrac{\text{TP}}{\text{predicted positives}} = \cfrac{\text{TP}}{\text{TP} + \text{FP}}$
- Out of all the people we predicted to have cancer, how many actually had it?
- High precision is good
- we don’t tell many people that they have cancer when they actually don’t
Recall
For all input data that actually have $y = 1$, what fraction did we correctly detect as $h_{\theta}(x) = 1$?
$R = \text{Recall} = \cfrac{\text{TP}}{\text{actual positives}} = \cfrac{\text{TP}}{\text{TP} + \text{FN}}$
- Out of all the people that actually have cancer, how many did we identify?
- The higher, the better: we don’t fail to spot many people that actually have cancer
- For a classifier that always returns zero (i.e. $h_{\theta}(x) = 0$) the Recall would be zero
- This gives us a more useful evaluation metric, and we can be much more confident that the classifier actually does something useful
The F Measure is a combination of Precision and Recall
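A minimal sketch of Precision, Recall and the F1 score (the usual balanced F Measure), again from the same toy counts:

```python
# Precision, Recall and F1 from confusion-matrix counts
tp, fp, fn = 3, 2, 1

precision = tp / (tp + fp)                                  # 0.6
recall    = tp / (tp + fn)                                  # 0.75
f1 = 2 * precision * recall / (precision + recall)          # harmonic mean ≈ 0.667

print(f"P={precision:.2f} R={recall:.2f} F1={f1:.2f}")
```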
Example
Diagnostic Testing Wikipedia Example [http://en.wikipedia.org/wiki/Template:DiagnosticTesting_Example]: Fecal Occult Blood screen test for bowel cancer (as confirmed on endoscopy):

| Fecal Occult Blood Screen Test Outcome | Patients with bowel cancer: Positive | Patients with bowel cancer: Negative | |
|---|---|---|---|
| Test Outcome Positive | True Positive (TP) = 20 | False Positive (FP) = 180 | Positive predictive value = TP / (TP + FP) = 20 / (20 + 180) = 10% |
| Test Outcome Negative | False Negative (FN) = 10 | True Negative (TN) = 1820 | Negative predictive value = TN / (FN + TN) = 1820 / (10 + 1820) ≈ 99.5% |
| | Sensitivity = TP / (TP + FN) = 20 / (20 + 10) ≈ 67% | Specificity = TN / (FP + TN) = 1820 / (180 + 1820) = 91% | |
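The numbers in this table are easy to recompute; a small sketch:

```python
# Recompute the Wikipedia example numbers from the table above
tp, fp, fn, tn = 20, 180, 10, 1820

ppv         = tp / (tp + fp)   # positive predictive value (precision)
npv         = tn / (fn + tn)   # negative predictive value
sensitivity = tp / (tp + fn)   # recall
specificity = tn / (fp + tn)

print(f"PPV         = {ppv:.1%}")          # 10.0%
print(f"NPV         = {npv:.1%}")          # 99.5%
print(f"Sensitivity = {sensitivity:.1%}")  # 66.7%
print(f"Specificity = {specificity:.1%}")  # 91.0%
```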
Visual Analysis
Visual ways of evaluating the performance of a classifier:
- ROC Analysis - True Positive Rate vs False Positive Rate
- Cumulative Gain Charts - True Positive Rate vs Predicted Positive Rate
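A minimal sketch of ROC analysis; scikit-learn is assumed here as the tooling, and the labels and scores are made up (neither is prescribed by the source):

```python
from sklearn.metrics import roc_curve, roc_auc_score

y_true  = [0, 0, 1, 1, 0, 1, 0, 1]                    # made-up labels
y_score = [0.1, 0.4, 0.35, 0.8, 0.2, 0.9, 0.5, 0.7]   # made-up predicted scores

# roc_curve sweeps the decision threshold and returns the (fpr, tpr) pairs
fpr, tpr, thresholds = roc_curve(y_true, y_score)
auc = roc_auc_score(y_true, y_score)

# Plotting tpr against fpr gives the ROC curve; AUC summarizes it in one number
print(list(zip(fpr, tpr)))
print(f"AUC = {auc:.2f}")   # ~0.88 on this toy data
```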
Non-Binary Classifiers
When we have multi-class classifiers we can use:
- Contingency Table
- shows the counts of correct and misclassified examples for each pair of classes side-by-side
- Cost Matrix
- we define the cost for each misclassification
- and calculate the total cost (see the sketch after this list)
- some measures can be extended to multi-class classifiers, e.g. Precision and Recall computed per class and then averaged
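A minimal sketch of the cost-matrix idea for a multi-class problem (the classes, costs and predictions are made up):

```python
# Every (actual, predicted) pair gets a cost; the total cost is summed over all predictions
cost = {                      # cost[actual][predicted]
    "cat":  {"cat": 0, "dog": 1, "bird": 2},
    "dog":  {"cat": 1, "dog": 0, "bird": 2},
    "bird": {"cat": 5, "dog": 5, "bird": 0},
}

y_true = ["cat", "dog", "bird", "cat", "bird"]
y_pred = ["cat", "bird", "bird", "dog", "cat"]

total_cost = sum(cost[t][p] for t, p in zip(y_true, y_pred))
print(total_cost)   # 0 + 2 + 0 + 1 + 5 = 8
```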
See Also
Sources
- Machine Learning (coursera)
- Data Mining (UFRT)
- http://en.wikipedia.org/wiki/Binary_classification#Evaluation_of_binary_classifiers
- http://en.wikipedia.org/wiki/Template:DiagnosticTesting_Diagram
- http://en.wikipedia.org/wiki/Template:DiagnosticTesting_Example
- Introduction to Data Science (coursera)