Evaluation of Binary Classifiers
Evaluation is important: we need a way to measure how well a classifier actually performs
Baseline
For evaluating a classifier we first need to set a baseline:
 base rate
 accuracy of a trivial classifier
 the one that always predicts the majority class
 random rate
 accuracy of a random guess
 we need some domain knowledge to choose the distribution the random guesses are drawn from
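A minimal sketch of both baselines, assuming the labels are 0/1 lists (the function names are mine, not from any library):

```python
from collections import Counter
import random

def base_rate(y):
    """Accuracy of a trivial classifier that always predicts the majority class."""
    majority_count = Counter(y).most_common(1)[0][1]
    return majority_count / len(y)

def random_rate(y, p_positive=0.5, seed=0):
    """Accuracy of guessing 1 with probability p_positive (domain knowledge
    would tell us which p_positive is a sensible choice)."""
    rng = random.Random(seed)
    guesses = [1 if rng.random() < p_positive else 0 for _ in y]
    return sum(g == t for g, t in zip(guesses, y)) / len(y)

y = [0] * 99 + [1]          # heavily skewed labels
print(base_rate(y))         # 0.99
```

On skewed labels like these, the base rate is already very high, which is exactly why accuracy alone can be misleading.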
Skewed Classes
 Suppose we have a binary classifier, e.g. cancer prediction.
 We built some classification model $h_{\theta}(x)$
 if we have $h_{\theta}(x) = 1$, we predict cancer, and if $h_{\theta}(x) = 0$, we predict no cancer.
 Then we find out that our classifier has 1% error on the test set: 99% of cases were diagnosed correctly
 But now suppose only 0.5% of patients have cancer
 This is a skewed class - one class is only a tiny portion of the other
 We would predict better by always returning 0 (by using the trivial classifier)
 (we'll have 0.5% error which is better than 1%)
 $\Rightarrow$ We need different evaluation metrics, not just error rate
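A quick sanity check of the numbers above (the total of 10,000 patients is an arbitrary illustration, not from the source):

```python
# 0.5% of 10,000 patients have cancer (hypothetical patient count)
n = 10_000
n_positive = 50                   # 0.5% prevalence

# our model has a 1% error rate
model_error = 0.01

# the trivial classifier h(x) = 0 is wrong only on the positives
trivial_error = n_positive / n

print(model_error, trivial_error)  # 0.01 0.005
```

The trivial classifier's 0.5% error beats the model's 1%, even though it detects nobody.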
Confusion Matrix
The confusion matrix is a $2 \times 2$ contingency table
 We divide our correct predictions and mispredictions into the cells of this matrix
Diagnostic Testing Measures [1]

| | Actual class $y$: Positive | Actual class $y$: Negative | |
|---|---|---|---|
| $h_{\theta}(x)$: Test outcome positive | True positive ($\text{TP}$) | False positive ($\text{FP}$, Type I error) | Precision = $\cfrac{\# \text{TP}}{\# \text{TP} + \# \text{FP}}$ |
| $h_{\theta}(x)$: Test outcome negative | False negative ($\text{FN}$, Type II error) | True negative ($\text{TN}$) | Negative predictive value = $\cfrac{\# \text{TN}}{\# \text{FN} + \# \text{TN}}$ |
| | Sensitivity = $\cfrac{\# \text{TP}}{\# \text{TP} + \# \text{FN}}$ | Specificity = $\cfrac{\# \text{TN}}{\# \text{FP} + \# \text{TN}}$ | Accuracy = $\cfrac{\# \text{TP} + \# \text{TN}}{\# \text{TOTAL}}$ |

Main values of this matrix:
 True Positive - we predicted "+" and the true class is "+"
 True Negative - we predicted "-" and the true class is "-"
 False Positive - we predicted "+" and the true class is "-" (Type I error)
 False Negative - we predicted "-" and the true class is "+" (Type II error)
 (see also Statistical Tests of Significance#Type I and Type II Errors)
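These four counts can be computed directly from paired label lists; `confusion_counts` is a hypothetical helper, not a library function:

```python
def confusion_counts(y_true, y_pred):
    """Count TP, FP, FN, TN for binary labels (1 = positive, 0 = negative)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    return tp, fp, fn, tn

y_true = [1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 1, 0]
print(confusion_counts(y_true, y_pred))  # (2, 1, 1, 2)
```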
The following measures can be calculated:
 Accuracy
 Misclassification Error (or Error Rate)
 Positive predictive value (or precision)
 $P = \cfrac{\text{TP}}{\text{TP} + \text{FP}}$
 Negative predictive value
 True Positive Rate (also Sensitivity or Recall)
 Fraction of positive examples correctly classified
 $\text{tpr} = \cfrac{\text{TP}}{\text{TP} + \text{FN}}$
 False Positive Rate (also Fall-Out)
 Fraction of negative examples incorrectly classified
 $\text{fpr} = \cfrac{\text{FP}}{\text{FP} + \text{TN}}$
 Specificity
 Support - fraction of positively classified examples
 $\text{sup} = \cfrac{\text{TP} + \text{FP}}{N} = \cfrac{\text{predicted pos}}{\text{total}}$
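All of these measures follow mechanically from the four confusion-matrix counts. A sketch (the dictionary keys are my own naming):

```python
def diagnostic_measures(tp, fp, fn, tn):
    """Derive the measures above from the four confusion-matrix counts."""
    total = tp + fp + fn + tn
    return {
        "accuracy":    (tp + tn) / total,
        "error_rate":  (fp + fn) / total,      # = 1 - accuracy
        "precision":   tp / (tp + fp),         # positive predictive value
        "npv":         tn / (fn + tn),         # negative predictive value
        "tpr":         tp / (tp + fn),         # sensitivity / recall
        "fpr":         fp / (fp + tn),         # fall-out
        "specificity": tn / (fp + tn),         # = 1 - fpr
        "support":     (tp + fp) / total,      # fraction predicted positive
    }

m = diagnostic_measures(tp=2, fp=1, fn=1, tn=2)
print(round(m["accuracy"], 2), round(m["precision"], 2))  # 0.67 0.67
```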
Accuracy and Error Rate
In practice, these are the most widely used metrics
 Accuracy: $\text{acc} = \cfrac{TP + TN}{N}$
 fraction of correctly classified examples
 Error Rate: $\text{error} = \cfrac{FN + FP}{N} = 1 - \text{acc}$
 Fraction of misclassified examples
Precision
For all inputs where we predicted $h_{\theta}(x) = 1$, what fraction actually has $y = 1$?
$P = \text{Precision} = \cfrac{\text{# TP}}{\text{# predicted positives}} = \cfrac{\text{# TP}}{\text{# TP} + \text{# FP}}$
 Out of all the people we thought had cancer, how many actually had it?
 High precision is good
 we don't tell many people that they have cancer when they actually don't
Recall
For all input data that actually have $y = 1$, what fraction did we correctly detect as $h_{\theta}(x) = 1$?
$R = \text{Recall} = \cfrac{\text{# TP}}{\text{# actual positives}} = \cfrac{\text{# TP}}{\text{# TP + # FN}}$
 Out of all the people who actually have cancer, how many did we identify?
 The higher the better:
 We don't fail to spot many people that actually have cancer
 For a classifier that always returns zero (i.e. $h_{\theta}(x) = 0$) the Recall would be zero
 So recall is a more useful evaluation metric for skewed classes
 and it gives us more confidence that the classifier actually detects the positive class
The F Measure is a combination of Precision and Recall: $F_1 = 2 \cdot \cfrac{P \cdot R}{P + R}$
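The three quantities can be sketched together; the counts below are the bowel-cancer numbers from the example that follows:

```python
def precision_recall_f1(tp, fp, fn):
    """Precision, recall, and their harmonic mean (the F1 measure)."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

# note: for the trivial classifier h(x) = 0 we get tp = 0, so recall = 0
p, r, f1 = precision_recall_f1(tp=20, fp=180, fn=10)
print(round(p, 2), round(r, 2), round(f1, 2))  # 0.1 0.67 0.17
```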
Example
Diagnostic Testing Wikipedia Example [2]

| Fecal Occult Blood Screen Test Outcome | Patients with bowel cancer (as confirmed on endoscopy): Positive | Patients with bowel cancer: Negative | |
|---|---|---|---|
| Test Outcome Positive | True Positive (TP) = 20 | False Positive (FP) = 180 | Positive predictive value = TP / (TP + FP) = 20 / (20 + 180) = 10% |
| Test Outcome Negative | False Negative (FN) = 10 | True Negative (TN) = 1820 | Negative predictive value = TN / (FN + TN) = 1820 / (10 + 1820) ≈ 99.5% |
| | Sensitivity = TP / (TP + FN) = 20 / (20 + 10) ≈ 67% | Specificity = TN / (FP + TN) = 1820 / (180 + 1820) = 91% | |
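The numbers in the table can be verified directly:

```python
# counts from the bowel-cancer screening example
TP, FP, FN, TN = 20, 180, 10, 1820

ppv = TP / (TP + FP)            # positive predictive value
npv = TN / (FN + TN)            # negative predictive value
sensitivity = TP / (TP + FN)
specificity = TN / (FP + TN)

print(f"{ppv:.1%} {npv:.1%} {sensitivity:.0%} {specificity:.0%}")
# 10.0% 99.5% 67% 91%
```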

Visual Analysis
Visual ways of evaluating the performance of a classifier
Non-Binary Classifiers
When we have multiclass classifiers we can use:
 Contingency Table
 just show misclassified examples side-by-side
 Cost Matrix
 we define the cost for each misclassification
 and calculate the total cost
 some measures (e.g. precision and recall) can be extended to multiclass classifiers by averaging per-class values
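The cost-matrix idea can be sketched for a 3-class problem; the cost values below are purely hypothetical:

```python
# cost[i][j] is the (hypothetical) cost of predicting class j
# when the true class is i; correct predictions cost 0
cost = [
    [0, 1, 2],
    [5, 0, 1],
    [10, 5, 0],
]

y_true = [0, 1, 2, 2, 1]
y_pred = [0, 2, 2, 0, 1]

# total cost of all misclassifications
total_cost = sum(cost[t][p] for t, p in zip(y_true, y_pred))
print(total_cost)  # 11
```

An asymmetric cost matrix like this one encodes that some mistakes (here, predicting class 0 for a true class 2) are much worse than others.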
See Also
Sources