ML Wiki

Precision and Recall

Precision and Recall are quality metrics used across many domains:

Precision and Recall for Information Retrieval

IR system has to be:

• precise: all returned document should be relevant
• efficient: all relevant document should be returned

Given a test collection, the quality of an IR system is evaluated with:

• Precision: % of relevant documents in the result
• Recall: % of retrieved relevant documents

More formally,

• given a collection of documents $C$
• If $X \subseteq C$ is the output of the IR system and $Y \subseteq C$ is the list of all relevant documents then define
• precision as $P = \cfrac{|X \cup Y|}{| X|}$ and recall as $R = \cfrac{|X \cup Y|}{| Y |}$
• both $P$ and $R$ are defined w.r.t a set of retrieved documents

Precision/Recall Curves

• If we retrieve more document, we improve recall (if return all docs, $R = 1$)
• if we retrieve fewer documents, we improve precision, but reduce recall
• so there's a trade-off between them

Let $k$ be the number of retrieved documents

• then by varying $k$ from $0$ to $N = |C|$ we can draw $P$ vs $R$ and obtain the Precision/recall curve:
• source: 
• the closer the curve to the $(1, 1)$ point - the better the IR system performance
• source: Information Retrieval (UFRT) lecture 2

Area under P/R Curve:

• Analogously to ROC Curves we can calculate the area under the P/R Curve
• the closer AUPR to 1 the better

Average Precision

Top-$k$-precision is insensitive to change of ranks of relevant documents among top $k$

how to measure overall performance of an IR system?

$\text{avg P} = \cfrac{1}{K} \sum_{k = 1}^K \cfrac{k}{r_k}$

• where $r_i$ is the rank of $k$th relevant document in the result

Since in a test collection we usually have a set of queries, we calcuate the average over them and get Mean Average Precision: MAP

Precision and Recall for Classification

The precision and recall metrics can also be applied to Machine Learning: to binary classifiers

 Actual Class $y$ $h_{\theta}(x)$ Testoutcome Positive Negative Testoutcomepositive True positive ($\text{TP}$) False positive($\text{FP}$, Type I error) Precision = $\cfrac{\# \text{TP}}{\# \text{TP} + \# \text{FP}}$ Testoutcomenegative False negative($\text{FN}$, Type II error) True negative ($\text{TN}$) Negative predictive value = $\cfrac{\# \text{TN}}{\# \text{FN} + \# \text{TN}}$ Sensitivity = $\cfrac{\# \text{TP}}{\# \text{TP} + \# \text{FN}}$ Specificity = $\cfrac{\# \text{TN}}{\# \text{FP} + \# \text{TN}}$ Accuracy = $\cfrac{\# \text{TP} + \# \text{TN}}{\# \text{TOTAL}}$

Main values of this matrix:

• True Positive - we predicted "+" and the true class is "+"
• True Negative - we predicted "-" and the true class is "-"
• False Positive - we predicted "+" and the true class is "-" (Type I error)
• False Negative - we predicted "-" and the true class is "+" (Type II error)

Two Classes: $C_+$ and $C_-$

Precision

Precision

• $\pi = P\big(f(\mathbf x) = C_+ \, \big| \, h_{\theta}(\mathbf x) = C_+ \big)$
• given that we predict $\mathbf x$ is +
• what's the probability that the decision is correct
• we estimate precision as $P = \cfrac{\text{# TP}}{\text{# predicted positives}} = \cfrac{\text{# TP}}{\text{# TP} + \text{# FP}}$

Interpretation

• Out of all the people we thought have cancer, how many actually had it?
• High precision is good
• we don't tell many people that they have cancer when they actually don't

Recall

Recall

• $\rho = P\big(h_{\theta}(\mathbf x) = C_+ \, \big| \, f(\mathbf x) = C_+ \big)$
• given a positive instance $\mathbf x$
• what's the probability that we predict correctly
• we estimate recall as $R = \cfrac{\text{# TP}}{\text{# actual positives}} = \cfrac{\text{# TP}}{\text{# TP + # FN}}$

Interpretation

• Out of all the people that do actually have cancer, how much we identified?
• The higher the better:
• We don't fail to spot many people that actually have cancer

• For a classifier that always returns zero (i.e. $h_{\theta}(x) = 0$) the Recall would be zero
• That gives us more useful evaluation metric
• And we're much more sure

F Measure

$P$ and $R$ don't make sense in the isolation from each other

• higher level of $\rho$ may be obtained by lowering $\pi$ and vice versa

Suppose we have a ranking classifier that produces some score for $\mathbf x$

• we decide whether to classify it as $C_+$ or $C_-$ based on some threshold parameter $\tau$
• by varying $\tau$ we will get different precision and recall
• improving recall will lead to worse precision
• improving precision will lead to worse recall
• how to pick the threshold?
• combine $P$ and $R$ into one measure (also see ROC Analysis)

$F_\beta = \cfrac{(\beta^2 + 1) P\, R}{\beta^2 \, P + R}$

• $\beta$ is the tradeoff between $P$ and $R$
• if $\beta$ is close to 0, then we give more importance to $P$
• $F_0 = P$
• if $\beta$ is closer to $+ \infty$, we give more importance to $R$

When $\beta = 1$ we have $F_1$ score:

• The $F_1$-score is a single measure of performance of the test.
• it's the harmonic mean of precision $P$ and recall $R$
• $F_1 = 2 \cfrac{P \, R}{P + R}$

Motivation: Precision and Recall

Let's say we trained a Logistic Regression classifier

• we predict 1 if $h_{\theta}(x) \geqslant 0.5$
• we predict 0 if $h_{\theta}(x) < 0.5$

Suppose we want to predict y = 1 (i.e. people have cancer) only if we're very confident

• we may change the threshold to 0.7
• we predict 1 if $h_{\theta}(x) \geqslant 0.7$
• we predict 0 if $h_{\theta}(x) < 0.7$
• We'll have higher precision in this case (all for who we predicted y = 1 are more likely to actually have it)
• But lower recall (we'll miss more patients that actually have cancer, but we failed to spot them)

Let's consider the opposite

• Suppose we want to avoid missing too many cases of y=1 (i.e. we want to avoid false negatives)
• So we may change the threshold to 0.3
• we predict 1 if $h_{\theta}(x) \geqslant 0.3$
• we predict 0 if $h_{\theta}(x) < 0.3$
• Higher recall (we'll correctly flag higher fraction of patients with cancer)
• Lower precision (and higher fraction will turn out to actually have no cancer)

Questions

• Is there a way to automatically choose the threshold for us?
• How to compare precision and recall numbers and decide which algorithm is better?
• at the beginning we had a single number (error ratio) - but now have two and need to choose which one to prefer
• $F_1$ score helps to decide since it's just one number

Example

Suppose we have 3 algorithms $A_1$, $A_2$, $A_3$, and we captured the following metrics:

 $P$ $R$ $\text{Avg}$ $F_1$ $A_1$ 0.5 0.4 0.45 0.444 $\leftarrow$ our choice 0.7 0.1 0.4 0.175 0.02 1.0 0.54 0.0392

Here's the best is $A_1$ because it has the highest $F_1$-score

Precision and Recall for Clustering

Can use precision and recall to evaluate the result of clustering

Correct decisions:

• TP = decision to assign two similar documents to the same cluster
• TN = assign two dissimilar documents to different clusters

Errors:

• FP: assign two dissimilar documents to the same cluster
• FN: assign two similar documents to different clusters

So the confusion matrix is:

 same different TP FN FP TN

Example

Consider the following example (from the IR book )

• • there are $\cfrac{n \, (n - 1)}{2} = 136$ pairs of documents
• $\text{TP} + \text{FP} = {6 \choose 2} + {6 \choose 2} + {5 \choose 2} = 40$
• $\text{TP} = {5 \choose 2} + {4 \choose 2} + {3 \choose 2} + {2 \choose 2} = 20$
• etc

So have the following contingency table:

 same different $\text{TP} = 20$ $\text{FN} = 24$ $\text{FP} = 20$ $\text{TN} = 72$

Thus,

• $P = 20/40 = 0.5$ and $R = 20/44 \approx 0.455$
• $F_1$ score is $F_1 \approx 0.48$

Multi-Class Problems

How do we adapt precision and recall to multi-class problems?

• let $f(\cdot)$ be the target unknown function and $h_\theta(\cdot)$ the model
• let $C_1, \ ... , C_K$ be labels we want to predict ($K$ labels)

Precision w.r.t class $C_i$ is

• $P\big(f(\mathbf x) = C_i \ \big| \ h_\theta(\mathbf x) = C_i \big)$
• probability that given that we classified $\mathbf x$ as $C_i$
• the decision is indeed correct

Recall w.r.t. class $C_i$ is

• $P\big(h_\theta(\mathbf x) = C_i \ \big| \ f(\mathbf x) = C_i \big)$
• given an instance $\mathbf x$ belongs to $C_i$
• what's the probability that we predict correctly

We estimate these probabilities using a contingency table w.r.t each class $C_i$

Contingency Table for $C_i$:

• let $C_+$ be $C_i$ and
• let $C_-$ be all other classes except for $C_i$, i.e. $C_- = \{ C_j \} - C_i$ (all classes except for $i$)
• then we create a contingency table
• and calculate $\text{TP}_i, \text{FP}_i, \text{FN}_i, \text{TN}_i$ for them
• Now estimate precision and recall for class $C_i$

• $P_i = \cfrac{\text{TP}_i}{\text{TP}_i + \text{FP}_i}$
• $R_i = \cfrac{\text{TP}_i}{\text{TP}_i + \text{FN}_i}$

Averaging

• These precision and recall are calculated for each class separately
• how to combine them?

Micro-averaging

• calculate TP, ... etc globally and then average
• let
• $\text{TP} = \sum_i \text{TP}_i$
• $\text{FP} = \sum_i \text{FP}_i$
• $\text{FN} = \sum_i \text{FN}_i$
• $\text{TN} = \sum_i \text{TN}_i$
• and then calculate precision and recall as
• $P^\mu = \cfrac{\text{TP}}{\text{TP} + \text{FP}}$
• $R^\mu = \cfrac{\text{TP}}{\text{TP} + \text{FN}}$

Macro-averaging

• similar to the One-vs-All Classification technique
• calculate $P_i$ and $R_i$ "locally" for each $C_i$
• and then let $P^M = \cfrac{1}{K} \sum_i P_i$ and $R^M = \cfrac{1}{K} \sum_i R_i$

Micro and macro averaging behave quite differently and may give different results

• the ability to behave well on categories with low generality (fewer training examples) will be less emphasized by macroaveraging
• which one to use? depends on application

This way is often used in Document Classification

Sources

• Machine Learning (coursera)
• Sebastiani, Fabrizio. "Machine learning in automated text categorization." (2002). 
• Zhai, ChengXiang. "Statistical language models for information retrieval." 2008.
• Information Retrieval (UFRT)
• Manning, Christopher D., Prabhakar Raghavan, and Hinrich Schütze. "Introduction to information retrieval." 2008.