Precision and Recall

Precision and Recall are quality metrics used across many domains:


Precision and Recall for Information Retrieval

An IR system has to be:

  • precise: all returned documents should be relevant
  • complete: all relevant documents should be returned


Given a test collection, the quality of an IR system is evaluated with:

  • Precision: % of relevant documents in the result
  • Recall: % of relevant documents that are retrieved


More formally,

  • given a collection of documents $C$
  • If $X \subseteq C$ is the output of the IR system and $Y \subseteq C$ is the set of all relevant documents, then define
  • precision as $P = \cfrac{|X \cap Y|}{| X|}$ and recall as $R = \cfrac{|X \cap Y|}{| Y |}$ (see the sketch below)
  • both $P$ and $R$ are defined w.r.t a set of retrieved documents
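
Below is a minimal sketch of these set-based definitions in Python; the document ids and the sets $X$ and $Y$ are made up for illustration.

```python
def precision_recall(retrieved, relevant):
    """Set-based definitions: P = |X ∩ Y| / |X|, R = |X ∩ Y| / |Y|."""
    hits = retrieved & relevant                # X ∩ Y: relevant documents that were retrieved
    p = len(hits) / len(retrieved) if retrieved else 0.0
    r = len(hits) / len(relevant) if relevant else 0.0
    return p, r

# Hypothetical example: 4 retrieved documents, 5 relevant ones, 3 in common
X = {"d1", "d2", "d3", "d7"}                   # output of the IR system
Y = {"d1", "d2", "d3", "d4", "d5"}             # all relevant documents
print(precision_recall(X, Y))                  # (0.75, 0.6)
```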


Precision/Recall Curves

  • If we retrieve more documents, we improve recall (if we return all documents, $R = 1$)
  • If we retrieve fewer documents, we improve precision, but reduce recall
  • So there's a trade-off between them


Let $k$ be the number of retrieved documents

  • then by varying $k$ from $0$ to $N = |C|$ we can draw $P$ vs $R$ and obtain the Precision/recall curve:
  • 26bc1ea381424262b9d966c63f418661.png source: [1]
  • the closer the curve is to the $(1, 1)$ point, the better the IR system's performance
  • 010ded77e8d0454b99f0cafd3d962613.png source: Information Retrieval (UFRT) lecture 2


Area under P/R Curve:

  • Analogously to ROC Curves, we can calculate the area under the P/R Curve (AUPR)
  • the closer AUPR is to 1, the better
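
As a sketch of how such a curve could be traced and its area approximated: the ranked relevance judgments below are made up, precision and recall are computed at every cut-off $k$, and the area is estimated with the trapezoidal rule.

```python
# Relevance judgments for a hypothetical ranked result list (1 = relevant).
ranked_relevance = [1, 0, 1, 1, 0, 0, 1, 0, 0, 0]
total_relevant = 5                # assume one relevant document was never retrieved

points = []                       # (recall, precision) pairs for k = 1..N
hits = 0
for k, rel in enumerate(ranked_relevance, start=1):
    hits += rel
    points.append((hits / total_relevant, hits / k))

# Area under the P/R curve, approximated with the trapezoidal rule over recall.
aupr = sum((r2 - r1) * (p1 + p2) / 2
           for (r1, p1), (r2, p2) in zip(points, points[1:]))
print(points)
print(round(aupr, 3))
```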


Average Precision

Top-$k$ precision is insensitive to changes in the ranks of relevant documents among the top $k$

How do we measure the overall performance of an IR system?

$\text{avg P} = \cfrac{1}{K} \sum_{k = 1}^K \cfrac{k}{r_k}$

  • where $r_k$ is the rank of the $k$th relevant document in the result and $K$ is the number of relevant documents retrieved

Since in a test collection we usually have a set of queries, we calculate the average over them and get the Mean Average Precision (MAP)
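
A sketch of average precision and MAP under this formula, using made-up relevance judgments for two hypothetical queries (here all relevant documents are assumed to appear in the result list):

```python
def average_precision(ranked_relevance):
    """avg P = (1/K) * sum_k (k / r_k), where r_k is the rank of the k-th relevant document."""
    hits, precisions = 0, []
    for rank, rel in enumerate(ranked_relevance, start=1):
        if rel:
            hits += 1                        # this is the k-th relevant document
            precisions.append(hits / rank)   # precision at its rank r_k is k / r_k
    return sum(precisions) / len(precisions) if precisions else 0.0

# Hypothetical relevance judgments for two queries of a test collection
queries = [
    [1, 0, 1, 0, 0],   # relevant documents at ranks 1 and 3
    [0, 1, 1, 1, 0],   # relevant documents at ranks 2, 3 and 4
]
aps = [average_precision(q) for q in queries]
map_score = sum(aps) / len(aps)              # Mean Average Precision (MAP)
print([round(ap, 3) for ap in aps], round(map_score, 3))   # [0.833, 0.639] 0.736
```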


Precision and Recall for Classification

The precision and recall metrics can also be applied in Machine Learning, e.g. to evaluate binary classifiers


Diagnostic Testing Measures [2]

Confusion matrix of a binary classifier $h_{\theta}(x)$ against the actual class $y$:

                          Actual class positive                           Actual class negative
Test outcome positive     True positive ($\text{TP}$)                     False positive ($\text{FP}$, Type I error)
Test outcome negative     False negative ($\text{FN}$, Type II error)     True negative ($\text{TN}$)

Measures derived from the matrix:

  • Precision $= \cfrac{\# \text{TP}}{\# \text{TP} + \# \text{FP}}$
  • Negative predictive value $= \cfrac{\# \text{TN}}{\# \text{FN} + \# \text{TN}}$
  • Sensitivity $= \cfrac{\# \text{TP}}{\# \text{TP} + \# \text{FN}}$
  • Specificity $= \cfrac{\# \text{TN}}{\# \text{FP} + \# \text{TN}}$
  • Accuracy $= \cfrac{\# \text{TP} + \# \text{TN}}{\# \text{TOTAL}}$


Main values of this matrix:

  • True Positive - we predicted "+" and the true class is "+"
  • True Negative - we predicted "-" and the true class is "-"
  • False Positive - we predicted "+" and the true class is "-" (Type I error)
  • False Negative - we predicted "-" and the true class is "+" (Type II error)
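
A small sketch of how these four counts can be tallied from predicted and actual labels; the label vectors are made up.

```python
# Hypothetical predicted and actual binary labels (1 = positive class).
predicted = [1, 1, 0, 1, 0, 0, 1, 0]
actual    = [1, 0, 0, 1, 1, 0, 1, 0]

pairs = list(zip(predicted, actual))
tp = sum(p == 1 and a == 1 for p, a in pairs)   # predicted "+", actually "+"
tn = sum(p == 0 and a == 0 for p, a in pairs)   # predicted "-", actually "-"
fp = sum(p == 1 and a == 0 for p, a in pairs)   # predicted "+", actually "-" (Type I error)
fn = sum(p == 0 and a == 1 for p, a in pairs)   # predicted "-", actually "+" (Type II error)

print(tp, fp, fn, tn)   # 3 1 1 3
```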


Two Classes: $C_+$ and $C_-$


Precision

Precision

  • $\pi = P\big(f(\mathbf x) = C_+ \, \big| \, h_{\theta}(\mathbf x) = C_+ \big)$
  • given that we predict $\mathbf x$ is +
  • what's the probability that the decision is correct
  • we estimate precision as $P = \cfrac{\# \text{TP}}{\# \text{predicted positives}} = \cfrac{\# \text{TP}}{\# \text{TP} + \# \text{FP}}$


Interpretation

  • Out of all the people we thought had cancer, how many actually had it?
  • High precision is good
  • we don't tell many people that they have cancer when they actually don't


Recall

Recall

  • $\rho = P\big(h_{\theta}(\mathbf x) = C_+ \, \big| \, f(\mathbf x) = C_+ \big)$
  • given a positive instance $\mathbf x$
  • what's the probability that we predict correctly
  • we estimate recall as $R = \cfrac{\# \text{TP}}{\# \text{actual positives}} = \cfrac{\# \text{TP}}{\# \text{TP} + \# \text{FN}}$


Interpretation

  • Out of all the people that actually have cancer, how many did we identify?
  • The higher the better:
  • We don't fail to spot many people that actually have cancer


  • For a classifier that always predicts 0 (i.e. $h_{\theta}(x) = 0$), the recall would be zero
  • So recall gives us a more useful evaluation metric than plain accuracy when classes are skewed
  • and if both precision and recall are high, we can be much more confident that the classifier is doing well (see the sketch below)
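
A short sketch of these estimates on made-up labels; it also checks that a classifier that always predicts 0 indeed gets zero recall.

```python
def precision_recall(predicted, actual):
    """Estimate P = TP / (TP + FP) and R = TP / (TP + FN) from label lists."""
    tp = sum(p == 1 and a == 1 for p, a in zip(predicted, actual))
    fp = sum(p == 1 and a == 0 for p, a in zip(predicted, actual))
    fn = sum(p == 0 and a == 1 for p, a in zip(predicted, actual))
    p = tp / (tp + fp) if tp + fp else 0.0   # no positive predictions: define P as 0
    r = tp / (tp + fn) if tp + fn else 0.0
    return p, r

actual    = [1, 0, 0, 1, 1, 0, 1, 0]
predicted = [1, 1, 0, 1, 0, 0, 1, 0]
print(precision_recall(predicted, actual))           # (0.75, 0.75)
print(precision_recall([0] * len(actual), actual))   # always predict 0 -> (0.0, 0.0)
```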


F Measure

$P$ and $R$ don't make sense in isolation from each other

  • a higher $\rho$ may be obtained by lowering $\pi$, and vice versa


Suppose we have a ranking classifier that produces some score for $\mathbf x$

  • we decide whether to classify it as $C_+$ or $C_-$ based on some threshold parameter $\tau$
  • by varying $\tau$ we will get different precision and recall
  • improving recall will lead to worse precision
  • improving precision will lead to worse recall
  • how to pick the threshold?
  • combine $P$ and $R$ into one measure (also see ROC Analysis)


$F_\beta = \cfrac{(\beta^2 + 1) P\, R}{\beta^2 \, P + R}$

  • $\beta$ is the tradeoff between $P$ and $R$
  • if $\beta$ is close to 0, then we give more importance to $P$
    • $F_0 = P$
  • if $\beta$ goes to $+ \infty$, we give more importance to $R$ ($F_\beta \to R$)


When $\beta = 1$ we have $F_1$ score:

  • The $F_1$-score is a single measure of performance of the test.
  • it's the harmonic mean of precision $P$ and recall $R$
  • $F_1 = 2 \cfrac{P \, R}{P + R}$


Motivation: Precision and Recall

Let's say we trained a Logistic Regression classifier

  • we predict 1 if $h_{\theta}(x) \geqslant 0.5$
  • we predict 0 if $h_{\theta}(x) < 0.5$

Suppose we want to predict y = 1 (i.e. people have cancer) only if we're very confident

  • we may change the threshold to 0.7
    • we predict 1 if $h_{\theta}(x) \geqslant 0.7$
    • we predict 0 if $h_{\theta}(x) < 0.7$
  • We'll have higher precision in this case (those for whom we predicted y = 1 are more likely to actually have it)
  • But lower recall (we'll miss more patients that actually have cancer because we failed to spot them)

Let's consider the opposite

  • Suppose we want to avoid missing too many cases of y=1 (i.e. we want to avoid false negatives)
  • So we may change the threshold to 0.3
    • we predict 1 if $h_{\theta}(x) \geqslant 0.3$
    • we predict 0 if $h_{\theta}(x) < 0.3$
  • That leads to:
  • higher recall (we'll correctly flag a higher fraction of patients with cancer)
  • lower precision (a higher fraction of the flagged patients will turn out to not have cancer)
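
A sketch of this trade-off on made-up classifier scores $h_\theta(x)$: raising the threshold increases precision and lowers recall, and lowering it does the opposite.

```python
def precision_recall_at(scores, actual, threshold):
    """Predict 1 when the score is >= threshold, then estimate P and R."""
    predicted = [1 if s >= threshold else 0 for s in scores]
    tp = sum(p == 1 and a == 1 for p, a in zip(predicted, actual))
    fp = sum(p == 1 and a == 0 for p, a in zip(predicted, actual))
    fn = sum(p == 0 and a == 1 for p, a in zip(predicted, actual))
    p = tp / (tp + fp) if tp + fp else 0.0
    r = tp / (tp + fn) if tp + fn else 0.0
    return round(p, 3), round(r, 3)

# Hypothetical logistic regression outputs h_theta(x) and true labels y
scores = [0.95, 0.80, 0.75, 0.60, 0.55, 0.40, 0.35, 0.20]
actual = [1,    1,    0,    1,    0,    1,    0,    0]

for threshold in [0.3, 0.5, 0.7]:
    print(threshold, precision_recall_at(scores, actual, threshold))
# 0.3 -> (0.571, 1.0)   higher recall, lower precision
# 0.5 -> (0.6, 0.75)
# 0.7 -> (0.667, 0.5)   higher precision, lower recall
```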


Questions

  • Is there a way to automatically choose the threshold for us?
  • How to compare precision and recall numbers and decide which algorithm is better?
  • at the beginning we had a single number (the error rate), but now we have two and need to decide which one to prefer
  • $F_1$ score helps to decide since it's just one number


Example

Suppose we have 3 algorithms $A_1$, $A_2$, $A_3$, and we captured the following metrics:


        $P$     $R$     $\text{Avg}$   $F_1$
$A_1$   0.5     0.4     0.45           0.444    $\leftarrow$ our choice
$A_2$   0.7     0.1     0.4            0.175
$A_3$   0.02    1.0     0.54           0.0392


Here the best is $A_1$ because it has the highest $F_1$-score
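
The $F_1$ column of the table can be reproduced directly:

```python
def f1(p, r):
    """Harmonic mean of precision P and recall R."""
    return 2 * p * r / (p + r)

for name, p, r in [("A1", 0.5, 0.4), ("A2", 0.7, 0.1), ("A3", 0.02, 1.0)]:
    print(name, round(f1(p, r), 4))   # A1 0.4444, A2 0.175, A3 0.0392
```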


Precision and Recall for Clustering

We can use precision and recall to evaluate the result of clustering by treating every pair of documents as a decision: should the two documents go to the same cluster or not?

Correct decisions:

  • TP = decision to assign two similar documents to the same cluster
  • TN = assign two dissimilar documents to different clusters

Errors:

  • FP: assign two dissimilar documents to the same cluster
  • FN: assign two similar documents to different clusters


So the confusion matrix is:

                      Same cluster    Different clusters
Same class            TP              FN
Different classes     FP              TN


Example

Consider the following example (from the IR book [3])

  • img1393.png (three clusters containing 6, 6, and 5 documents, $n = 17$ in total)
  • there are $\cfrac{n \, (n - 1)}{2} = 136$ pairs of documents
  • $\text{TP} + \text{FP} = {6 \choose 2} + {6 \choose 2} + {5 \choose 2} = 40$
  • $\text{TP} = {5 \choose 2} + {4 \choose 2} + {3 \choose 2} + {2 \choose 2} = 20$
  • etc

So we have the following contingency table:

                      Same cluster          Different clusters
Same class            $\text{TP} = 20$      $\text{FN} = 24$
Different classes     $\text{FP} = 20$      $\text{TN} = 72$


Thus,

  • $P = 20/40 = 0.5$ and $R = 20/44 \approx 0.455$
  • $F_1$ score is $F_1 \approx 0.48$
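
Below is a sketch of the pairwise counting. The cluster contents are my reading of the IR book figure (cluster 1: five x's and one o; cluster 2: one x, four o's and one diamond; cluster 3: two x's and three diamonds), so treat that composition as an assumption; the resulting counts match the table above.

```python
from itertools import combinations

# Assumed composition of the three clusters; each document is represented
# by the class label ("x", "o", "d") it actually belongs to.
clusters = [
    ["x", "x", "x", "x", "x", "o"],   # cluster 1
    ["x", "o", "o", "o", "o", "d"],   # cluster 2
    ["x", "x", "d", "d", "d"],        # cluster 3
]

# Flatten into (cluster_id, class_label) per document.
docs = [(i, label) for i, cluster in enumerate(clusters) for label in cluster]

tp = fp = fn = tn = 0
for (c1, l1), (c2, l2) in combinations(docs, 2):   # every pair of documents
    same_cluster, same_class = c1 == c2, l1 == l2
    if same_cluster and same_class:
        tp += 1
    elif same_cluster:
        fp += 1
    elif same_class:
        fn += 1
    else:
        tn += 1

p, r = tp / (tp + fp), tp / (tp + fn)
f1 = 2 * p * r / (p + r)
print(tp, fp, fn, tn)                              # 20 20 24 72
print(round(p, 3), round(r, 3), round(f1, 2))      # 0.5 0.455 0.48
```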


Multi-Class Problems

How do we adapt precision and recall to multi-class problems?

  • let $f(\cdot)$ be the target unknown function and $h_\theta(\cdot)$ the model
  • let $C_1, \ ... , C_K$ be labels we want to predict ($K$ labels)


Precision w.r.t class $C_i$ is

  • $P\big(f(\mathbf x) = C_i \ \big| \ h_\theta(\mathbf x) = C_i \big)$
  • probability that given that we classified $\mathbf x$ as $C_i$
  • the decision is indeed correct


Recall w.r.t. class $C_i$ is

  • $P\big(h_\theta(\mathbf x) = C_i \ \big| \ f(\mathbf x) = C_i \big)$
  • given an instance $\mathbf x$ belongs to $C_i$
  • what's the probability that we predict correctly


We estimate these probabilities using a contingency table w.r.t each class $C_i$


Contingency Table for $C_i$:

  • let $C_+$ be $C_i$ and
  • let $C_-$ be all other classes, i.e. $C_- = \{ C_j : j \neq i \}$ (all classes except for $C_i$)
  • then we create a contingency table
  • and calculate $\text{TP}_i, \text{FP}_i, \text{FN}_i, \text{TN}_i$ for them
  • 468b51be729a42ff8195b6fc05292508.png


Now estimate precision and recall for class $C_i$

  • $P_i = \cfrac{\text{TP}_i}{\text{TP}_i + \text{FP}_i}$
  • $R_i = \cfrac{\text{TP}_i}{\text{TP}_i + \text{FN}_i}$
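
A sketch of this one-vs-rest counting on a small made-up 3-class example:

```python
def per_class_counts(predicted, actual, label):
    """Treat `label` as C+ and every other class as C-, then count TP, FP, FN, TN."""
    tp = sum(p == label and a == label for p, a in zip(predicted, actual))
    fp = sum(p == label and a != label for p, a in zip(predicted, actual))
    fn = sum(p != label and a == label for p, a in zip(predicted, actual))
    tn = sum(p != label and a != label for p, a in zip(predicted, actual))
    return tp, fp, fn, tn

# Hypothetical predictions for a 3-class problem
predicted = ["a", "a", "b", "c", "b", "a", "c", "b"]
actual    = ["a", "b", "b", "c", "a", "a", "b", "b"]

for label in ["a", "b", "c"]:
    tp, fp, fn, tn = per_class_counts(predicted, actual, label)
    p_i = tp / (tp + fp) if tp + fp else 0.0
    r_i = tp / (tp + fn) if tp + fn else 0.0
    print(label, (tp, fp, fn, tn), round(p_i, 2), round(r_i, 2))
```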


Averaging

  • These precision and recall values are calculated for each class separately
  • how do we combine them?


Micro-averaging

  • calculate TP, FP, FN, TN globally and then calculate Precision and Recall
  • let
    • $\text{TP} = \sum_i \text{TP}_i$
    • $\text{FP} = \sum_i \text{FP}_i$
    • $\text{FN} = \sum_i \text{FN}_i$
    • $\text{TN} = \sum_i \text{TN}_i$
  • and then calculate precision and recall as
    • $P^\mu = \cfrac{\text{TP}}{\text{TP} + \text{FP}}$
    • $R^\mu = \cfrac{\text{TP}}{\text{TP} + \text{FN}}$


Macro-averaging

  • similar to the One-vs-All Classification technique
  • calculate $P_i$ and $R_i$ "locally" for each $C_i$
  • and then let $P^M = \cfrac{1}{K} \sum_i P_i$ and $R^M = \cfrac{1}{K} \sum_i R_i$


Micro and macro averaging behave quite differently and may give different results

  • the ability to behave well on categories with low generality (fewer training examples) is emphasized by macro-averaging and much less so by micro-averaging
  • which one to use depends on the application
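
A sketch of both averaging schemes on hypothetical per-class counts $(\text{TP}_i, \text{FP}_i, \text{FN}_i)$; the third class is deliberately rare, which is where the two schemes diverge most.

```python
# Hypothetical per-class counts: class -> (TP_i, FP_i, FN_i).
counts = {
    "c1": (90, 10, 10),
    "c2": (80, 20, 15),
    "c3": (2, 1, 7),      # a rare class with few positive examples
}
K = len(counts)

# Micro-averaging: pool the counts globally, then compute P and R once.
tp = sum(tp_i for tp_i, _, _ in counts.values())
fp = sum(fp_i for _, fp_i, _ in counts.values())
fn = sum(fn_i for _, _, fn_i in counts.values())
p_micro, r_micro = tp / (tp + fp), tp / (tp + fn)

# Macro-averaging: compute P_i and R_i per class, then take the plain mean.
p_macro = sum(tp_i / (tp_i + fp_i) for tp_i, fp_i, _ in counts.values()) / K
r_macro = sum(tp_i / (tp_i + fn_i) for tp_i, _, fn_i in counts.values()) / K

print(round(p_micro, 3), round(r_micro, 3))   # dominated by the frequent classes
print(round(p_macro, 3), round(r_macro, 3))   # pulled down by the rare class c3
```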


This approach is often used in Document Classification

Sources

  • Machine Learning (Coursera)
  • Sebastiani, Fabrizio. "Machine learning in automated text categorization." (2002). [4]
  • Zhai, ChengXiang. "Statistical language models for information retrieval." (2008).
  • Information Retrieval (UFRT)
  • Manning, Christopher D., Prabhakar Raghavan, and Hinrich Schütze. "Introduction to information retrieval." (2008). [5]