Prioritizing

  • Suppose we want to build a spam classifier
  • How should we spend our time to do that?

Options:

  • Collect more data to have more samples
  • Develop sophisticated features based on email routing information, etc.
  • Develop a sophisticated algorithm to detect deliberately misspelled words such as m0rtgage

How do we choose which option is best?


Error Analysis

Recommended Approach

  • Start with the simplest possible algorithm that you can implement quickly (avoid premature optimization!)
  • Plot Learning Curves to decide whether more data, more features, etc. are likely to help
  • Do the Error Analysis
    • Manually examine the examples (in your CV set) that your algorithm misclassified.
    • See if you can spot any systematic trend in the types of examples it gets wrong
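
The first step of error analysis, pulling out the misclassified CV examples for manual review, can be sketched as follows. This is only an illustration: the keyword baseline `baseline_predict`, the `SPAM_WORDS` set, and the tiny `cv_set` are hypothetical and stand in for whatever quick first algorithm and data you actually have.

```python
def misclassified(predict, cv_examples):
    """Return (text, true_label, predicted_label) for every example predicted wrongly."""
    errors = []
    for text, label in cv_examples:
        predicted = predict(text)
        if predicted != label:
            errors.append((text, label, predicted))
    return errors


# Hypothetical quick baseline: spam (1) if the email contains a known spam word.
SPAM_WORDS = {"mortgage", "replica", "winner"}


def baseline_predict(text):
    words = text.lower().split()
    return 1 if any(w in SPAM_WORDS for w in words) else 0


# Made-up CV examples: (email text, label), where 1 = spam and 0 = not spam.
cv_set = [
    ("cheap mortgage rates today", 1),
    ("cheap m0rtgage rates today", 1),  # deliberate misspelling
    ("lunch at noon tomorrow", 0),
    ("you are a winner", 1),
]

errors = misclassified(baseline_predict, cv_set)
for text, label, predicted in errors:
    print(f"true={label} predicted={predicted}: {text}")
```

The returned list is what you would then read through by hand, looking for patterns in the mistakes.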


Example

Suppose $m_{\text{cv}} = 500$ and our algorithm misclassifies 100 of the CV examples.

We manually examine the 100 errors and categorize them based on:

  • what type of email it is
    • pharmacy: 12
    • replica: 4
    • steal password: 53 - it seems worth investing time in this category!
    • other: 31
  • what features might have helped the algorithm classify it correctly
    • deliberate misspelling: 15
    • unusual email routing: 16
    • unusual punctuation: 32 - so concentrate on this!
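
The tallies above can be turned into shares of the 100 errors and ranked, which makes the dominant category obvious. A minimal sketch using the counts from this example; the helper name `ranked_shares` is an assumption made for illustration. Note that the feature counts do not sum to 100, presumably because an email may show several of these cues or none.

```python
from collections import Counter

TOTAL_ERRORS = 100  # misclassified CV examples that were examined by hand

# Counts from the manual categorization above.
email_type = Counter({"pharmacy": 12, "replica": 4, "steal password": 53, "other": 31})
missing_feature = Counter({
    "deliberate misspelling": 15,
    "unusual email routing": 16,
    "unusual punctuation": 32,
})


def ranked_shares(counts, total):
    """Return (category, share of all errors) pairs, largest category first."""
    return [(name, n / total) for name, n in counts.most_common()]


for title, counts in [("email type", email_type), ("missing feature", missing_feature)]:
    print(title)
    for name, share in ranked_shares(counts, TOTAL_ERRORS):
        print(f"  {name}: {share:.0%}")
```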


Numerical Evaluation

  • Error Analysis does not always tell us whether a particular new idea is likely to improve performance
  • The only solution in this case is to try it and see if it works
  • For that we need a single-number evaluation (e.g. Cross-Validation error) of the algorithm's performance with and without the new code/idea/etc
  • So we need to use Error Metrics
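
Comparing a single number with and without a change might look like the sketch below. Everything here is a hypothetical stand-in: the keyword classifier, the `normalize` step that maps the digit 0 to the letter o, and the five-example CV set are made up for illustration, and a real comparison would use your own classifier and full CV set.

```python
def cv_error(predict, cv_examples):
    """Fraction of CV examples that predict() labels incorrectly."""
    wrong = sum(1 for text, label in cv_examples if predict(text) != label)
    return wrong / len(cv_examples)


SPAM_WORDS = {"mortgage", "replica", "winner"}  # hypothetical keyword list


def predict_baseline(text):
    """Spam (1) if any word is a known spam word, otherwise not spam (0)."""
    return 1 if any(w in SPAM_WORDS for w in text.lower().split()) else 0


def normalize(text):
    # Hypothetical new idea: undo one common obfuscation (digit 0 used for letter o).
    return text.lower().replace("0", "o")


def predict_with_idea(text):
    """Same keyword rule, applied after normalization."""
    return 1 if any(w in SPAM_WORDS for w in normalize(text).split()) else 0


# Made-up CV examples: (email text, label), where 1 = spam and 0 = not spam.
cv_set = [
    ("cheap mortgage rates today", 1),
    ("cheap m0rtgage rates today", 1),
    ("lunch at noon tomorrow", 0),
    ("you are a winner", 1),
    ("meeting moved to 10 am", 0),
]

baseline_error = cv_error(predict_baseline, cv_set)
improved_error = cv_error(predict_with_idea, cv_set)
print(f"CV error without the idea: {baseline_error:.2f}")
print(f"CV error with the idea:    {improved_error:.2f}")
```

If the error with the idea is lower on the CV set, the idea is worth keeping; if not, it can be dropped without further debate.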


See also

Sources