ML Wiki

Prioritizing

Options:

How to choose what is better?

Recommended Approach

Start with the simplest possible algorithm (avoid premature optimization!) that you can implement quickly
- Implement it and test it on your cross-validation set
Plot Learning Curves to decide whether more data, more features, etc. are likely to help
Do the Error Analysis
- Manually examine the examples (in your CV set) that your algorithm misclassified.
- See if you spot any systematic trend in what types of examples it makes errors on
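
The manual-examination step can be sketched as follows. This is only an illustration: the helper `misclassified`, the keyword rule `toy_predict`, and the four-example CV set are all hypothetical and stand in for your own classifier and data.

```python
def misclassified(examples, labels, predict):
    """Return (index, example, true_label, predicted_label) for every CV error."""
    errors = []
    for i, (x, y) in enumerate(zip(examples, labels)):
        y_hat = predict(x)
        if y_hat != y:
            errors.append((i, x, y, y_hat))
    return errors


# Made-up CV set: 1 = spam, 0 = not spam (illustrative only).
cv_examples = [
    "cheap meds online",
    "meeting at noon",
    "verify your passw0rd",
    "lunch tomorrow?",
]
cv_labels = [1, 0, 1, 0]


def toy_predict(text):
    # Hypothetical stand-in for a trained classifier: a single keyword rule.
    return 1 if "meds" in text else 0


errors = misclassified(cv_examples, cv_labels, toy_predict)
for index, text, y, y_hat in errors:
    print(f"#{index}: {text!r} (true={y}, predicted={y_hat})")
```

The returned list is what you would then read through by hand, looking for patterns among the errors.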

Example: suppose the CV set has $m_{\text{cv}} = 500$ examples and our algorithm misclassifies 100 of them

We then manually examine the 100 errors and categorize them based on

what type of email it is
- pharmacy 12
- replica 4
- steal password 53 - seems it's worth investing time in this category!
- other 31
what features might help the algorithm to classify it correctly
- deliberate misspelling 15
- unusual email routing 16
- unusual punctuation 32 - so concentrate on this!
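
Once the errors are labelled by hand, ranking the categories is a simple tally. A minimal sketch using the counts from the example above; the helper name `ranked_shares` is an assumption made for illustration.

```python
from collections import Counter

N_ERRORS = 100  # misclassified CV examples in the example above

# Counts from the manual categorization above.
type_counts = Counter({
    "pharmacy": 12,
    "replica": 4,
    "steal password": 53,
    "other": 31,
})
# Feature counts need not sum to N_ERRORS: an error may show none or several.
feature_counts = Counter({
    "deliberate misspelling": 15,
    "unusual email routing": 16,
    "unusual punctuation": 32,
})


def ranked_shares(counts, total):
    """Return (name, count, share of all errors), largest category first."""
    return [(name, n, n / total) for name, n in counts.most_common()]


for title, counts in [("type", type_counts), ("feature", feature_counts)]:
    print(f"By {title}:")
    for name, n, share in ranked_shares(counts, N_ERRORS):
        print(f"  {name:<24}{n:>4}{share:>6.0%}")
```

Sorting by share makes the largest category, and so the most promising place to spend effort, stand out immediately.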

Error Analysis cannot always tell you whether a particular change is likely to improve performance
In that case the only solution is to try the change and see if it works
To judge that, we need a numerical evaluation (e.g. Cross-Validation error) of the algorithm's performance with and without the new code/idea/etc
So we need to use Error Metrics
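
A with/without comparison on a single number might look like the sketch below. Everything in it is assumed for illustration: the two rule-based classifiers, the "letters mixed with digits" check standing in for a deliberate-misspelling feature, and the five-example CV set. In practice each variant would be a model trained on the training set and scored on the real CV set.

```python
def cv_error(predict, examples, labels):
    """Fraction of CV examples the classifier gets wrong."""
    wrong = sum(1 for x, y in zip(examples, labels) if predict(x) != y)
    return wrong / len(examples)


# Made-up CV set: 1 = spam, 0 = not spam (illustrative only).
cv_examples = [
    "cheap meds online",
    "verify your passw0rd now",
    "meeting at noon",
    "w1n a fr3e watch",
    "lunch tomorrow?",
]
cv_labels = [1, 1, 0, 1, 0]


def baseline(text):
    # Hypothetical baseline: plain keyword matching.
    return 1 if ("meds" in text or "password" in text) else 0


def with_misspelling_feature(text):
    # Hypothetical variant: also flag tokens that mix letters and digits.
    mixed = any(
        any(c.isdigit() for c in tok) and any(c.isalpha() for c in tok)
        for tok in text.split()
    )
    return 1 if (baseline(text) == 1 or mixed) else 0


err_without = cv_error(baseline, cv_examples, cv_labels)
err_with = cv_error(with_misspelling_feature, cv_examples, cv_labels)
print(f"CV error without the feature: {err_without:.2f}")
print(f"CV error with the feature:    {err_with:.2f}")
```

Keeping the CV set fixed for both runs is what makes the two numbers comparable.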