Stemming
Stemming is a step in the NLP pipeline that is useful in Text Mining and Information Retrieval
- a stemming algorithm reduces a word to its stem, an approximation of its morphological root, by stripping affixes
Usage:
Algorithms
The goal is to reduce words to a stem (root) form
- stemmers use language-dependent rules
- the rules usually take the form of an automaton that gradually reduces a token to its stem
- for example, the Porter algorithm and the Snowball stemmer
Porter Stemmer
It's a set of suffix-rewriting rules for reducing a word, for example:
- sses -> ss
- ies -> i
- ational -> ate
- tional -> tion
- s -> $\varnothing$
- when several rules match, the rule with the longest suffix wins
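The "longest suffix wins" idea can be sketched in a few lines of Python. This is a toy illustration only: the `RULES` table and the `apply_longest_rule` helper are hypothetical names, the table holds just the rules listed here (with `sses -> ss` as in Porter's published step 1a, plus the `ss -> ss` guard from the same step), and it ignores the stem-length conditions and multi-step structure of the real algorithm.

```python
# Toy suffix-rewrite table (illustrative subset, NOT the full Porter rule set).
RULES = {
    "sses": "ss",
    "ies": "i",
    "ational": "ate",
    "tional": "tion",
    "ss": "ss",   # guard: keeps "caress" from losing its final "s"
    "s": "",
}


def apply_longest_rule(word: str) -> str:
    """Rewrite the longest suffix of `word` that appears in RULES."""
    # Try suffixes from longest to shortest so the longest match wins.
    for suffix in sorted(RULES, key=len, reverse=True):
        if word.endswith(suffix):
            return word[: len(word) - len(suffix)] + RULES[suffix]
    return word  # no rule applies


for w in ["caresses", "ponies", "relational", "conditional", "cats"]:
    print(w, "->", apply_longest_rule(w))
```

Note that "relational" matches both `ational` and `tional`; sorting by suffix length makes the longer rule fire first.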
Example
- economy, economic, economical, economically, economics, economize => econom
- automates, automatic, automation => automat
Snowball Stemmer
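The English Snowball stemmer (also known as Porter2) is a revised version of the original Porter algorithm, written by Martin Porter in his Snowball language for defining stemmers
- Snowball stemmers also exist for languages other than English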
Programming
Python / NLTK
from nltk.stem import SnowballStemmer
# build a stemmer for English and stem a single token
snowball_stemmer = SnowballStemmer('english')
unigram = 'universities'  # example token
stem = snowball_stemmer.stem(unigram)
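To see how the two stemmers differ in practice, they can be run side by side. The following is a minimal sketch assuming NLTK is installed; `compare_stems` and the word list are hypothetical, chosen only for illustration.

```python
from nltk.stem import PorterStemmer, SnowballStemmer

porter = PorterStemmer()
snowball = SnowballStemmer("english")


def compare_stems(words):
    """Return (word, porter_stem, snowball_stem) triples."""
    return [(w, porter.stem(w), snowball.stem(w)) for w in words]


# Print the stems produced by each stemmer for a few sample words.
for word, p, s in compare_stems(["generously", "universe", "university"]):
    print(f"{word}: porter={p} snowball={s}")
```

Words on which the two columns disagree show where the revised rules change the result.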
Downsides
- stemming often over-reduces words, so unrelated words can end up with the same stem
- e.g. universe -> univers, university -> univers: different words, same stem
- in applications where it's important to distinguish between these words, use Lemmatization instead (although it's more computationally expensive)
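One way to check whether such collisions matter for a given vocabulary is to group words by their stem and look at the groups with more than one member. This is a sketch assuming NLTK is installed; `stem_collisions` is a hypothetical helper, not part of NLTK.

```python
from collections import defaultdict

from nltk.stem import SnowballStemmer


def stem_collisions(words, language="english"):
    """Map each stem to the words that share it, keeping only real collisions."""
    stemmer = SnowballStemmer(language)
    groups = defaultdict(list)
    for w in words:
        groups[stemmer.stem(w)].append(w)
    # Keep only stems shared by two or more distinct input words.
    return {stem: ws for stem, ws in groups.items() if len(ws) > 1}


collisions = stem_collisions(["universe", "university", "cat"])
print(collisions)
```

If the reported groups contain words that the application must keep apart, that is a sign lemmatization may be the better fit.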