Stemming

Stemming is a step in the NLP pipeline that is useful in Text Mining and Information Retrieval

  • a stemming algorithm extracts (an approximation of) the morphological root of a word by stripping its affixes


Usage:


Algorithms

The goal is to reduce words to a stem (root) form

  • stemmers use language-dependent rules
  • usually the rules take the form of an automaton that gradually reduces a token to its stem
  • examples are the Porter algorithm and the Snowball stemmer


Porter Stemmer

It's a set of suffix-rewriting rules for reducing a word, for example:

  • sses -> ss
  • ies -> i
  • ational -> ate
  • tional -> tion
  • s -> $\varnothing$
  • when several rules match, the rule with the longest matching suffix wins
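
The rules above can be sketched as a tiny suffix rewriter. This is only a toy illustration of "longest matching suffix wins", not the full Porter algorithm (which applies its rules in several ordered steps and with extra conditions on the remaining stem). The name `apply_longest_rule` is hypothetical, and the `ss -> ss` rule is an addition taken from Porter's first step so that words like "caress" are left alone.

```python
# Toy suffix rewriter: a sketch of longest-match rule selection,
# NOT a complete Porter stemmer.
RULES = [
    ("sses", "ss"),
    ("ies", "i"),
    ("ational", "ate"),
    ("tional", "tion"),
    ("ss", "ss"),   # addition: keeps a final "ss" intact
    ("s", ""),
]


def apply_longest_rule(word, rules=RULES):
    """Rewrite the longest suffix of `word` that matches any rule."""
    matching = [(suffix, repl) for suffix, repl in rules if word.endswith(suffix)]
    if not matching:
        return word
    # Among all matching rules, pick the one with the longest suffix.
    suffix, repl = max(matching, key=lambda rule: len(rule[0]))
    return word[: len(word) - len(suffix)] + repl


for w in ["caresses", "ponies", "relational", "conditional", "cats"]:
    print(w, "->", apply_longest_rule(w))
```

Note that "relational" matches both `ational` and `tional`; the longer suffix is chosen, giving "relate" rather than "relation".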

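Example (idealized; a real Porter implementation may give slightly different stems for some of these words)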

  • economy, economic, economical, economically, economics, economize => econom
  • automates, automatic, automation => automat
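
To see what an actual implementation does with these words, one option is to run them through NLTK's `PorterStemmer`. This is a minimal sketch that assumes NLTK is installed; the variable names are illustrative, and the exact stems depend on the implementation, so no output is listed here.

```python
from nltk.stem import PorterStemmer

porter = PorterStemmer()

words = [
    "economy", "economic", "economical", "economically",
    "economics", "economize",
    "automates", "automatic", "automation",
]

# Map every word to the stem the Porter implementation produces for it.
stems = {word: porter.stem(word) for word in words}

for word, stem in stems.items():
    print(f"{word} -> {stem}")
```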


Snowball Stemmer

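An improved version of the Porter stemmer (the English variant is also known as Porter2); Snowball stemmers also exist for many other languages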


Programming

Python / NLTK

from nltk.stem import SnowballStemmer

# Build an English Snowball stemmer and stem a single token
snowball_stemmer = SnowballStemmer('english')
unigram = 'stemming'  # example token
stem = snowball_stemmer.stem(unigram)


Downsides

  • often over-reduces words: unrelated words can end up with the same stem, and the stem is frequently not a real word
  • e.g. universe -> univers, university -> univers: different words, same stem
  • in applications where it's important to distinguish between these words, use Lemmatization instead (although it's more computationally expensive)
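
The collision described above can be checked by grouping words by their stem. This is a minimal sketch that assumes NLTK is installed; `group_by_stem` is a hypothetical helper, and the word list (beyond "universe" and "university") is made up for illustration.

```python
from collections import defaultdict

from nltk.stem import SnowballStemmer

stemmer = SnowballStemmer("english")


def group_by_stem(words):
    """Return a dict mapping each stem to the list of words that produce it."""
    groups = defaultdict(list)
    for word in words:
        groups[stemmer.stem(word)].append(word)
    return dict(groups)


groups = group_by_stem(["universe", "university", "universal", "cat", "cats"])

# Any group with more than one member is a set of words the stemmer conflates.
for stem, members in groups.items():
    if len(members) > 1:
        print(f"{stem}: {members}")
```

Groups such as "cat"/"cats" are the intended effect of stemming, while a group that mixes "universe" and "university" is the kind of conflation that may call for Lemmatization instead.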


Sources