Stop Words

Stop words are function words:

  • they are useful syntactically and grammatically, but say little about the content of a document
  • they are topic-neutral: a stop word is about equally likely to occur in relevant and non-relevant documents, so it is of little use for Information Retrieval
  • they occur almost everywhere: the most frequent words in a corpus are usually stop words
  • examples: "the", "a", "an", ...
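
The frequency claim can be checked on a small scale. The sketch below counts tokens in a toy corpus; the corpus and variable names are invented for illustration and are not from the sources.

```python
from collections import Counter

# Toy corpus invented for illustration; not taken from the sources.
docs = [
    "the cat sat on the mat",
    "the dog chased the cat",
    "a dog and a cat are in the garden",
]

# Count how often each whitespace-separated token occurs across the corpus.
counts = Counter(token for doc in docs for token in doc.split())

# The most frequent token is a typical stop word.
most_frequent, frequency = counts.most_common(1)[0]
print(most_frequent, frequency)
```

On a corpus this small a content word such as "cat" also ranks high, which is one reason frequency alone is usually not enough to build a stop list.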


Stop Words Removal

In many cases stop words are not needed:

  • for example, in many Information Retrieval and NLP tasks
  • they lack the descriptive power to distinguish relevant from non-relevant documents: almost all documents contain them!
  • so they are often removed before indexing
  • removing them also makes the index much smaller
  • removal can be seen as a Dimensionality Reduction technique for text data
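
To make the effect on index size concrete, here is a minimal sketch of an inverted index built with and without a stop list. The stop list, the documents and the `build_index` helper are all hypothetical and hand-picked for illustration; real stop lists are much longer.

```python
# Hypothetical, hand-picked stop list for illustration only.
STOP_WORDS = {"the", "a", "an", "is", "on", "in", "of", "and"}

docs = [
    "the cat is on the mat",
    "a dog is in the garden",
    "the history of the garden",
]


def build_index(documents, stop_words=frozenset()):
    """Map each term to the set of ids of the documents that contain it."""
    index = {}
    for doc_id, text in enumerate(documents):
        for term in text.lower().split():
            if term not in stop_words:
                index.setdefault(term, set()).add(doc_id)
    return index


full_index = build_index(docs)
reduced_index = build_index(docs, STOP_WORDS)

# Number of distinct terms (dimensions) before and after removal.
print(len(full_index), len(reduced_index))
```

Each remaining term is one dimension of the document representation, so dropping stop words shrinks both the vocabulary and the posting lists that would otherwise span nearly every document.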


NLP Pipeline

Stop word removal is one step of the NLP Pipeline.


Implementing

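Stop word removal in NLTK [1] (Python 3; the stop word corpus has to be fetched once with nltk.download('stopwords')):

>>> from nltk.corpus import stopwords
>>> stop = set(stopwords.words('english'))  # a set gives fast membership tests
>>> sentence = "this is a foo bar sentence"
>>> print([w for w in sentence.split() if w not in stop])
['foo', 'bar', 'sentence']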


Stop Words Usage

There are cases when stop words are not removed, and are even put to use.

Examples:

  • author identification ("the little words give authors away")
  • language detection: each language has a distinctive set of stop words, so they can be used to detect the language of a text (see e.g. here [2])
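
The language detection idea can be sketched as follows. The stop word samples and the `detect_language` helper are hypothetical and deliberately tiny; this is not the method from [2], just one simple way to score overlap with per-language stop lists.

```python
# Tiny, hypothetical stop word samples; a real detector would use full lists.
STOP_WORDS = {
    "english": {"the", "a", "an", "is", "of", "and", "in", "to"},
    "german": {"der", "die", "das", "ist", "und", "in", "zu", "ein"},
    "french": {"le", "la", "les", "est", "et", "de", "un", "une"},
}


def detect_language(text):
    """Return the language whose stop word sample overlaps most with the text."""
    tokens = set(text.lower().split())
    scores = {lang: len(tokens & words) for lang, words in STOP_WORDS.items()}
    return max(scores, key=scores.get)


print(detect_language("the cat is in the garden"))
print(detect_language("die Katze ist in dem Garten"))
```

Ties and texts without any known stop word are not handled here; with such input the function simply returns the first language with the highest score.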


Stop Words Learning

  • Stop words can be learned from the text, usually by looking at the most frequent words and manually selecting the stop words among them
  • This process can also be automated by using Term Strength to discover stop words (Wilbur1992)
  • Term Strength: given a pair of documents, the probability that a term occurring in one document of the pair also occurs in the other
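
A rough sketch of the Term Strength estimate, under stated assumptions: the document pairs are given up front (here two hand-picked pairs, each taken in both directions), documents are reduced to token sets, and the `term_strength` helper is a hypothetical name. How pairs are selected and how the score is turned into a stop list are not covered by this sketch.

```python
def term_strength(term, pairs):
    """Estimate P(term in y | term in x) over ordered document pairs (x, y)."""
    in_first = 0
    in_both = 0
    for x, y in pairs:
        if term in x:
            in_first += 1
            if term in y:
                in_both += 1
    # A term that never occurs in a first document gets strength 0.0 here.
    return in_both / in_first if in_first else 0.0


# Toy documents invented for illustration.
docs = [
    "the stock market fell sharply",
    "the stock prices fell again",
    "the football match ended late",
    "the football team won again",
]
token_sets = [set(doc.split()) for doc in docs]

# Assumed pairs: documents 0-1 and 2-3, each taken in both directions.
related = [(0, 1), (1, 0), (2, 3), (3, 2)]
pairs = [(token_sets[i], token_sets[j]) for i, j in related]

for term in ("the", "stock", "again"):
    print(term, term_strength(term, pairs))
```

The estimate is just a ratio of two counts: pairs where the term occurs in both documents, divided by pairs where it occurs in the first one.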


Sources

  • Information Retrieval (UFRT)
  • Aggarwal, Charu C., and ChengXiang Zhai. "A survey of text clustering algorithms." Mining Text Data. 2012.
  • Wilbur, W. John, and Karl Sirotkin. "The automatic identification of stop words." 1992. [3]