Stop Words
Stop words are function words:
- they are useful syntactically and grammatically, but say little about the content of a document
- they are topic-neutral: a stop word is about equally likely to occur in relevant and non-relevant documents, so it is not very useful for Information Retrieval
- they are present everywhere: the most frequent words in a corpus are usually stop words
- for example, "the", "a", "an", ...
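The claim that the most frequent words tend to be stop words can be checked with a simple frequency count. The sketch below is only an illustration: the toy corpus `docs` and the helper `top_words` are made up for this example, and tokenization is a plain whitespace split.

```python
from collections import Counter

# Hypothetical toy corpus, made up for illustration.
docs = [
    "the cat sat on the mat",
    "the dog is in the garden",
    "a bird is on the roof of the house",
]

def top_words(documents, k=3):
    """Return the k most frequent lowercase tokens across all documents."""
    counts = Counter()
    for doc in documents:
        # Naive tokenization: lowercase and split on whitespace.
        counts.update(doc.lower().split())
    return counts.most_common(k)

# In this corpus the three most frequent tokens are all function words.
print(top_words(docs))
```

On a real corpus the top of such a list is usually a good starting point for a stop list, although frequent content words can also appear there.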
Stop Words Removal
In many cases stop words are not needed:
- for example, in Information Retrieval or NLP
- they don't have enough descriptive power to distinguish relevant documents from non-relevant ones: almost all documents contain them
- so they are often removed before indexing
- this also makes the index much smaller
- it can be seen as a Dimensionality Reduction technique for text data
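The effect on index size can be sketched with a toy inverted index. Everything here is an assumption made for illustration: the stop list `STOP_WORDS` is hand-picked, and `build_index` and `postings_count` are hypothetical helpers, not part of any library.

```python
from collections import defaultdict

# Hypothetical hand-picked stop list, for illustration only.
STOP_WORDS = {"the", "a", "an", "is", "on", "in", "of"}

def build_index(documents, stop_words=frozenset()):
    """Map each term to the set of ids of the documents that contain it."""
    index = defaultdict(set)
    for doc_id, text in enumerate(documents):
        for term in text.lower().split():
            if term not in stop_words:
                index[term].add(doc_id)
    return dict(index)

def postings_count(index):
    """Total number of (term, document) entries stored in the index."""
    return sum(len(doc_ids) for doc_ids in index.values())

docs = [
    "the cat sat on the mat",
    "the dog is in the garden",
    "a bird is on the roof of the house",
]

full = build_index(docs)
pruned = build_index(docs, STOP_WORDS)

# Fewer terms means fewer dimensions; fewer postings means a smaller index.
print(len(full), len(pruned))
print(postings_count(full), postings_count(pruned))
```

Both the vocabulary (the number of dimensions) and the number of postings shrink, and the removed terms are exactly the ones that matched nearly every document.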
Stop words removal is a part of the NLP Pipeline
Implementing
Stop words removal in NLTK [1]:
>>> # the stop word corpus must be downloaded once: nltk.download('stopwords')
>>> from nltk.corpus import stopwords
>>> stop = set(stopwords.words('english'))
>>> sentence = "this is a foo bar sentence"
>>> print([i for i in sentence.split() if i not in stop])
['foo', 'bar', 'sentence']
Stop Words Usage
There are cases when stop words are kept and even put to use
Examples:
- author identification ("the little words give authors away")
- language detection: each language has a very distinctive set of stop words, so they can be used to detect the language of a text (see e.g. [2])
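A minimal sketch of stop-word-based language detection, assuming very small hand-written stop lists: the dictionary `STOP_LISTS` and the function `guess_language` are hypothetical and only illustrate the idea of counting stop word hits per language.

```python
# Hypothetical, deliberately tiny stop lists; real lists are much longer.
STOP_LISTS = {
    "english": {"the", "and", "is", "of", "to", "in", "a"},
    "german": {"der", "die", "das", "und", "ist", "nicht", "ein"},
    "french": {"le", "la", "les", "et", "est", "un", "une"},
}

def guess_language(text, stop_lists=STOP_LISTS):
    """Return the language whose stop list matches the most tokens, or None."""
    tokens = text.lower().split()
    scores = {
        lang: sum(token in words for token in tokens)
        for lang, words in stop_lists.items()
    }
    best = max(scores, key=scores.get)
    # No stop word from any list was seen: refuse to guess.
    return best if scores[best] > 0 else None

print(guess_language("the cat is in the garden"))
print(guess_language("der Hund ist nicht im Garten"))
print(guess_language("le chat est sur la table"))
```

With full stop lists the same counting scheme can work on fairly short texts, because stop words make up a large share of the tokens in any language.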
Stop Words Learning
- Stop words can be learned from the text, usually by looking at the most frequent words and manually selecting the stop words among them
- But this process can be automated (Wilbur1992):
- use Term Strength for automatically discovering stop words
- Term Strength: given a pair of documents, what is the probability that a term occurring in one document of the pair also occurs in the other?
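The Term Strength definition above can be estimated by counting over document pairs. The sketch below is one possible reading, not necessarily the procedure of Wilbur1992: it assumes the pairs are already given (for example, pairs of related documents), represents each document as a set of terms, and counts every pair in both directions. The function name `term_strength` and the sample `pairs` are hypothetical. Choosing the pairs and the threshold below which a term is treated as a stop word is left out.

```python
from collections import Counter

def term_strength(pairs):
    """Estimate s(t) = P(t occurs in y | t occurs in x) over document pairs.

    `pairs` is an iterable of (x, y) tuples, each document being a set of
    terms. Each pair is counted in both directions (assumption).
    """
    in_first = Counter()  # times the term occurs in the first document
    in_both = Counter()   # times it also occurs in the second document
    for x, y in pairs:
        for first, second in ((x, y), (y, x)):
            for term in first:
                in_first[term] += 1
                if term in second:
                    in_both[term] += 1
    return {term: in_both[term] / in_first[term] for term in in_first}

# Hypothetical document pairs, made up for illustration.
pairs = [
    ({"the", "cat", "sat"}, {"the", "cat", "slept"}),
    ({"the", "dog", "ran"}, {"a", "dog", "barked"}),
]

strengths = term_strength(pairs)
for term in sorted(strengths):
    print(term, round(strengths[term], 2))
```

Here "cat" and "dog" occur on both sides of their pairs and get strength 1.0, while "sat" never reappears and gets 0.0.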
Sources
- Information Retrieval (UFRT)
- Aggarwal, Charu C., and ChengXiang Zhai. "A survey of text clustering algorithms." Mining Text Data. 2012.
- Wilbur, W. John, and Karl Sirotkin. "The automatic identification of stop words." Journal of Information Science. 1992. [3]