Stop Words
Stop words are “function words”:
- they are useful syntactically and grammatically, but say nothing about the content of a document
- they are topic-neutral: a stop word is about equally likely to occur in relevant and non-relevant documents, so it is not very useful for Information Retrieval
- they are present everywhere: the most frequent words of a corpus are usually stop words
- for example, “the”, “a”, “an”, …
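The link between frequency and stop words can be illustrated with a small sketch. The corpus below is made up for the example, and splitting on whitespace is a simplifying assumption rather than real tokenization.

```python
from collections import Counter

# Toy corpus, invented for illustration only.
docs = [
    "the cat sat on the mat",
    "the dog chased the cat",
    "a dog and a cat are in the garden",
]

# Count how often each token occurs across the whole corpus.
counts = Counter(token for doc in docs for token in doc.split())

# The most frequent tokens are candidates for a stop list.
top = [word for word, _ in counts.most_common(2)]
print(top)  # ['the', 'cat']
```

Note that “cat” also ranks high in this tiny corpus, so frequency alone only suggests candidates: a content word that happens to be frequent still has to be filtered out by other means.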
Stop Words Removal
In many cases stop words are not needed, for example in Information Retrieval or NLP:
- they don’t have enough descriptive power to distinguish between relevant and non-relevant documents: all documents have them
- so they are often removed before indexing
- this also makes the index much smaller
- it can be seen as a Dimensionality Reduction technique for text data
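A minimal sketch of the effect on index size, assuming a tiny hand-written stop list and whitespace tokenization. The names `STOP_WORDS` and `build_index` are hypothetical helpers for this example, not part of any library.

```python
from collections import defaultdict

# Tiny hypothetical stop list; real lists are much longer.
STOP_WORDS = {"the", "a", "an", "on", "in", "and", "are"}

docs = {
    0: "the cat sat on the mat",
    1: "the dog chased the cat",
    2: "a dog and a cat are in the garden",
}

def build_index(docs, stop_words=frozenset()):
    """Map each term to the set of ids of the documents containing it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for token in text.lower().split():
            if token not in stop_words:
                index[token].add(doc_id)
    return dict(index)

full_index = build_index(docs)
filtered_index = build_index(docs, STOP_WORDS)

# Number of distinct terms (dimensions) before and after removal.
print(len(full_index), len(filtered_index))  # 12 6
```

In this toy example half of the vocabulary disappears, while the terms that remain (“cat”, “dog”, “garden”, …) are the ones that actually separate the documents.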
NLP Pipeline
Stop word removal is one of the steps of the NLP Pipeline:
- for building an Inverted Index
- for building a Vector Space Model
Implementing
Stop word lists for English:
- http://www.ranks.nl/stopwords
- http://www.textfixer.com/resources/common-english-words.txt
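Stop word removal in NLTK [http://stackoverflow.com/questions/19130512/stopword-removal-with-nltk] (the stop word corpus has to be downloaded once with nltk.download('stopwords')):
>>> from nltk.corpus import stopwords
>>> # a set makes the membership test faster than scanning a list
>>> stop = set(stopwords.words('english'))
>>> sentence = "this is a foo bar sentence"
>>> print([i for i in sentence.split() if i not in stop])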
['foo', 'bar', 'sentence']
Stop Words Usage
In some tasks stop words are not removed; instead, they are used directly.
Examples:
- author identification (“the little words give authors away”)
- language detection: each language has a very distinctive set of stop words, so they can be used to detect the language of a text (see e.g. here [http://blog.alejandronolla.com/2013/05/15/detecting-text-language-with-python-and-nltk/])
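The language detection idea can be sketched as follows. The three stop lists are tiny hand-picked assumptions for illustration, and `detect_language` is a hypothetical helper; a real detector would use full stop word lists and handle ties and unknown languages.

```python
# Tiny hand-picked stop lists, assumptions for illustration only.
STOP_WORDS = {
    "english": {"the", "a", "an", "and", "is", "of", "in", "to"},
    "german": {"der", "die", "das", "und", "ist", "ein", "in", "zu"},
    "french": {"le", "la", "les", "et", "est", "un", "de", "dans"},
}

def detect_language(text):
    """Return the language whose stop list overlaps most with the text's tokens."""
    tokens = set(text.lower().split())
    scores = {lang: len(tokens & words) for lang, words in STOP_WORDS.items()}
    return max(scores, key=scores.get)

print(detect_language("the cat is in the garden"))     # english
print(detect_language("die Katze ist in dem Garten"))  # german
```

The score is simply the number of distinct stop words of each language found in the text, so longer texts give a more reliable answer than short ones.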
Stop Words Learning
- Stop words can be learned from the text, usually by looking at the top (most frequent) words and selecting them manually
- But this process can be automated (Wilbur1992):
- use Term Strength to discover stop words automatically
- Term Strength: given a pair of documents, what is the probability that a term occurring in one document of the pair also occurs in the other?
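The Term Strength definition above can be estimated by counting over document pairs. This is a rough sketch only: which pairs to use is an assumption here (all ordered pairs of a toy corpus), `term_strength` is a hypothetical helper, and the step that turns the score into a stop list decision is not covered.

```python
from itertools import permutations

def term_strength(term, pairs):
    """Estimate P(term in second doc | term in first doc) over ordered pairs."""
    in_first = 0
    in_both = 0
    for first, second in pairs:
        if term in first:
            in_first += 1
            if term in second:
                in_both += 1
    # A term that never occurs gets strength 0.0 by convention (an assumption).
    return in_both / in_first if in_first else 0.0

# Toy corpus, invented for illustration; each document is a set of tokens.
docs = [
    set("the cat sat on the mat".split()),
    set("the dog chased the cat".split()),
    set("a dog and a cat are in the garden".split()),
]

# Assumption: every ordered pair of distinct documents is used.
pairs = list(permutations(docs, 2))

for term in ["the", "dog", "mat"]:
    print(term, term_strength(term, pairs))
```

Here “the” occurs in every document and gets strength 1.0, “dog” occurs in two of the three and gets 0.5, and “mat” occurs in only one and gets 0.0.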
Sources
- Information Retrieval (UFRT)
- Aggarwal, Charu C., and ChengXiang Zhai. “A survey of text clustering algorithms.” Mining Text Data. 2012.
- Wilbur, W. John. “The automatic identification of stop words.” 1992. [http://www.researchgate.net/publication/247786801_The_automatic_identification_of_stop_words]