# ML Wiki

## Stop Words

Stop words are function words:

• stop words are useful syntactically and grammatically, but say little about the content of a document
• they are topic-neutral: stop words are about equally likely to occur in relevant and non-relevant documents, so they are not very useful for Information Retrieval
• they are present everywhere: the most frequent words are usually stop words
• for example, "the", "a", "an", ...
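
As a rough illustration of the frequency claim above, the sketch below counts words in a toy text. The text is made up for this example, and splitting on whitespace is a simplifying assumption rather than a proper tokenizer.

```python
from collections import Counter

# Toy text, made up for illustration only.
text = (
    "the cat sat on the mat and the dog sat on the rug "
    "and a cat and a dog are on the mat"
)

# Count how often each whitespace-separated token occurs.
counts = Counter(text.split())

# The three most frequent tokens.
top_words = [word for word, _ in counts.most_common(3)]
print(top_words)
```

On this toy text the three most frequent tokens are all function words, while content words such as "cat" or "mat" occur less often.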

## Stop Words Removal

In many cases stop words are not needed:

• for example, in Information Retrieval or NLP
• they don't have enough descriptive power to distinguish between relevant and non-relevant documents: almost all documents contain them
• so they are often removed before indexing
• removing them also makes the index much smaller
• it can be seen as a Dimensionality Reduction technique for text data
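
The effect on index size can be sketched with a tiny inverted index. The documents, the hand-picked `STOP_WORDS` set, and the helper names `build_index` and `num_postings` are all assumptions made for this illustration; a real stop list is much longer.

```python
from collections import defaultdict

# Tiny hand-picked stop list (an assumption for illustration).
STOP_WORDS = {"the", "a", "an", "is", "on", "of", "and"}

# Toy documents, made up for illustration.
docs = {
    1: "the cat is on the mat",
    2: "a dog is on the rug",
    3: "the history of the cat and the dog",
}

def build_index(docs, stop_words=frozenset()):
    """Map each term to the set of ids of the documents that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in text.lower().split():
            if term not in stop_words:
                index[term].add(doc_id)
    return dict(index)

def num_postings(index):
    """Total number of (term, document) entries stored in the index."""
    return sum(len(doc_ids) for doc_ids in index.values())

full_index = build_index(docs)
pruned_index = build_index(docs, STOP_WORDS)

print(len(full_index), num_postings(full_index))      # vocabulary size, postings
print(len(pruned_index), num_postings(pruned_index))  # same, without stop words
```

In this toy collection the vocabulary shrinks from 11 terms to 5 and the number of postings from 17 to 7, which is the dimensionality reduction effect on a small scale.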

### NLP Pipeline

Stop words removal is a part of the NLP Pipeline.

### Implementing

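Stop words removal in NLTK [1] (the stop word corpus has to be downloaded once, e.g. with nltk.download('stopwords')):

>>> from nltk.corpus import stopwords
>>> stop = set(stopwords.words('english'))  # a set gives fast membership tests
>>> sentence = "this is a foo bar sentence"
>>> print([i for i in sentence.split() if i not in stop])
['foo', 'bar', 'sentence']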


## Stop Words Usage

There are cases where stop words are not removed, and are even useful.

Examples:

• author identification ("the little words give authors away")
• language detection: each language has a very distinctive set of stop words, so they can be used to detect the language of a text (see e.g. here [2])
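
The language detection idea can be sketched as counting how many tokens of a text fall into each language's stop list. The three stop lists below are tiny and hand-picked for illustration, and `detect_language` is a hypothetical helper, not a library function.

```python
# Very small stop lists, hand-picked for illustration;
# real lists contain hundreds of entries per language.
STOP_WORDS = {
    "english": {"the", "a", "an", "is", "and", "of", "in", "to"},
    "german": {"der", "die", "das", "ist", "und", "ein", "nicht", "zu"},
    "french": {"le", "la", "les", "est", "et", "un", "une", "de"},
}

def detect_language(text):
    """Return the language whose stop list matches the most tokens of text."""
    tokens = text.lower().split()
    scores = {
        lang: sum(token in words for token in tokens)
        for lang, words in STOP_WORDS.items()
    }
    # Ties (including texts with no stop words at all) go to the
    # language listed first in STOP_WORDS.
    return max(scores, key=scores.get)

print(detect_language("the cat is in the garden"))
print(detect_language("die Katze ist nicht in dem Garten"))
print(detect_language("le chat est dans le jardin"))
```

Even with eight words per language the three sample sentences are separated correctly, although a sketch like this is unreliable for very short texts or texts that contain no stop words.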

## Stop Words Learning

• Stop words can be learned from the text, usually by looking at the most frequent words and selecting stop words manually
• But this process can be automated (Wilbur1992):
  • use Term Strength to discover stop words automatically
  • Term Strength: given a pair of documents, what is the probability that a term that occurs in one document of the pair also occurs in the other?
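
The Term Strength definition above can be estimated by counting over document pairs. The sketch below assumes the pairs are already given (how they are chosen is not covered here), uses whitespace tokenization, and counts each pair in both directions; `term_strength` is a hypothetical helper written for this example, not code from the cited paper.

```python
from collections import Counter

def term_strength(pairs):
    """Estimate P(term in the other doc | term in one doc) over document pairs."""
    in_first = Counter()  # times a term occurs in the first document of an ordered pair
    in_both = Counter()   # times it also occurs in the second document
    for doc_x, doc_y in pairs:
        terms_x, terms_y = set(doc_x.split()), set(doc_y.split())
        # Count each pair in both directions so the order within a pair does not matter.
        for first, second in ((terms_x, terms_y), (terms_y, terms_x)):
            for term in first:
                in_first[term] += 1
                if term in second:
                    in_both[term] += 1
    return {term: in_both[term] / in_first[term] for term in in_first}

# Toy pairs of documents, made up for illustration.
pairs = [
    ("the cat sat on the mat", "the cat likes the mat"),
    ("the dog chased a ball", "a dog likes the ball"),
]
strength = term_strength(pairs)
print(strength["the"], strength["cat"], strength["sat"])
```

This only computes the raw scores. Turning them into a stop list needs an additional decision rule, which the text above does not specify.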

## Sources

• Information Retrieval (UFRT)
• Aggarwal, Charu C., and ChengXiang Zhai. "A survey of text clustering algorithms." Mining Text Data. 2012.
• Wilbur, W. John. "The automatic identification of stop words." 1992. [3]