Tokenization
Tokenization is part of the NLP Pipeline and is common to almost any NLP or Information Retrieval task
Tokenization comes in two types:
- decomposing text into sentences
- decomposing sentences into tokens (words)
Word Split
The usual tokenization task: given a text, split it such that the individual words can be accessed
For example
- “The quick brown fox jumps over the lazy dog” ->
- ['The', 'quick', 'brown', 'fox', 'jumps', 'over', 'the', 'lazy', 'dog']
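For clean, whitespace-separated English like this, even a naive split works; a minimal sketch (plain Python, no external libraries):

```python
import re

text = "The quick brown fox jumps over the lazy dog"

# Naive approach: split on whitespace
print(text.split())
# ['The', 'quick', 'brown', 'fox', 'jumps', 'over', 'the', 'lazy', 'dog']

# Regex alternative: keep only runs of word characters (drops punctuation)
print(re.findall(r"\w+", text))
# same result here; the two start to differ once punctuation appears
```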
We need to be careful with special cases (handled by the toy tokenizer sketched below):
- Numbers: "1,000.50" contains both a comma and a dot inside one token
- Multiword expressions: "Los Angeles" may be one token, not two
- Punctuation is important:
- email@gmail.com - the dot is inside the email address
- U.S.A. - watch out for dots inside the token
- Mr. Durand - the dot marks an abbreviation, and the two tokens refer to one person
- see also Text Normalization
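A toy regex tokenizer illustrating these cases (an illustrative sketch, not a production tokenizer; the pattern and the hard-coded abbreviation list are assumptions made for this example):

```python
import re

# Alternatives are tried in order: emails first, then known abbreviations,
# dotted acronyms, numbers with separators, plain words, other punctuation.
TOKEN_RE = re.compile(r"""
      [\w.+-]+@[\w-]+(?:\.[\w-]+)+    # email addresses: email@gmail.com
    | (?:Mr|Mrs|Dr|Prof)\.            # a (toy) list of known abbreviations
    | (?:[A-Za-z]\.){2,}              # dotted acronyms: U.S.A.
    | \d+(?:[.,]\d+)*                 # numbers: 1,000.50
    | \w+                             # plain words
    | [^\w\s]                         # any other punctuation mark
""", re.VERBOSE)

def tokenize(text):
    return TOKEN_RE.findall(text)

print(tokenize("Mr. Durand wrote to email@gmail.com from the U.S.A. in 2024"))
# ['Mr.', 'Durand', 'wrote', 'to', 'email@gmail.com', 'from', 'the', 'U.S.A.', 'in', '2024']
```

Note that a character-level pattern cannot catch multiword expressions like "Los Angeles"; those need a lexicon or a statistical model.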
In some languages tokenization is more difficult
- e.g. German (long compound words) or Chinese (no whitespace between words)
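For Chinese, a dictionary- or model-based segmenter is needed; a sketch using the third-party jieba package (an assumption: `pip install jieba`):

```python
import jieba  # third-party Chinese word segmenter

# "I came to Tsinghua University in Beijing" - no spaces in the input
print(jieba.lcut("我来到北京清华大学"))
# expected: ['我', '来到', '北京', '清华大学']
```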
Sentence Split
Main challenge: distinguishing a sentence-final full stop from a dot inside an abbreviation (e.g. "Mr.", "U.S.A.")
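A sketch with NLTK's Punkt-based sentence tokenizer (assumes `pip install nltk` plus `nltk.download('punkt')`), which is trained to tell abbreviation dots from sentence boundaries:

```python
from nltk.tokenize import sent_tokenize

text = "Mr. Durand lives in Los Angeles. He works downtown."

# A naive split on '.' would also break after 'Mr.';
# the trained model knows that dot belongs to an abbreviation:
print(sent_tokenize(text))
# ['Mr. Durand lives in Los Angeles.', 'He works downtown.']
```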
NLP Pipeline
- Tokenization is usually the very first step in NLP and IR applications
- It can then be followed by:
- Stop Word Removal
- Lemmatization
- building a Vector Space Model or Inverted Index
- etc.
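A sketch of these first pipeline steps with NLTK (assumes nltk is installed along with the 'punkt', 'stopwords', and 'wordnet' data packages):

```python
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer

def preprocess(text):
    # 1. Tokenization
    tokens = word_tokenize(text.lower())
    # 2. Stop Word Removal (also dropping pure punctuation tokens)
    stop = set(stopwords.words("english"))
    tokens = [t for t in tokens if t.isalpha() and t not in stop]
    # 3. Lemmatization
    lemmatizer = WordNetLemmatizer()
    return [lemmatizer.lemmatize(t) for t in tokens]

print(preprocess("The quick brown foxes jumped over the lazy dogs"))
# expected roughly: ['quick', 'brown', 'fox', 'jumped', 'lazy', 'dog']
```

The resulting token lists are what gets fed into a Vector Space Model or an Inverted Index.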