Tokenization
Tokenization is part of the NLP Pipeline and is common to almost any NLP or Information Retrieval task
Tokenization comes in two types:
- decomposing text into sentences
- decomposing sentences into tokens (words)
Word Split
The usual tokenization task: given a text, split it such that the individual words can be accessed
For example
- “The quick brown fox jumps over the lazy dog” ->
- ['The', 'quick', 'brown', 'fox', 'jumps', 'over', 'the', 'lazy', 'dog']
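For clean, whitespace-separated English like this, even a naive split works; a minimal sketch (plain Python, no external libraries):

```python
import re

text = "The quick brown fox jumps over the lazy dog"

# Naive approach: split on whitespace
print(text.split())
# ['The', 'quick', 'brown', 'fox', 'jumps', 'over', 'the', 'lazy', 'dog']

# Regex alternative: keep only runs of word characters (drops punctuation)
print(re.findall(r"\w+", text))
# same result here; the two start to differ once punctuation appears
```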
We need to be careful with special cases (handled by the toy tokenizer sketched below):
- Numbers: "1,000.50" contains both a comma and a dot inside one token
- Multiword expressions: "Los Angeles" may be one token, not two
- Punctuation is important:
- email@gmail.com - the dot is inside the email address
- U.S.A. - watch out for dots inside the token
- Mr. Durand - the dot marks an abbreviation, and the two tokens refer to one person
- see also Text Normalization
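A toy regex tokenizer illustrating these cases (an illustrative sketch, not a production tokenizer; the pattern and the hard-coded abbreviation list are assumptions made for this example):

```python
import re

# Alternatives are tried in order: emails first, then known abbreviations,
# dotted acronyms, numbers with separators, plain words, other punctuation.
TOKEN_RE = re.compile(r"""
      [\w.+-]+@[\w-]+(?:\.[\w-]+)+    # email addresses: email@gmail.com
    | (?:Mr|Mrs|Dr|Prof)\.            # a (toy) list of known abbreviations
    | (?:[A-Za-z]\.){2,}              # dotted acronyms: U.S.A.
    | \d+(?:[.,]\d+)*                 # numbers: 1,000.50
    | \w+                             # plain words
    | [^\w\s]                         # any other punctuation mark
""", re.VERBOSE)

def tokenize(text):
    return TOKEN_RE.findall(text)

print(tokenize("Mr. Durand wrote to email@gmail.com from the U.S.A. in 2024"))
# ['Mr.', 'Durand', 'wrote', 'to', 'email@gmail.com', 'from', 'the', 'U.S.A.', 'in', '2024']
```

Note that a character-level pattern cannot catch multiword expressions like "Los Angeles"; those need a lexicon or a statistical model.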
In some languages tokenization is more difficult
- e.g. German (long compound words) or Chinese (no whitespace between words)
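For Chinese, a dictionary- or model-based segmenter is needed; a sketch using the third-party jieba package (an assumption: `pip install jieba`):

```python
import jieba  # third-party Chinese word segmenter

# "I came to Tsinghua University in Beijing" - no spaces in the input
print(jieba.lcut("我来到北京清华大学"))
# expected: ['我', '来到', '北京', '清华大学']
```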
Sentence Split
Main challenge: distinguishing a sentence-final full stop from a dot inside an abbreviation (e.g. "Mr.", "U.S.A.")
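A sketch with NLTK's Punkt-based sentence tokenizer (assumes `pip install nltk` plus `nltk.download('punkt')`), which is trained to tell abbreviation dots from sentence boundaries:

```python
from nltk.tokenize import sent_tokenize

text = "Mr. Durand lives in Los Angeles. He works downtown."

# A naive split on '.' would also break after 'Mr.';
# the trained model knows that dot belongs to an abbreviation:
print(sent_tokenize(text))
# ['Mr. Durand lives in Los Angeles.', 'He works downtown.']
```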
NLP Pipeline
- Tokenization is usually the very first step in NLP and IR applications
- It can then be followed by:
- Stop Word Removal
- Lemmatization
- building a Vector Space Model or Inverted Index
- etc.
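A sketch of these first pipeline steps with NLTK (assumes nltk is installed along with the 'punkt', 'stopwords', and 'wordnet' data packages):

```python
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer

def preprocess(text):
    # 1. Tokenization
    tokens = word_tokenize(text.lower())
    # 2. Stop Word Removal (also dropping pure punctuation tokens)
    stop = set(stopwords.words("english"))
    tokens = [t for t in tokens if t.isalpha() and t not in stop]
    # 3. Lemmatization
    lemmatizer = WordNetLemmatizer()
    return [lemmatizer.lemmatize(t) for t in tokens]

print(preprocess("The quick brown foxes jumped over the lazy dogs"))
# expected roughly: ['quick', 'brown', 'fox', 'jumped', 'lazy', 'dog']
```

The resulting token lists are what gets fed into a Vector Space Model or an Inverted Index.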