Tokenization
Tokenization is a step in the NLP pipeline, and it is common to almost every NLP or Information Retrieval task
Tokenization can be of two types:
- Decompose text into sentences
- Decompose sentences into tokens
Word Split
The usual form of tokenization: given a text, split it so that individual words can be accessed
For example:
- "The quick brown fox jumps over the lazy dog" -> ['The', 'quick', 'brown', 'fox', 'jumps', 'over', 'the', 'lazy', 'dog']
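A minimal sketch of this kind of split, assuming whitespace is the only separator. The function name `whitespace_tokenize` is made up for illustration and does not come from any library:

```python
def whitespace_tokenize(text):
    """Split text on runs of whitespace (spaces, tabs, newlines)."""
    # str.split() with no arguments splits on any whitespace
    # and discards empty strings.
    return text.split()


tokens = whitespace_tokenize("The quick brown fox jumps over the lazy dog")
print(tokens)
```

This naive split leaves punctuation attached to the neighbouring word: `whitespace_tokenize("Hello, world.")` gives `['Hello,', 'world.']`.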
Special cases need care:
- Numbers, which may contain dots or commas
- Multi-word names: "Los Angeles" may be one token, not two
- Punctuation is important:
  - email@gmail.com: the dot is inside the email address
  - U.S.A.: watch out for dots inside the token
  - Mr. Durand: one person
- see also Text Normalization
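Some of the special cases above can be handled with a regular expression that tries the more specific patterns first. The following is only a sketch: the pattern, the short list of titles, and the name `regex_tokenize` are assumptions made for illustration, not a complete or standard tokenizer.

```python
import re

# Alternatives are tried left to right, so specific patterns come first.
TOKEN_PATTERN = re.compile(
    r"""
    [\w.+-]+@[\w-]+(?:\.[\w-]+)+    # email addresses, e.g. email@gmail.com
    | (?:[A-Za-z]\.){2,}            # dotted abbreviations, e.g. U.S.A.
    | (?:Mr|Mrs|Ms|Dr|Prof)\.       # a few titles (illustrative, not exhaustive)
    | \d+(?:[.,]\d+)*               # numbers with dots or commas inside
    | \w+(?:'\w+)?                  # plain words, optionally with an apostrophe
    | [^\w\s]                       # any other single punctuation character
    """,
    re.VERBOSE,
)


def regex_tokenize(text):
    """Return the tokens matched by TOKEN_PATTERN, in order of appearance."""
    return TOKEN_PATTERN.findall(text)


print(regex_tokenize("Mr. Durand paid 1,250.50 dollars; write to email@gmail.com."))
```

Multi-word names such as "Los Angeles" are not merged by this sketch. Handling them would need extra information, for example a list of known names.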
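In some languages word splitting is difficult, for example in languages written without spaces between words, such as Chinese or Japanese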
Sentence Split
Main challenge: distinguishing a full stop that ends a sentence from a dot inside an abbreviation
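One simple way to approach this is to keep a list of known abbreviations and to end a sentence only at a word that finishes with ".", "!" or "?" and is not in that list. The sketch below assumes whitespace-separated words. The `ABBREVIATIONS` set and the function name `split_sentences` are hypothetical, chosen for illustration:

```python
# A small, illustrative set of abbreviations (lowercase, with trailing dot).
ABBREVIATIONS = {"mr.", "mrs.", "ms.", "dr.", "prof.", "e.g.", "i.e.", "etc.", "u.s.a."}


def split_sentences(text):
    """Split text into sentences, skipping dots that belong to abbreviations."""
    sentences = []
    current = []
    for word in text.split():
        current.append(word)
        ends_sentence = word.endswith((".", "!", "?"))
        if ends_sentence and word.lower() not in ABBREVIATIONS:
            sentences.append(" ".join(current))
            current = []
    if current:
        # Flush any trailing words that did not end with a terminator.
        sentences.append(" ".join(current))
    return sentences


print(split_sentences("Mr. Durand moved to the U.S.A. last year. He likes it there!"))
```

This heuristic cannot tell when an abbreviation is also the last word of a sentence, so a text like "He lives in the U.S.A. She does too." stays unsplit.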
Sources