Tokenization is a step in the NLP Pipeline, and it is needed in almost any NLP or Information Retrieval task.

Tokenization can be of two types:

  • Decompose text into sentences
  • Decompose sentences into tokens

Word Split

The usual form of tokenization: given a text, split it so that individual words can be accessed

For example

  • "The quick brown fox jumps over the lazy dog" ->
  • ['The', 'quick', 'brown', 'fox', 'jumps', 'over', 'the', 'lazy', 'dog']
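The example above can be reproduced with a plain whitespace split. This is only a minimal sketch: the function name `whitespace_tokenize` is made up for illustration, and the approach ignores punctuation entirely.

```python
def whitespace_tokenize(text):
    """Split text on runs of whitespace (naive baseline tokenizer)."""
    return text.split()


tokens = whitespace_tokenize("The quick brown fox jumps over the lazy dog")
print(tokens)
```

This works only because the sentence has no punctuation attached to its words; with "dog." the last token would keep the trailing dot.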

Special cases need care:

  • Numbers - e.g. 3.50 contains a dot but is one token
  • Los Angeles - may be one token, not two
  • Punctuation is important:
      - a dot inside an email address
      - U.S.A. - watch out for dots inside the token
      - Mr. Durand - one person
  • see also Text Normalization
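One way to handle the special cases listed above is a regular expression whose alternatives are ordered from most to least specific. The sketch below is an assumption-laden illustration, not a complete tokenizer: the pattern `TOKEN_RE`, the function `tokenize`, the list of titles and the single multiword entry are all hypothetical and would need extending for real text.

```python
import re

# Alternatives are tried left to right, so specific patterns come first.
TOKEN_RE = re.compile(r"""
    Los[ ]Angeles                    # multiword names (hypothetical, one-entry list)
  | [\w.+-]+@[\w-]+(?:\.[\w-]+)+     # email addresses, dots stay inside the token
  | (?:[A-Za-z]\.){2,}               # dotted abbreviations such as U.S.A.
  | (?:Mr|Mrs|Ms|Dr|Prof)\.          # titles that end with a dot
  | \d+(?:[.,]\d+)*                  # numbers such as 1999, 3.50 or 1,000
  | \w+(?:[-']\w+)*                  # ordinary words, allowing inner - and '
  | [^\w\s]                          # any other single punctuation character
""", re.VERBOSE)


def tokenize(text):
    """Return the tokens found in text, in order of appearance."""
    return TOKEN_RE.findall(text)


print(tokenize("Mr. Durand left Los Angeles, U.S.A. in 1999; mail j.durand@example.com."))
```

Because every group is non-capturing, `findall` returns whole matches. The final dot of the sentence becomes its own token, while the dots inside "Mr.", "U.S.A." and the email address stay attached.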

In some languages splitting text into words is difficult

  • e.g. German (long compound words written without spaces), Chinese (no spaces between words)
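To illustrate why whitespace splitting is not enough for such languages, here is a minimal sketch of greedy longest-match segmentation. This is a swapped-in textbook technique, not something the notes above prescribe; the function `max_match` and the toy `vocabulary` are assumptions made for the example.

```python
def max_match(text, vocabulary):
    """Greedy longest-match segmentation against a known vocabulary."""
    max_len = max(len(word) for word in vocabulary)
    tokens = []
    i = 0
    while i < len(text):
        # Try the longest candidate first; fall back to a single character.
        for length in range(min(max_len, len(text) - i), 0, -1):
            candidate = text[i:i + length]
            if length == 1 or candidate in vocabulary:
                tokens.append(candidate)
                i += length
                break
    return tokens


# Toy vocabulary, purely illustrative.
vocabulary = {"我", "喜欢", "自然", "语言", "自然语言", "处理"}
sentence = "我喜欢自然语言处理"

print(sentence.split())                 # whitespace split finds a single "word"
print(max_match(sentence, vocabulary))  # dictionary-based split finds four
```

The greedy strategy depends entirely on the vocabulary and can pick the wrong split when words overlap, which is one reason segmentation in these languages is considered hard.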

Sentence Split

Main challenge: distinguishing a dot that ends a sentence (full stop) from a dot that belongs to an abbreviation
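A simple way to approach this is to treat a dot as a sentence boundary unless the word carrying it is a known abbreviation. The sketch below assumes a small hand-written set `ABBREVIATIONS` and a hypothetical function `split_sentences`; both are illustrative and far from complete.

```python
# Hypothetical, incomplete list of abbreviations (compared in lower case).
ABBREVIATIONS = {"mr.", "mrs.", "dr.", "prof.", "e.g.", "i.e.", "u.s.a."}


def split_sentences(text):
    """Split text into sentences, skipping dots that belong to abbreviations."""
    sentences = []
    current = []
    for word in text.split():
        current.append(word)
        ends_with_stop = word.endswith((".", "!", "?"))
        if ends_with_stop and word.lower() not in ABBREVIATIONS:
            sentences.append(" ".join(current))
            current = []
    if current:  # trailing text without a final full stop
        sentences.append(" ".join(current))
    return sentences


print(split_sentences("Mr. Durand lives in the U.S.A. now. He likes it there."))
```

The rule fails when an abbreviation is itself the last word of a sentence ("He moved to the U.S.A. She stayed."), so in practice extra signals such as capitalization of the next word would be needed.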
