Text Normalization

Text normalization is a part of the NLP pipeline for preprocessing text data

  • normalization = applying linguistic models to the tokens of a text
  • tokens often differ slightly in spelling but refer to the same thing
  • we need to recognize such tokens and reduce them to a common form


Information Retrieval

  • text normalization is important for IR
  • it maps variant forms to a single term, which reduces the dimensionality of Vector Space Models and the size of the index
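
The effect on index size can be illustrated with a minimal sketch. The `build_index` helper and the sample documents are made up for illustration, and lowercasing stands in for any normalization step:

```python
from collections import defaultdict

def build_index(docs, normalize=lambda token: token):
    """Map each (normalized) token to the set of ids of documents containing it."""
    index = defaultdict(set)
    for doc_id, text in enumerate(docs):
        for token in text.split():
            index[normalize(token)].add(doc_id)
    return index

# Hypothetical toy collection
docs = ["Product prices rise", "product PRICES fall", "Prices of the product"]

raw = build_index(docs)                           # no normalization
folded = build_index(docs, normalize=str.lower)   # lowercase every token

print(len(raw), len(folded))      # 9 6
print(sorted(folded["product"]))  # [0, 1, 2]
```

Without normalization, "Product" and "product" get separate postings lists, so a query for one misses documents that contain the other.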


Types

Word Form Normalization

Words can have many inflected forms, but often the differences are not important and we only need the base form of the word

Can be done by

  • Stemming: keeping only the root of the word (usually just deleting suffixes)
    • economy, economic, economical, economically, economics, economize => econom
  • Lemmatization: keeping only the lemma (the dictionary form of the word)
    • produce, produces, produced, producing => produce
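
The difference between the two approaches can be sketched as follows. This is a toy illustration, not a real stemmer such as Porter's: the suffix list and the lemma table are assumptions chosen to fit the examples, and practical systems use much larger rule sets and dictionaries.

```python
# Toy suffix list, longest suffixes first (an assumption, not a real rule set)
SUFFIXES = ["ically", "ical", "ize", "ics", "ic", "y"]

def toy_stem(word):
    """Strip the first matching suffix, keeping at least 3 characters."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

# Toy lemma dictionary (hypothetical): inflected form -> lemma
LEMMAS = {"produces": "produce", "produced": "produce", "producing": "produce"}

def toy_lemmatize(word):
    """Look the word up in the dictionary; unknown words are left as they are."""
    return LEMMAS.get(word, word)

words = ["economy", "economic", "economical", "economically", "economics", "economize"]
print({toy_stem(w) for w in words})                           # {'econom'}
print(toy_lemmatize("produces"), toy_lemmatize("producing"))  # produce produce
```

Note that the stem "econom" is not a word, while a lemma always is.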


Phonetic Normalization

In English, words that are pronounced the same way can be spelled differently

  • some IR applications need to account for that
  • they use phonetic normalization to reduce similar-sounding words to the same token
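
One well-known phonetic code is Soundex, which is used here only as an example since the text does not name a specific algorithm. The sketch below assumes English words written in ASCII letters:

```python
# Soundex digit for each coded consonant; vowels, h, w and y have no code
CODES = {}
for letters, digit in [("bfpv", "1"), ("cgjkqsxz", "2"), ("dt", "3"),
                       ("l", "4"), ("mn", "5"), ("r", "6")]:
    for ch in letters:
        CODES[ch] = digit

def soundex(word):
    """Return the 4-character Soundex code: first letter plus three digits."""
    letters = [ch for ch in word.lower() if ch.isalpha()]
    if not letters:
        return ""
    result = letters[0].upper()
    prev = CODES.get(letters[0], "")
    for ch in letters[1:]:
        if ch in "hw":
            continue  # h and w do not separate letters with the same code
        code = CODES.get(ch, "")
        if code and code != prev:
            result += code
        prev = code   # a vowel resets prev, so a repeated code after it is kept
    return (result + "000")[:4]  # pad with zeros, cut to 4 characters

print(soundex("Robert"), soundex("Rupert"))  # R163 R163
print(soundex("Smith"), soundex("Smyth"))    # S530 S530
```

Tokens with the same code can then be indexed under that code instead of their spelling.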


Acronyms

Countries

  • the US -> USA
  • U.S.A. -> USA

Organizations

  • UN -> United Nations
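
A simple way to do this is a lookup table applied after removing the dots from dotted abbreviations. In the sketch below the table `CANONICAL` and the function name are hypothetical, and the mappings are taken from the examples above:

```python
import re

# Hypothetical lookup table: undotted acronym -> canonical form
CANONICAL = {"US": "USA", "USA": "USA", "UN": "United Nations"}

# A dotted sequence of capitals such as "U.S.A." or "U.N"
DOTTED = re.compile(r"(?:[A-Z]\.)+[A-Z]?")

def normalize_acronym(token):
    """Drop the dots from a dotted acronym, then map it to its canonical form."""
    if DOTTED.fullmatch(token):
        token = token.replace(".", "")
    return CANONICAL.get(token, token)

print(normalize_acronym("U.S.A."))  # USA
print(normalize_acronym("US"))      # USA
print(normalize_acronym("UN"))      # United Nations
print(normalize_acronym("us"))      # us (lowercase pronoun is left alone)
```

Whether to expand an acronym (UN -> United Nations) or keep the short form is an application decision; the table only has to be consistent.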


Accents / Umlauts

  • naïve -> naive
  • météo -> meteo
  • or the other way around, depending on the application
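
Removing accents can be sketched with Python's standard `unicodedata` module: decompose each character into a base letter plus combining marks, then drop the marks. The function name `strip_accents` is chosen here for illustration.

```python
import unicodedata

def strip_accents(text):
    """Decompose characters (NFD) and drop the combining marks."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))

print(strip_accents("naïve"))  # naive
print(strip_accents("météo"))  # meteo
```

This only handles letters that decompose into a base letter and a mark. Language-specific conventions, such as writing a German umlaut as the vowel followed by "e", would need an explicit mapping.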


Capital Letters

In many cases capital letters aren't needed

  • Product -> product
  • usually the way to handle it is to lowercase all the letters


Careful: sometimes capitalization is needed, e.g. to tell "US" (the country) from "us" (the pronoun)
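
One possible way to handle both cases is to lowercase everything except tokens on a protected list. The list `KEEP_CASE` below is a made-up example, and how to build such a list depends on the application:

```python
# Hypothetical list of tokens whose capitalization carries meaning
KEEP_CASE = {"US", "UN", "IT"}

def fold_case(tokens):
    """Lowercase every token except the protected ones."""
    return [t if t in KEEP_CASE else t.lower() for t in tokens]

print(fold_case(["Product", "from", "the", "US", "reached", "us"]))
# ['product', 'from', 'the', 'US', 'reached', 'us']
```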


Values

Sometimes we want to enforce a specific format on values of certain types, e.g.:

  • phones (+7 (800) 123 1231, 8-800-123-1231 => 0078001231231)
  • dates, times (e.g. 25 June 2015, 25.06.15 => 2015.06.25)
  • currency ($400 => 400 dollars)
  • addresses
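
The phone, date and currency examples above can be reproduced with a few small functions. This is a sketch under stated assumptions: the phone rule follows the Russian convention implied by the example (a leading 8 is the domestic form of +7), the date formats are limited to the two shown, and English month names rely on the default locale.

```python
import re
from datetime import datetime

def normalize_phone(raw):
    """Keep digits only; write the international prefix as 00."""
    digits = re.sub(r"\D", "", raw)
    if raw.strip().startswith("+"):
        return "00" + digits
    # Assumption: an 11-digit number starting with 8 is a Russian domestic number
    if len(digits) == 11 and digits.startswith("8"):
        return "007" + digits[1:]
    return digits

DATE_FORMATS = ["%d %B %Y", "%d.%m.%y"]  # only the two formats from the examples

def normalize_date(raw):
    """Try each known format; return the input unchanged if none matches."""
    for fmt in DATE_FORMATS:
        try:
            return datetime.strptime(raw, fmt).strftime("%Y.%m.%d")
        except ValueError:
            continue
    return raw

def normalize_currency(text):
    """Rewrite a dollar amount such as $400 as '400 dollars'."""
    return re.sub(r"\$(\d+(?:\.\d+)?)", r"\1 dollars", text)

print(normalize_phone("+7 (800) 123 1231"))  # 0078001231231
print(normalize_phone("8-800-123-1231"))     # 0078001231231
print(normalize_date("25 June 2015"))        # 2015.06.25
print(normalize_date("25.06.15"))            # 2015.06.25
print(normalize_currency("$400"))            # 400 dollars
```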


Often we don't care about the specific value, only about what kind of value it is, so we can normalize as follows:

  • $400 => MONEY
  • email@gmail.com => EMAIL
  • 25 June 2015 => DATE
  • +7 (800) 123 1231 => PHONE
  • etc
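
Replacing values with type placeholders is commonly done with regular expressions. The patterns below are deliberately simple illustrations that cover the examples above; they are not robust validators for e-mail addresses, dates or phone numbers.

```python
import re

MONTHS = ("January|February|March|April|May|June|July|"
          "August|September|October|November|December")

# (placeholder, pattern) pairs, applied in this order
PATTERNS = [
    ("EMAIL", re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+")),
    ("DATE", re.compile(r"\b\d{1,2} (?:" + MONTHS + r") \d{4}\b")),
    ("PHONE", re.compile(r"\+\d[\d ()-]{7,}\d")),
    ("MONEY", re.compile(r"\$\d+(?:\.\d+)?")),
]

def replace_values(text):
    """Replace every recognized value with the name of its type."""
    for label, pattern in PATTERNS:
        text = pattern.sub(label, text)
    return text

print(replace_values("Pay $400 to email@gmail.com by 25 June 2015 or call +7 (800) 123 1231"))
# Pay MONEY to EMAIL by DATE or call PHONE
```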


Spelling Correction

Natural language text also contains spelling mistakes

  • in many applications it's useful to correct them
  • e.g. infromation -> information
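
A very simple corrector replaces an unknown word with the most similar word from a known vocabulary. The sketch below uses `difflib.get_close_matches` from the Python standard library as a stand-in for a real spelling model; the vocabulary and the 0.8 cutoff are arbitrary assumptions.

```python
import difflib

# Hypothetical vocabulary of correctly spelled words
VOCABULARY = ["information", "informal", "formation", "retrieval"]

def correct(word, vocabulary=VOCABULARY, cutoff=0.8):
    """Return the closest vocabulary word, or the word itself if nothing is close."""
    if word in vocabulary:
        return word
    matches = difflib.get_close_matches(word, vocabulary, n=1, cutoff=cutoff)
    return matches[0] if matches else word

print(correct("infromation"))  # information
print(correct("xyz"))          # xyz
```

Real spelling correctors usually also take word frequency and context into account, which this sketch ignores.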


Applications

Text normalization can often be seen as a set of dimensionality reduction techniques applied to term-document matrices: each group of variant terms collapses into a single dimension
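
To make the matrix view concrete, the sketch below builds a small term-document count matrix with and without normalization. The documents and helper names are invented for illustration, and the normalizer combines just two of the techniques above (accent removal and lowercasing):

```python
import unicodedata

def normalize(token):
    """Strip accents, then lowercase."""
    decomposed = unicodedata.normalize("NFD", token)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch)).lower()

def term_document_matrix(docs, norm=lambda token: token):
    """Return (vocabulary, matrix) with one row per document and one column per term."""
    tokenized = [[norm(t) for t in doc.split()] for doc in docs]
    vocab = sorted({t for doc in tokenized for t in doc})
    matrix = [[doc.count(term) for term in vocab] for doc in tokenized]
    return vocab, matrix

# Hypothetical toy collection
docs = ["Naïve product", "naive Product", "NAIVE products"]

raw_vocab, _ = term_document_matrix(docs)
vocab, matrix = term_document_matrix(docs, norm=normalize)

print(len(raw_vocab), len(vocab))  # 6 3
print(vocab)                       # ['naive', 'product', 'products']
print(matrix)                      # [[1, 1, 0], [1, 1, 0], [1, 0, 1]]
```

Adding stemming on top would also merge "product" and "products", leaving two columns.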


Sources