# ML Wiki

## Text Normalization

It's a part of NLP Pipeline for preprocessing text data

• normalization = applying some linguistic models to tokens of text
• text tokens often have some minor difference in spelling, but refer to same thing
• need to recognize such tokens and reduce them to the same common form
• it's important to do text normalization for IR:
• it reduces the dimensionality of Vector Space Models and the size of the Index

## Types

### Word Form Normalization

Forms can have many inclinations, but more often they are not important and we need to know only the base form of the word

Can be done by

• Stemming: keeping only the root of the word (usually just deleting suffixes)
• economy, economic, economical, economically, economics, economize => econom
• Lemmatization: keeping only the lemma
• produce, produces, product, production => produce

### Phonetic Normalization

In English words that are pronounced the same way can be spelled differently

• in some IR applications need to account for that
• use phonetic normalization to reduce similar-sounding words to the same token

### Acronyms

Countries

• the US -> USA
• U.S.A. -> USA

Organizations

• UN -> United Nations

### Accents / Umlauts

• naïve -> naive
• météo -> meteo
• or can be the other way around - depending on application

### Capital Letters

In many cases capital letter aren't needed

• Product -> product
• usually the way to handle it is to lovercase all the letters

Careful: sometimes capitalization is needed

### Values

Sometimes we want to enforce some specific format on some values of some types

• e.g:
• phones (+7 (800) 123 1231, 8-800-123-1231 => 0078001231231)
• dates, times (e.g. 25 June 2015, 25.06.15 => 2015.06.25)
• currency (\$400 => 400 dollars) • addresses Often we don't care about specific value, only what this value mean, so we can do the following normalization: • \$400 => MONEY
• email@gmail.com => EMAIL
• 25 June 2015 => DATE
• +7 (800) 123 1231 => PHONE
• etc

### Spelling Correction

Also in Natural Languages there are spelling mistakes

• In many applications it's useful to correct them
• e.g. infromation -> information

## Applications

Often text normalization can be seen as a set Dimensionality Reduction techniques applied to term-document matrices