Phonetic Normalization
Phonetic normalization is a form of Text Normalization used in Information Retrieval applications.
Problem:
- in English (and many other languages) words that are pronounced the same way can be spelled differently
- some IR applications need to account for that
- use phonetic normalization to reduce similar-sounding words to the same token
So a phonetic normalization algorithm should:
- facilitate the retrieval of words that sound similar.
Soundex
Soundex is a phonetic normalization algorithm
- encodes words according to their pronunciation
- each word is compressed into a 4-character code, the Soundex code: one letter followed by three digits.
Algorithm:
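- keep the first letter of the name
- in the rest of the name, replace similar-sounding ("phonetically close") consonants with digits:
  - bfpv -> 1;
  - cgjkqsxz -> 2;
  - dt -> 3;
  - l -> 4;
  - mn -> 5;
  - r -> 6;
- remove all consecutive occurrences of the same digit, keeping only one
- then drop all remaining a, e, i, o, u, y, h, w (dropping them only at this point preserves equal digits that were separated by one of these letters, as in Herman -> H655)
- keep only the first four characters of the resulting string (pad with trailing zeros if needed)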
Examples:
- Herman -> H655
- Veronika, Veronique -> V652
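The steps can be sketched in Python as follows. This is a minimal sketch of the simplified variant described here, not a reference implementation: the names `CODES` and `soundex` are chosen for illustration, and it assumes that vowels (and y, h, w) are dropped only after repeated digits have been collapsed, which is what yields H655 for Herman.

```python
# Digit classes from the table above. Letters that are not listed
# (a, e, i, o, u, y, h, w) get the placeholder "0" and are dropped later.
CODES = {}
for letters, digit in [("bfpv", "1"), ("cgjkqsxz", "2"), ("dt", "3"),
                       ("l", "4"), ("mn", "5"), ("r", "6")]:
    for ch in letters:
        CODES[ch] = digit


def soundex(word: str) -> str:
    """Return a 4-character code for `word` (simplified Soundex sketch)."""
    letters = [ch for ch in word.lower() if ch.isalpha()]
    if not letters:
        return ""
    first, rest = letters[0], letters[1:]

    # Encode everything after the first letter.
    digits = [CODES.get(ch, "0") for ch in rest]

    # Collapse consecutive occurrences of the same digit into one.
    collapsed = []
    for d in digits:
        if not collapsed or collapsed[-1] != d:
            collapsed.append(d)

    # Drop the placeholders, pad with zeros, keep four characters.
    kept = [d for d in collapsed if d != "0"]
    return (first.upper() + "".join(kept) + "000")[:4]


print(soundex("Herman"))     # H655
print(soundex("Veronika"))   # V652
print(soundex("Veronique"))  # V652
```

Because the vowel placeholder is removed only after collapsing, the m and n in Herman stay separate digits, while the doubled n in a spelling such as Hermann collapses to one, so both spellings receive the same code.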
Usage
Useful for:
- First and last names
- Street names
- etc
Sources