Noise Handling (Data Mining)

Noise - a modification of the original value

  • Typically very hard to detect
  • Unlike outliers, which are noticeably different from all other values, noisy data looks like real data

Reasons for Noise:

  • Faulty data collection instruments
  • People who do not want to give real data and enter garbage instead
    • e.g. age = 40: true or false? The value alone does not tell
  • Data entry or transmission problems

Detecting and Handling

There are several techniques:

  • Cluster analysis: build clusters, then check for values that do not fit the cluster they were assigned to
  • Build a model, then run it on the original data set
    • misclassified instances may be due to noise: do they look strange?
    • (figure: noise-regression.png)
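
The model-based idea can be sketched as follows. This is only an illustration, not a prescribed method: the nearest-centroid classifier, the helper names (fit_centroids, predict, suspect_indices) and the toy data are all assumptions made for the example. The model is fit on the labelled data, run on that same data, and every instance it misclassifies is reported as a candidate for noise.

```python
from collections import defaultdict
from math import dist  # Euclidean distance, Python 3.8+


def fit_centroids(points, labels):
    """Return {label: centroid}, the mean point of each class."""
    groups = defaultdict(list)
    for point, label in zip(points, labels):
        groups[label].append(point)
    return {
        label: tuple(sum(coord) / len(members) for coord in zip(*members))
        for label, members in groups.items()
    }


def predict(centroids, point):
    """Predict the label whose centroid is closest to the point."""
    return min(centroids, key=lambda label: dist(centroids[label], point))


def suspect_indices(points, labels):
    """Fit on the data, run on the same data, return misclassified indices."""
    centroids = fit_centroids(points, labels)
    return [
        i
        for i, (point, label) in enumerate(zip(points, labels))
        if predict(centroids, point) != label
    ]


# Hypothetical toy data: two well-separated groups.
# The point at index 3 lies in the first group but carries label "b".
points = [(1.0, 1.0), (1.2, 0.8), (0.9, 1.1), (1.1, 1.0),
          (5.0, 5.0), (5.2, 4.9), (4.8, 5.1), (5.1, 5.0)]
labels = ["a", "a", "a", "b", "b", "b", "b", "b"]

print(suspect_indices(points, labels))  # [3]
```

A flagged instance is only a suspect: it may be noise, or it may be a legitimate but unusual value, so it still has to be inspected by hand.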