Noise Handling (Data Mining)
Noise - a modification of the original value
- Typically very hard to detect
- Unlike outliers, which are noticeably different from all other values
- Noisy data looks like real data
Reasons for Noise:
- Faulty data collection instruments
- People don't want to provide the data and enter garbage instead
- e.g. a reported age of 40: true or false? The value is plausible either way
- Data entry or transmission problems
Detecting and Handling
There are several techniques:
- Cluster analysis: build clusters, then check whether some values don't really belong to the cluster they were assigned to
- Build a model, then run it on the original data set
- Misclassified instances can be due to noise: do they look strange?
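The cluster-analysis idea can be sketched roughly as follows. This is only an illustration under stated assumptions: the toy 2-D points, the starting centroids, the plain k-means loop and the rule "further than 3 times the cluster's median distance from the centroid" are all choices made for this example, not something prescribed by the notes.

```python
import math
import statistics

def assign(points, centroids):
    """Index of the nearest centroid for every point."""
    return [min(range(len(centroids)), key=lambda c: math.dist(p, centroids[c]))
            for p in points]

def kmeans(points, centroids, iterations=20):
    """Plain k-means, started from the given (assumed) centroids."""
    for _ in range(iterations):
        labels = assign(points, centroids)
        updated = []
        for c, old in enumerate(centroids):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                # New centroid = coordinate-wise mean of the cluster members.
                updated.append(tuple(sum(axis) / len(members) for axis in zip(*members)))
            else:
                updated.append(old)  # keep an empty cluster's centroid where it was
        centroids = updated
    return assign(points, centroids), centroids

def suspicious_points(points, labels, centroids, factor=3.0):
    """Indices of points unusually far from their own centroid.

    The threshold (factor * median distance within the cluster) is an
    assumption for this sketch, not a standard rule.
    """
    flagged = []
    for c, centre in enumerate(centroids):
        members = [(i, math.dist(p, centre))
                   for i, (p, lab) in enumerate(zip(points, labels)) if lab == c]
        if not members:
            continue
        median = statistics.median([d for _, d in members])
        flagged += [i for i, d in members if d > factor * median]
    return sorted(flagged)

# Toy data: two tight groups plus one value that fits neither well (index 10).
points = [(1.0, 1.0), (1.2, 0.9), (0.8, 1.1), (1.1, 1.2), (0.9, 0.8),
          (8.0, 8.0), (8.2, 7.9), (7.8, 8.1), (8.1, 8.2), (7.9, 7.8),
          (3.0, 3.0)]
labels, centroids = kmeans(points, centroids=[(0.0, 0.0), (10.0, 10.0)])
flagged = suspicious_points(points, labels, centroids)
print(flagged)  # [10]
```

A flagged value is only a candidate: it still has to be inspected, since it may be a genuine but unusual observation rather than noise.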
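The model-based check can be sketched in a similar way. The notes do not name a model, so this example assumes a small k-nearest-neighbour classifier and predicts each instance from all the other instances (leave-one-out), so that an instance cannot simply vote for its own label. The data set and the names `knn_predict` and `misclassified_instances` are hypothetical.

```python
import math
from collections import Counter

def knn_predict(train, query, k=3):
    """Majority label among the k training instances closest to `query`."""
    nearest = sorted(train, key=lambda item: math.dist(item[0], query))[:k]
    return Counter(label for _, label in nearest).most_common(1)[0][0]

def misclassified_instances(data, k=3):
    """Leave-one-out check: indices whose label disagrees with the model."""
    suspects = []
    for i, (features, label) in enumerate(data):
        others = data[:i] + data[i + 1:]  # everything except instance i
        if knn_predict(others, features, k) != label:
            suspects.append(i)
    return suspects

# Toy labelled data: instance 3 sits in the middle of class "a"
# but carries the label "b", as if the label had been corrupted.
data = [((1.0, 1.0), "a"), ((1.2, 0.9), "a"), ((0.8, 1.1), "a"),
        ((1.1, 1.2), "b"), ((0.9, 0.8), "a"),
        ((8.0, 8.0), "b"), ((8.2, 7.9), "b"), ((7.8, 8.1), "b"),
        ((8.1, 8.2), "b"), ((7.9, 7.8), "b")]
suspects = misclassified_instances(data)
print(suspects)  # [3]
```

As with the clustering sketch, a misclassified instance is a hint, not proof: it may be noise, or the model may simply be wrong about a hard case.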
Sources