Data Cleaning
Raw data can have several kinds of problems:
- Missing values - NAs, NULLs, empty or blank values
- Outliers - extreme values
- Noise in data - modifications of the original value, hard to detect
- Duplicates
Main Problems and Tools
There are several approaches to handling missing values:
- radical: ignore row/column
- fill with default value or mean
- build a Machine Learning model to predict missing values
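The first two approaches can be sketched in plain Python. This is only an illustration under stated assumptions: the toy table, the column names, and the helper functions `drop_incomplete`, `fill_default` and `fill_mean` are hypothetical, and `None` is assumed to mark a missing value. The model-based approach is not shown here.

```python
from statistics import mean

# Hypothetical toy table; None marks a missing value (an assumption of this sketch).
rows = [
    {"age": 25, "city": "Berlin"},
    {"age": None, "city": "Paris"},
    {"age": 35, "city": None},
]

def drop_incomplete(rows):
    """Radical approach: ignore every row that has at least one missing value."""
    return [r for r in rows if all(v is not None for v in r.values())]

def fill_default(rows, column, default):
    """Replace missing values in one column with a fixed default value."""
    return [
        {**r, column: default if r[column] is None else r[column]}
        for r in rows
    ]

def fill_mean(rows, column):
    """Replace missing values in a numeric column with the mean of the observed values."""
    observed = [r[column] for r in rows if r[column] is not None]
    return fill_default(rows, column, mean(observed))

complete = drop_incomplete(rows)
filled = fill_default(fill_mean(rows, "age"), "city", "unknown")
```

Dropping rows is the simplest option but throws away the observed values in those rows. Filling keeps every row at the cost of inventing values.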
Outliers Detection
Outliers are extreme values in the data
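One possible way to flag extreme values is the interquartile range (IQR) rule. The notes above do not name a specific method, so treat this as a sketch under assumptions: the function `find_outliers`, the sample values, and the conventional multiplier 1.5 are all illustrative choices.

```python
from statistics import quantiles

def find_outliers(values, k=1.5):
    """Return values lying more than k * IQR outside the first and third quartiles.

    k=1.5 is a common convention, not something prescribed by these notes.
    """
    q1, _, q3 = quantiles(values, n=4)  # the three quartile cut points
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if v < lower or v > upper]

# Hypothetical measurements with one obviously extreme value.
values = [10, 12, 11, 13, 12, 11, 95]
outliers = find_outliers(values)
```

Whether a flagged value is an error or a genuine but rare observation still has to be decided case by case.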
Noise: modification of an original value
- very hard to detect, because noisy data looks like real data
Duplicate Data: a major issue when you merge data from different sources
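A minimal sketch of removing duplicates while merging two sources is given below. It assumes that records count as duplicates when their name and email match after trimming whitespace and lower-casing. That matching rule, the field names, and the functions `normalize` and `merge_without_duplicates` are hypothetical; real data usually needs a rule chosen for the domain.

```python
def normalize(record):
    """Build a comparison key; trimming and lower-casing are illustrative choices."""
    return (record["name"].strip().lower(), record["email"].strip().lower())

def merge_without_duplicates(*sources):
    """Concatenate sources, keeping only the first record seen for each key."""
    seen = set()
    merged = []
    for source in sources:
        for record in source:
            key = normalize(record)
            if key not in seen:
                seen.add(key)
                merged.append(record)
    return merged

# Two hypothetical sources that describe the same person slightly differently.
crm = [{"name": "Ann Lee", "email": "ann@example.com"}]
shop = [
    {"name": "ann lee ", "email": "ANN@example.com"},
    {"name": "Bob Ray", "email": "bob@example.com"},
]
merged = merge_without_duplicates(crm, shop)
```

An exact comparison of the raw records would miss this duplicate, because the two sources differ in letter case and trailing whitespace.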
See Also
Sources