Data Cleaning
Data can have several kinds of problems:
- Missing values - NAs, NULLs, empty or blank values
- Outliers - extreme values
- Noise in data - modifications of the original value, hard to detect
- Duplicates
Main Problems and Tools
Handling Missing Values
There are several approaches:
- radical: drop the affected row or column
- fill with a default value or the mean of the observed values
- build a Machine Learning model to predict missing values
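The first two approaches above can be sketched in a few lines of plain Python. This is a minimal illustration, assuming missing values are represented as `None`; the helper names `drop_missing` and `fill_with_mean` and the sample data are hypothetical, not part of any library. The model-based approach is not shown.

```python
from statistics import mean

def drop_missing(rows):
    """Radical approach: discard every row that contains a missing value."""
    return [row for row in rows if all(v is not None for v in row)]

def fill_with_mean(values):
    """Replace each missing entry (None) with the mean of the observed entries."""
    observed = [v for v in values if v is not None]
    if not observed:
        # Nothing to compute a mean from; return an unchanged copy.
        return list(values)
    fill = mean(observed)
    return [fill if v is None else v for v in values]

# Hypothetical sample data: (age, city) rows with gaps.
rows = [(20, "Berlin"), (None, "Paris"), (40, None), (30, "Rome")]
complete_rows = drop_missing(rows)

ages = [20, None, 40, 30, None]
filled_ages = fill_with_mean(ages)
```

Dropping rows is simple but loses data, while mean filling keeps every row at the cost of shrinking the variance of the filled column.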
Outlier Detection
Outliers are extreme values in the data
- can influence your models, e.g. Linear Regression
- so it's a good idea to detect them
- use Anomaly Detection techniques for that
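As one simple example of such a technique, the sketch below flags values outside the interquartile-range fences (Tukey's rule). The choice of this rule, the multiplier `k=1.5`, the function name `find_outliers_iqr`, and the sample data are assumptions made for illustration; other anomaly detection methods may suit a given dataset better.

```python
from statistics import quantiles

def find_outliers_iqr(values, k=1.5):
    """Return the values lying outside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1, _median, q3 = quantiles(values, n=4)  # the three quartile cut points
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    # Keep the original order of the flagged values.
    return [v for v in values if v < low or v > high]

# Hypothetical measurements with one extreme value.
measurements = [10, 12, 11, 13, 12, 11, 95]
outliers = find_outliers_iqr(measurements)
```

Whether a flagged value should be removed, capped, or kept depends on the application; the rule only marks candidates for inspection.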
Handling Noise
Noise is a modification of an original value
- very hard to detect - because noisy data looks like real data
Duplicate Detection
Duplicate data is a major issue when you merge data from different sources, because the same entity can appear in more than one of them
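A minimal sketch of duplicate detection during a merge is shown below. It assumes records are dictionaries with hypothetical `name` and `email` fields, and that two records are duplicates when these fields match after trimming whitespace and lowercasing. Real-world duplicates often differ in less regular ways and may need fuzzy matching, which this sketch does not attempt.

```python
def normalize(record):
    """Build a comparison key from the (hypothetical) name and email fields."""
    return (record["name"].strip().lower(), record["email"].strip().lower())

def merge_without_duplicates(*sources):
    """Concatenate sources, keeping only the first record seen for each key."""
    seen = set()
    merged = []
    for source in sources:
        for record in source:
            key = normalize(record)
            if key not in seen:
                seen.add(key)
                merged.append(record)
    return merged

# Hypothetical data from two systems describing overlapping customers.
crm = [{"name": "Ann Lee", "email": "ann@example.com"}]
shop = [
    {"name": " ann lee", "email": "ANN@example.com"},
    {"name": "Bob Ray", "email": "bob@example.com"},
]
merged = merge_without_duplicates(crm, shop)
```

Keeping the first occurrence is an arbitrary policy chosen for brevity; a real merge might instead combine the fields of matching records.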