Data Cleaning

Real-world data can suffer from several kinds of problems:

  • Missing values - NAs, NULLs, empty or blank values
  • Outliers - extreme values
  • Noise in data - modifications of the original value, hard to detect
  • Duplicates - the same record stored more than once


Main Problems and Tools

Handling Missing Values

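There are several approaches to handling missing values:

  • radical: drop the row or column that contains the missing value
  • fill the gap with a default value or with the mean of the observed values
  • build a Machine Learning model to predict the missing values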
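
The first two approaches can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions: the `ages` list is made-up sample data, `None` stands in for a missing value, and the helper names `drop_missing` and `fill_missing` are hypothetical. The model-based approach is not shown.

```python
from statistics import mean

# Hypothetical sample data; None marks a missing value.
ages = [25, None, 31, 40, None, 24]


def drop_missing(values):
    """Radical approach: discard the missing entries entirely."""
    return [v for v in values if v is not None]


def fill_missing(values, default=None):
    """Replace missing entries with `default`.

    When no default is given, use the mean of the observed values.
    """
    observed = drop_missing(values)
    if default is None:
        if not observed:
            raise ValueError("no observed values to compute a mean from")
        default = mean(observed)
    return [default if v is None else v for v in values]


dropped = drop_missing(ages)                 # missing entries removed
filled_mean = fill_missing(ages)             # gaps filled with the mean
filled_zero = fill_missing(ages, default=0)  # gaps filled with a fixed default
```

Dropping keeps only trusted values but shrinks the data set, while filling keeps its size at the cost of introducing values that were never observed.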


Outlier Detection

Outliers are extreme values: observations that lie far from the rest of the data.
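
One common way to make "extreme" concrete is the interquartile range (IQR) rule, which flags values lying more than k times the IQR outside the quartiles. The sketch below is an assumption, not a method prescribed by this page: the function name `iqr_outliers`, the conventional k = 1.5, and the `readings` sample are all illustrative.

```python
from statistics import quantiles


def iqr_outliers(values, k=1.5):
    """Return the values lying outside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1, _, q3 = quantiles(values, n=4)  # the three quartile cut points
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if v < low or v > high]


# Hypothetical sensor readings with one obviously extreme value.
readings = [10, 12, 11, 13, 12, 11, 95, 12, 10, 13]
outliers = iqr_outliers(readings)
```

Here only the reading 95 falls outside the fences. Whether a flagged value is an error or a genuine rare observation still has to be decided from domain knowledge.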


Handling Noise

Noise is a modification of an original value.

  • very hard to detect, because noisy data looks like real data
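
A small sketch of why noise is hard to detect: perturbed values still pass a plausibility check. Everything here is assumed for illustration, including the temperature values, the noise level, the range limits, and the helper name `in_plausible_range`.

```python
import random

rng = random.Random(42)  # fixed seed so the sketch is repeatable

# Hypothetical "true" temperatures in degrees Celsius.
true_temps = [20.1, 21.4, 19.8, 22.0, 20.6]

# Simulate noise by adding a small random perturbation to each value.
noisy_temps = [t + rng.gauss(0, 0.3) for t in true_temps]


def in_plausible_range(values, low=-40.0, high=50.0):
    """A simple sanity check: are all values within a believable range?"""
    return all(low <= v <= high for v in values)


clean_ok = in_plausible_range(true_temps)
noisy_ok = in_plausible_range(noisy_temps)
```

Both lists pass the check, so a range test alone cannot tell the noisy values from the original ones.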


Duplicate Detection

Duplicate data is a major issue when you merge data from different sources, because the same entity may be recorded in more than one of them.
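
A minimal sketch of removing duplicates while merging two sources, assuming that records count as duplicates when they match after trimming whitespace and ignoring letter case. The source names `crm` and `billing`, the sample records, and the helpers `normalize` and `merge_without_duplicates` are hypothetical.

```python
def normalize(record):
    """Build a comparison key: trim whitespace and ignore letter case."""
    return tuple(str(field).strip().lower() for field in record)


def merge_without_duplicates(*sources):
    """Concatenate the sources, keeping the first copy of each record."""
    seen = set()
    merged = []
    for source in sources:
        for record in source:
            key = normalize(record)
            if key not in seen:
                seen.add(key)
                merged.append(record)
    return merged


# Hypothetical records from two systems; Alice appears in both.
crm = [("Alice Smith", "alice@example.com"), ("Bob Jones", "bob@example.com")]
billing = [("alice smith ", "Alice@Example.com"), ("Carol White", "carol@example.com")]

merged = merge_without_duplicates(crm, billing)
```

This catches only duplicates that become identical after normalization. Records that differ through typos or abbreviations need approximate matching, which is not shown here.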

