ML Wiki
Machine Learning Wiki - A collection of ML concepts, algorithms, and resources.

Summarizing Data

Before doing any data analysis, we need to check whether the data is good.

Why?

  • The data is usually too big to inspect by eye
  • Problems need to be found before the analysis starts

Problems:

  • Missing values
  • Values outside of expected ranges
  • Values that seem to be in the wrong units
  • Mislabeled variables/columns
  • Variables that are the wrong class

Summarizing Data in R

Summary Statistics

  • summary(x) - summarizes all quantitative and qualitative variables
  • quantile(x) - quantiles of a numeric variable (by default the minimum, quartiles and maximum), which show its range and spread
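
As a minimal sketch of both calls, assume a small made-up data frame `df`; the column names and values are illustrative only and do not come from any real dataset.

```r
# Hypothetical toy data: one quantitative and one qualitative variable
df <- data.frame(
  age   = c(23, 35, 41, NA, 29),
  group = factor(c("a", "b", "a", "b", "a"))
)

# Per-column summary: min, quartiles, mean, max and NA count for numeric
# columns; counts per level for factor columns
summary(df)

# quantile() takes a numeric vector; with NAs present it stops with an
# error unless na.rm = TRUE is given
q <- quantile(df$age, na.rm = TRUE)
q
```

Comparing the minimum and maximum against the values you expect is a quick way to spot out-of-range entries or wrong units.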

sapply(x[1, ], class)

  • calls class for every element of the 1st row
  • tells if data was loaded properly
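
The class check can be sketched as follows, assuming a hypothetical data frame in which a numeric column was read in as text because of a stray "n/a" entry.

```r
# Hypothetical toy data: height should be numeric but was loaded as character
df <- data.frame(
  id     = 1:3,
  height = c("1.70", "1.82", "n/a"),
  stringsAsFactors = FALSE
)

# Class of each element of the first row, one entry per column
classes <- sapply(df[1, ], class)
classes

# sapply(df, class) gives the same answer here without subsetting a row
```

A column that shows up as character (or factor) when a number was expected usually points to a loading problem.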

names(x)

  • columns’ names

Sizes:

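  • dim(x) - size of the dataset as (rows, columns)
  • same as c(nrow(x), ncol(x))
  • length(x) - number of elements of a vector (for a data frame, the number of columns)
  • unique(x) - the distinct values of x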
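
A short sketch of the name and size functions on a made-up data frame (the columns `x` and `y` are illustrative):

```r
# Hypothetical toy data with a repeated row
df <- data.frame(x = c(1, 2, 2, 3), y = c("a", "b", "b", "c"))

names(df)              # column names
dim(df)                # number of rows, then number of columns
nrow(df)               # rows only
ncol(df)               # columns only

length(df)             # for a data frame: number of columns
length(df$x)           # for a vector: number of elements
unique(df$x)           # distinct values
length(unique(df$x))   # how many distinct values there are
```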

tables

  • table(x) - the unique values of x with a count of each
  • table(x, y) - two-dimensional table (cross-tabulation of x against y)
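
Both forms of table can be sketched on two made-up categorical vectors (`colour` and `size` are hypothetical names):

```r
# Hypothetical toy data: two categorical variables of equal length
colour <- c("red", "blue", "red", "green", "red")
size   <- c("S", "M", "S", "S", "M")

# One-dimensional table: each unique value with its count
counts <- table(colour)
counts

# Two-dimensional table: counts for every (colour, size) combination
cross <- table(colour, size)
cross

# NAs are dropped by default; useNA = "ifany" adds a count for them
table(c("a", NA, "a"), useNA = "ifany")
```

Unexpected categories or misspelled labels tend to stand out in these counts.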

logical tests

  • any(x > 10) - are there any TRUEs?
  • all(x > 10) - are all of them TRUE?
  • which(x > 10) - which elements are TRUEs?
  • which(is.na(x)) - which are NAs
  • use ! for not, & for and, | for or, e.g. which(!is.na(x) & x > 10)
  • sum(is.na(x)) - how many NAs
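
The logical tests above can be tried on a small made-up vector that contains one missing value. In R, negation is written `!`.

```r
# Hypothetical toy vector with one NA
x <- c(4, 12, NA, 15, 7)

any_big   <- any(x > 10)                # is at least one comparison TRUE?
all_big   <- all(x > 10)                # are all comparisons TRUE?
idx_big   <- which(x > 10)              # positions where x > 10 (NAs are skipped)
idx_na    <- which(is.na(x))            # positions of the NAs
idx_valid <- which(!is.na(x) & x > 10)  # not NA and greater than 10
n_na      <- sum(is.na(x))              # TRUE counts as 1, so this counts the NAs
```

Here `any_big` is TRUE, `all_big` is FALSE, the positions greater than 10 are 2 and 4, and there is a single NA at position 3.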

summarizing by columns or rows

  • rowSums, rowMeans
  • colSums, colMeans
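
A small sketch of the row and column summaries on a made-up numeric matrix; the same calls work on a data frame whose columns are all numeric.

```r
# Hypothetical 2 x 3 matrix, filled column by column:
# row 1 is (1, 3, 5), row 2 is (2, 4, 6)
m <- matrix(1:6, nrow = 2)

rs <- rowSums(m)    # one sum per row
rm <- rowMeans(m)   # one mean per row
cs <- colSums(m)    # one sum per column
cm <- colMeans(m)   # one mean per column

# With missing values, pass na.rm = TRUE to skip them
m_na <- m
m_na[1, 1] <- NA
cs_na <- colSums(m_na, na.rm = TRUE)
```

Combined with is.na, these give per-column NA counts, for example colSums(is.na(m_na)).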
