Summarizing Data

Before doing any data analysis, we need to check whether the data is sound

Why?

  • The data is usually too big to inspect by eye
  • Problems need to be found before analyzing, not after


Common problems:

  • Missing values
  • Values outside of expected ranges
  • Values that seem to be in the wrong units
  • Mislabeled variables/columns
  • Variables stored as the wrong class (e.g. numbers read in as character)


Summarizing Data in R

Summary Statistics

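  • summary(x) - summarizes every column: min, quartiles, mean and max for quantitative variables, level counts for factors; also reports the number of NAs
  • quantile(x) - sample quantiles of a numeric vector (by default min, quartiles and max); needs na.rm = TRUE if x contains NAs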
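
A minimal sketch of both calls on a toy data frame. The data and the column names (height, group) are made up for illustration only.

```r
# Hypothetical toy data; column names are illustrative, not from a real dataset.
df <- data.frame(
  height = c(150, 160, 170, 180, NA),
  group  = factor(c("a", "b", "a", "b", "a"))
)

# Per-column summary: numeric columns get min/quartiles/mean/max and an NA count,
# factor columns get a count per level.
summary(df)

# quantile() stops with an error on NAs unless na.rm = TRUE is given.
q <- quantile(df$height, na.rm = TRUE)   # 0%, 25%, 50%, 75%, 100%
q

# Other cut points can be requested through probs.
q_tails <- quantile(df$height, probs = c(0.1, 0.9), na.rm = TRUE)
q_tails
```

Comparing the 0% and 100% quantiles with the range you expect is a quick way to spot out-of-range values or wrong units.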

sapply(x[1, ], class)

  • calls class on each column of the 1st row, i.e. reports the class of every column
  • shows whether the data was loaded properly (e.g. numbers read in as numeric, not character)
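
A small sketch of the class check, assuming a made-up data frame in which a numeric column was read in as text. For a data frame, sapply(x, class) gives the same result as the first-row form.

```r
# Hypothetical data: "score" should be numeric but contains the text "n/a",
# so the whole column ends up as character.
df <- data.frame(
  id    = 1:3,
  score = c("10", "12", "n/a"),
  stringsAsFactors = FALSE
)

classes_row1 <- sapply(df[1, ], class)  # form used in these notes
classes_all  <- sapply(df, class)       # same answer, without subsetting
classes_all

# Columns whose class is not what we expected (here: anything character).
suspicious <- names(classes_all)[classes_all == "character"]
suspicious
```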

names(x)

  • returns the column names
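
One possible way to use names(x) for spotting mislabeled columns is to compare it with the names you expect. The data frame and the expected vector below are hypothetical.

```r
# Hypothetical data with a deliberately misspelled column name ("hieght").
df <- data.frame(id = 1:2, hieght = c(150, 160))

# Hypothetical list of columns we expected to find.
expected <- c("id", "height")

names(df)

missing_cols    <- setdiff(expected, names(df))  # expected but absent
unexpected_cols <- setdiff(names(df), expected)  # present but not expected
missing_cols
unexpected_cols
```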

Sizes:

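  • dim(x) - size of the dataset as c(rows, columns)
  • same values as nrow(x) and ncol(x)
  • length(x) - number of elements of a vector (for a data frame, the number of columns)
  • unique(x) - the distinct values; length(unique(x)) counts them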
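
A short sketch of the size functions on a made-up data frame (the id and city columns are illustrative):

```r
# Hypothetical toy data.
df <- data.frame(
  id   = 1:4,
  city = c("Oslo", "Rome", "Oslo", "Lima"),
  stringsAsFactors = FALSE
)

size <- dim(df)            # rows, then columns
n_rows <- nrow(df)
n_cols <- ncol(df)

length(df)                 # for a data frame: number of columns
length(df$city)            # for a vector: number of elements

cities   <- unique(df$city)          # distinct values, in order of first appearance
n_cities <- length(unique(df$city))  # how many distinct values
```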

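Tables

  • table(x) - counts how often each unique value occurs (NAs are left out unless useNA = "ifany" is set)
  • table(x, y) - two-dimensional table of counts for every combination of x and y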
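
A minimal sketch of one- and two-dimensional tables, using invented sex and smoker columns. It also shows that table() silently drops NAs by default.

```r
# Hypothetical toy data with one missing value in "sex".
df <- data.frame(
  sex    = c("f", "m", "f", NA, "m", "f"),
  smoker = c("no", "yes", "no", "no", "yes", "yes"),
  stringsAsFactors = FALSE
)

counts    <- table(df$sex)                   # NA is not counted
counts_na <- table(df$sex, useNA = "ifany")  # adds an <NA> entry when NAs exist
counts
counts_na

# Cross-tabulation: rows are values of sex, columns are values of smoker.
cross <- table(df$sex, df$smoker)
cross
```

If sum(table(x)) is smaller than length(x), the difference is the number of NAs.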


Logical tests

  • any(x > 10) - is at least one element TRUE?
  • all(x > 10) - are all elements TRUE?
  • which(x > 10) - indices of the elements that are TRUE
  • which(is.na(x)) - indices of the NAs
  • combine tests with ! (not), & (and), | (or):
    • which(!is.na(x) & x > 10)
  • sum(is.na(x)) - number of NAs
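
The tests above can be tried on a small made-up vector that contains one NA:

```r
# Hypothetical vector with a missing value in position 3.
x <- c(5, 12, NA, 20, 3)

has_big  <- any(x > 10)                # TRUE: 12 and 20 exceed 10
all_big  <- all(x > 10)                # FALSE: 5 and 3 do not
big_idx  <- which(x > 10)              # which() skips NAs
na_idx   <- which(is.na(x))
both_idx <- which(!is.na(x) & x > 10)  # explicit NA guard, same indices here
n_missing <- sum(is.na(x))             # TRUE counts as 1 when summed

# Caveat: with NAs and no TRUE present, any() returns NA rather than FALSE.
unknown <- any(c(1, NA) > 10)
known   <- any(c(1, NA) > 10, na.rm = TRUE)
```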


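Summarizing by columns or rows

  • rowSums(x), rowMeans(x) - sum and mean of each row
  • colSums(x), colMeans(x) - sum and mean of each column
  • all four need numeric input and return NA for a row/column containing NAs unless na.rm = TRUE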
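
A brief sketch on a made-up 2 x 3 matrix with one NA. Combining colSums with is.na is one convenient way to count missing values per column.

```r
# Hypothetical matrix, filled column by column:
#      [,1] [,2] [,3]
# [1,]    1   NA    5
# [2,]    2    4    6
m <- matrix(c(1, 2, NA, 4, 5, 6), nrow = 2)

row_totals       <- rowSums(m)                 # first row is NA because of the NA
row_totals_clean <- rowSums(m, na.rm = TRUE)   # NAs are skipped
col_means        <- colMeans(m, na.rm = TRUE)

# Number of missing values in each column.
na_per_col <- colSums(is.na(m))
```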


Source