Q-Q Plot

Probability Plot

A probability plot is a graphical technique for comparing two data sets:

  • e.g. two empirical samples
  • or an empirical sample vs a theoretical distribution
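
Both kinds of comparison can be sketched in base R. This is only an illustration: the samples `a` and `b` are simulated and hypothetical, not taken from any data set used on this page.

```r
# Two hypothetical samples (simulated purely for illustration)
set.seed(42)
a <- rnorm(200, mean = 0, sd = 1)
b <- rnorm(150, mean = 1, sd = 2)

# Empirical vs empirical: qqplot() from the stats package plots the
# sorted quantiles of one sample against those of the other
qq <- qqplot(a, b, xlab = "Quantiles of a", ylab = "Quantiles of b")
abline(0, 1, lty = 2)  # reference line y = x (identical distributions)

# Empirical vs theoretical: sorted sample against standard normal quantiles
theo <- qnorm(ppoints(length(a)))
plot(theo, sort(a),
     xlab = "Theoretical quantiles", ylab = "Sample quantiles of a")
abline(0, 1, lty = 2)
```

When the two samples have different sizes, `qqplot()` interpolates the larger one so that both sets of quantiles have the same length.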

Commonly used:

  • P-P plot, "Probability-Probability" or "Percent-Percent" plot;
  • Q-Q plot, "Quantile-Quantile" plot, which is more commonly used.
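
The difference between the two can be seen in a small sketch: a P-P plot compares cumulative probabilities, while a Q-Q plot compares quantiles. The sample `x` below is simulated and hypothetical, and the reference distribution is assumed to be a normal with the sample's own mean and standard deviation.

```r
set.seed(7)
x <- rnorm(100, mean = 5, sd = 2)   # hypothetical sample
n <- length(x)
p <- ppoints(n)                     # plotting positions in (0, 1)

old.par <- par(mfrow = c(1, 2))

# P-P plot: theoretical CDF at each sorted observation
# against the empirical cumulative probabilities
pp.theo <- pnorm(sort(x), mean = mean(x), sd = sd(x))
plot(p, pp.theo, main = "P-P plot",
     xlab = "Empirical probability", ylab = "Theoretical probability")
abline(0, 1)

# Q-Q plot: sorted observations against the theoretical
# quantiles at the same plotting positions
qq.theo <- qnorm(p, mean = mean(x), sd = sd(x))
plot(qq.theo, sort(x), main = "Q-Q plot",
     xlab = "Theoretical quantiles", ylab = "Sample quantiles")
abline(0, 1)

par(old.par)
```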


Normal Probability Plot

It's a special case of Q-Q plots:

  • a Q-Q plot against the standard normal distribution;


The normal probability plot is formed by:

  • Vertical axis: Ordered response values
  • Horizontal axis: Normal order statistic medians or means (see rankit [1])


Constructing

  1. order the observations
  2. determine the percentile for each
  3. identify the $z$-score for each percentile
  4. create a scatterplot
    • observation (vertical) vs
    • $z$-score (horizontal)


If the data are normally distributed, the points fall approximately on a straight line: the ordered observations grow linearly with the $z$-scores of their percentiles.
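
The construction steps above can be sketched by hand in R. This is a minimal sketch on a simulated, hypothetical sample `x`; it assumes the plotting positions returned by `ppoints()` as the percentiles, which is also what `qqnorm()` uses internally.

```r
# Hypothetical sample; any numeric vector works
set.seed(1)
x <- rnorm(50, mean = 10, sd = 2)

x.sorted <- sort(x)              # 1. order the observations
p <- ppoints(length(x.sorted))   # 2. percentile (plotting position) for each
z <- qnorm(p)                    # 3. z-score for each percentile

# 4. scatterplot: observation (vertical) vs z-score (horizontal)
plot(z, x.sorted,
     xlab = "Theoretical quantiles (z)", ylab = "Ordered observations")

# The same horizontal coordinates are produced by qqnorm()
# (it returns them in the original order of x, hence the sort)
same <- isTRUE(all.equal(z, sort(qqnorm(x, plot.it = FALSE)$x)))
```

For a roughly normal sample the points lie close to a line whose intercept and slope estimate the mean and standard deviation.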


R

Example 1

Evaluating the Normal Distribution (see [2])

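# load the body dimensions data set from OpenIntro
load(url("http://www.openintro.org/stat/data/bdims.RData"))
# keep only the rows with sex == 0
fdims = subset(bdims, sex == 0)

# normal Q-Q plot of height, with a reference line
qqnorm(fdims$hgt, col="orange", pch=19)
qqline(fdims$hgt, lwd=2)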

[Figure: normal Q-Q plot of fdims$hgt with reference line]

Does it look like a sample from a true normal distribution?

  • it does
  • let's simulate a normal sample with the same size, mean and standard deviation and compare

set.seed(123)
sim.norm = rnorm(n=length(fdims$hgt), mean=mean(fdims$hgt), sd=sd(fdims$hgt))
qqnorm(sim.norm, col="orange", pch=19, main="Normal Q-Q Plot of simulated data")
qqline(sim.norm, lwd=2)

[Figure: normal Q-Q plot of the simulated data with reference line]


We can also plot several simulations next to the data for comparison:

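qqnormsim = function(dat, dim=c(2,2)) {
  # first panel: the data; remaining panels: simulated normal samples
  # with the same size, mean and standard deviation as the data
  old.par = par(mfrow=dim)
  on.exit(par(old.par))  # restore the plot layout even on error
  qqnorm(dat, main="Normal QQ Plot (Data)")
  qqline(dat)
  # seq_len() yields no iterations when only one panel is requested
  for (i in seq_len(prod(dim) - 1)) {
    simnorm = rnorm(n=length(dat), mean=mean(dat), sd=sd(dat))
    qqnorm(simnorm, main="Normal QQ Plot (Sim)")
    qqline(simnorm)
  }
}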
qqnormsim(fdims$hgt)

[Figure: normal Q-Q plots of fdims$hgt and three simulated samples]

The data panel looks much like the simulated ones, so the heights do appear to be approximately normal


Example 2

(Same data set as in Example 1)

Let's take a look at another variable, weight

hist(fdims$wgt)

[Figure: histogram of fdims$wgt]

Looks a bit skewed

qqnorm(fdims$wgt, col="orange", pch=19)
qqline(fdims$wgt, lwd=2)

[Figure: normal Q-Q plot of fdims$wgt with reference line]

qqnormsim(fdims$wgt)

[Figure: normal Q-Q plots of fdims$wgt and three simulated samples]

The weights are most likely not normally distributed


Sources