# ML Wiki

## Q-Q Plot

### Probability Plot

A probability plot is a graphical technique for comparing two data sets:

• e.g. two sets of empirical observations
• or an empirical set vs a theoretical distribution

Commonly used:

• P-P plot, "Probability-Probability" or "Percent-Percent" plot;
• Q-Q plot, "Quantile-Quantile" plot, which is more commonly used.
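As a rough illustration of the difference between the two plots, the sketch below draws both for a pair of samples. The samples `x` and `y` are simulated here purely for illustration (they are assumptions, not data from this page). The P-P plot is built by hand from the two empirical CDFs, since base R has no dedicated P-P function.

```r
set.seed(1)
x <- rnorm(200, mean = 0, sd = 1)   # first sample (hypothetical)
y <- rnorm(200, mean = 1, sd = 2)   # second sample (hypothetical)

# Q-Q plot: sorted values (quantiles) of x against those of y
qq <- qqplot(x, y, main = "Q-Q plot of two samples")

# P-P plot: the two empirical CDFs evaluated on a common grid of values
grid <- sort(c(x, y))
pp_x <- ecdf(x)(grid)
pp_y <- ecdf(y)(grid)
plot(pp_x, pp_y, xlab = "ECDF of x", ylab = "ECDF of y",
     main = "P-P plot of two samples")
abline(0, 1)   # reference line: identical distributions
```

In the Q-Q plot, two samples that differ only in location and scale still give a straight line (just not the identity line), which is one reason it tends to be the easier plot to read.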

### Normal Probability Plot

It's a special case of Q-Q plots:

• a Q-Q plot against the standard normal distribution;

The normal probability plot is formed by:

• Vertical axis: Ordered response values
• Horizontal axis: Normal order statistic medians or means (see rankit [1])
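Rankits are the expected values of the order statistics of a standard normal sample. A minimal sketch of how they could be computed is given below. The helper `rankit_exact` is a hypothetical name introduced here; it integrates the density of the order statistic numerically, and the result is compared with Blom's approximation $\Phi^{-1}\big((i - 0.375) / (n + 0.25)\big)$.

```r
# Expected value of the i-th order statistic of n standard normal draws,
# computed by numerical integration (rankit_exact is a hypothetical helper name)
rankit_exact <- function(i, n) {
  integrand <- function(x) {
    x * i * choose(n, i) * dnorm(x) * pnorm(x)^(i - 1) * (1 - pnorm(x))^(n - i)
  }
  integrate(integrand, lower = -Inf, upper = Inf)$value
}

n <- 5
exact <- sapply(1:n, rankit_exact, n = n)

# Blom's approximation to the same quantities
blom <- qnorm((1:n - 0.375) / (n + 0.25))

round(cbind(exact, blom), 3)
```

In practice the approximation is usually close enough that the choice between means and medians barely changes the look of the plot.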

Constructing

1. order the observations
2. determine the percentile for each
3. identify the $z$-score for each percentile
4. create a scatterplot of
• observation (vertical) vs
• $z$-score (horizontal)
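The four steps above can be sketched by hand in R. The vector `obs` is simulated for illustration only (an assumption), and `ppoints()` is used for the percentiles because it is the plotting-position rule that `qqnorm()` itself relies on.

```r
set.seed(42)
obs <- rnorm(50, mean = 170, sd = 7)   # hypothetical observations

obs_sorted <- sort(obs)                # 1. order the observations
n <- length(obs_sorted)
pct <- ppoints(n)                      # 2. percentile (plotting position) of each
z <- qnorm(pct)                        # 3. z-score of each percentile

# 4. scatterplot: observation (vertical) vs z-score (horizontal)
plot(z, obs_sorted, xlab = "Theoretical Quantiles", ylab = "Sample Quantiles",
     main = "Normal Q-Q Plot (built by hand)")

# The built-in function returns the same horizontal coordinates
builtin <- qqnorm(obs, plot.it = FALSE)
all.equal(sort(builtin$x), z)
```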

If the data are normally distributed, the points should fall approximately on a straight line: each ordered observation is then roughly a linear function of its $z$-score.
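For a sample from $N(\mu, \sigma^2)$ the ordered observations are roughly $\mu + \sigma z$, so the line's intercept estimates the mean and its slope the standard deviation. A small sketch of this, using simulated data with an assumed mean of 10 and standard deviation of 2:

```r
set.seed(7)
x <- rnorm(1000, mean = 10, sd = 2)    # simulated normal sample (assumed parameters)

z <- qnorm(ppoints(length(x)))         # theoretical z-scores
fit <- lm(sort(x) ~ z)                 # straight line through the Q-Q points

coef(fit)                              # intercept near the mean, slope near the sd
```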

## R

### Example 1

Evaluating the Normal Distribution (see [2])

load(url("http://www.openintro.org/stat/data/bdims.RData"))
fdims = subset(bdims, bdims$sex == 0)
qqnorm(fdims$hgt, col="orange", pch=19)
qqline(fdims$hgt, lwd=2)

Does it look similar to a real normal distribution?

• it does
• let's simulate the normal distribution and compare

set.seed(123)
sim.norm = rnorm(n=length(fdims$hgt), mean=mean(fdims$hgt), sd=sd(fdims$hgt))
qqnorm(sim.norm, col="orange", pch=19, main="Normal Q-Q Plot of simulated data")
qqline(sim.norm, lwd=2)


We can also plot several simulations alongside the data

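# Plot the data's normal Q-Q plot next to Q-Q plots of simulated normal samples
qqnormsim = function(dat, dim=c(2,2)) {
  par(mfrow=dim)
  qqnorm(dat, main="Normal QQ Plot (Data)")
  qqline(dat)
  # fill the remaining panels with samples drawn from a normal distribution
  # that has the same size, mean and sd as the data
  # (seq_len avoids a spurious iteration when there is only one panel)
  for (i in seq_len(prod(dim) - 1)) {
    simnorm <- rnorm(n=length(dat), mean=mean(dat), sd=sd(dat))
    qqnorm(simnorm, main = "Normal QQ Plot (Sim)")
    qqline(simnorm)
  }
  par(mfrow=c(1, 1))  # restore the single-panel layout
}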
qqnormsim(fdims$hgt)

Looks like it's indeed normal

### Example 2

(Same data set as in example 1)

Let's take a look at another variable, `wgt`

hist(fdims$wgt)


Looks a bit skewed

qqnorm(fdims$wgt, col="orange", pch=19)
qqline(fdims$wgt, lwd=2)


qqnormsim(fdims$wgt)


Most likely not normal
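To see what skew does to the plot without downloading the data, here is a sketch on a simulated right-skewed sample. The log-normal sample `skewed` is an assumption standing in for a skewed variable; it is not the `bdims` data. The code rebuilds the reference line that `qqline()` draws through the first and third quartiles and measures how far the two extreme points sit from it.

```r
set.seed(2024)
skewed <- rlnorm(500, meanlog = 4, sdlog = 0.4)   # hypothetical right-skewed sample

qq <- qqnorm(skewed, col = "orange", pch = 19,
             main = "Normal Q-Q Plot (simulated skewed data)")
qqline(skewed, lwd = 2)

# Rebuild the qqline() reference line through the first and third quartiles
probs <- c(0.25, 0.75)
slope <- diff(quantile(skewed, probs, names = FALSE)) / diff(qnorm(probs))
intercept <- quantile(skewed, 0.25, names = FALSE) - slope * qnorm(0.25)

# Vertical distance of every point from the line
resid <- qq$y - (intercept + slope * qq$x)
lowest <- which.min(qq$x)
highest <- which.max(qq$x)
c(lower_tail = resid[lowest], upper_tail = resid[highest])
```

Both distances come out positive: the points bend above the line at both ends, which is the curved pattern a right-skewed sample typically produces.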