Confidence Intervals for Means


We want to build a Confidence Interval around a Point Estimate of the population mean.

Problem

let $\mu$ be the true mean parameter
$X_1, …, X_n$ our sample of size $n$
$\bar{X}$ - the average value for the sample $\bar{X} = \cfrac{1}{n} \sum_{i=1}^n X_i$
we want to estimate $\mu$ using $\bar{X}$ and a Confidence Interval around it

Normal Model

With a sufficiently large sample and no violations of the assumptions listed below, we can use the Normal Distribution to model the Sampling Distribution of the mean

note that in practice it's usually better to use the $t$ statistic described below

Normal Approximation

Normal approximation is crucial for this - because we use Normal Distribution to find percentiles

Assumptions

sample observations are independent
the distribution is not strongly skewed and there are few outliers
sample size is sufficiently large (e.g. $\geqslant 30$)
- the larger the sample, the more skew we can tolerate (thanks to the C.L.T.)
then the sampling distribution of the mean is symmetric, unimodal and without outliers, i.e. approximately normal

If these conditions are met, we can use the Normal Model to find the confidence intervals

Example

We have a data set that contains data about the whole population (the `Gr.Liv.Area` column of the `ames` data set in the code below)

Suppose we draw 10,000 random samples of size 50 from it

for each sample we calculate the mean
then we draw the histogram of these means, which gives us the sampling distribution
we see that it looks normal, and we can also draw the Normal Probability Plot (Q-Q plot) to check that this is indeed the case

R code to reproduce the experiment

```r
load(url('http://s3.amazonaws.com/assets.datacamp.com/course/dasi/ames.RData'))
population = ames$Gr.Liv.Area

# Histogram of the population with a density curve and a boxplot underneath
oldpar = par(no.readonly=TRUE)
# fig=c(x1, x2, y1, y2)
par(fig=c(0, 1, 0, 1))
par(mar=c(6, 2, 2, 1))
h = hist(population, col='grey', probability=T, axes=F, xlab='', main='Histogram')
dens = density(population, adjust=2)
lines(dens, col="black", lwd=2)
axis(side=1, pos=c(-max(h$density)/4,0))
axis(side=2)

par(fig=c(0, 1, 0.16, 0.41), new=TRUE)
par(mar=c(0, 2, 0, 1))
boxplot(population, horizontal=TRUE, axes=F, col='grey')
par(oldpar)

# Sampling distribution: means of 10,000 samples of size n
set.seed(1237)
n = 50
samp.mean = replicate(10000, mean(sample(population, n)))

plot(x=NA, y=NA, xlim=c(1250, 1750), ylim=c(0, 0.006), axes=F,
     xlab='Estimate of mean', ylab='Frequency', main='Sampling Distribution of mean')

m = mean(samp.mean)
s = sd(samp.mean)

# Shaded bands at 1, 2 and 3 standard deviations around the mean
par(xpd=FALSE)
rect(xleft=m-3*s, xright=m+3*s, ybottom=-1, ytop=1, border=NA, col=adjustcolor('blue', 0.1))
rect(xleft=m-2*s, xright=m+2*s, ybottom=-1, ytop=1, border=NA, col=adjustcolor('blue', 0.1))
rect(xleft=m-s, xright=m+s, ybottom=-1, ytop=1, border=NA, col=adjustcolor('blue', 0.1))

hist(samp.mean, probability=T, add=T, breaks=50, col='white')
axis(side = 1)

# Normal density with the same mean and sd, drawn over the histogram
dens = dnorm(1200:1800, mean=m, sd=s)
lines(1200:1800, dens, col="blue", lwd=2)

# Normal Probability Plot of the sample means
qqnorm(samp.mean, col=adjustcolor('orange', 0.1))
qqline(samp.mean)
```

In this case all the assumptions hold, so we can use the Normal Approximation to calculate the confidence intervals

Model

$E[\bar{X}] = \mu$, i.e. $\bar{X}$ is an unbiased estimate of the mean
Variance: $\text{var}(\bar{X}) = \cfrac{\sigma^2}{n}$, so the Standard Error is $\text{SE}(\bar{X}) = \sqrt{\sigma^2 / n}$
by C.L.T. we have $\bar{X} \approx N\left(\mu, \cfrac{\sigma^2}{n}\right)$
therefore
- $\cfrac{\bar{X} - \mu}{\sqrt{\sigma^2 / n}} \approx N(0, 1)$

So the 95% CI with $z$-score is:

$\left[ \bar{X} - 1.96 \sqrt{\sigma^2 / n}; \bar{X} + 1.96 \sqrt{\sigma^2 / n} \right]$
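
As a minimal sketch of this interval in R, assuming $\sigma$ is known (which is rarely the case in practice): the values of `xbar`, `sigma` and `n` below are made up purely for illustration.

```r
# Hypothetical inputs, chosen only for illustration
xbar  <- 1500   # sample mean
sigma <- 500    # population standard deviation, assumed known here
n     <- 50     # sample size

z  <- qnorm(0.975)        # two-sided 95% critical value, about 1.96
se <- sigma / sqrt(n)     # standard error of the mean
ci <- xbar + c(-1, 1) * z * se
ci
```

For a different confidence level, only the argument of `qnorm` changes, e.g. `qnorm(0.995)` for a 99% interval.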

Estimating $\sigma$

To compute a CI we need to know $\sigma^2$, but it’s a parameter - we need to estimate it

From the sample we can compute
- the sample variance $s^2 = \cfrac{1}{n - 1} \sum_{i=1}^n (X_i - \bar{X})^2$
- the sample standard deviation $s = \sqrt{s^2}$
$\sigma^2$ is unknown. Can we substitute it by $s^2$?
- $s^2$ is an unbiased estimate of $\sigma^2$
- $E[s^2] = \sigma^2$ (this is the reason for $n - 1$ instead of $n$)
- so yes, we can, but then it's better to use the $t$-distribution (described below) together with $s^2$
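
The effect of the $n - 1$ divisor can be checked with a small simulation. This is only a sketch: the true variance, the sample size, the number of replications and the seed are arbitrary choices, not values from the text.

```r
set.seed(1)        # arbitrary seed, for reproducibility
sigma2 <- 4        # true variance (illustrative choice)
n <- 5             # a small sample size makes the bias visible

# Draw many samples and compute both variance estimators for each one
sims <- replicate(20000, {
  x <- rnorm(n, mean = 0, sd = sqrt(sigma2))
  c(unbiased = var(x),                     # var() divides by n - 1
    biased   = sum((x - mean(x))^2) / n)   # divides by n
})

# The first average should be close to sigma2,
# the second close to sigma2 * (n - 1) / n
avg <- rowMeans(sims)
avg
```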

Using $t$ Statistic

To use the normal approximation we need a sufficiently large sample

what if we don't have it?
use the $t$-statistic, which follows the $t$-distribution
- note that with many degrees of freedom the $t$-distribution very closely resembles $N(0,1)$

$t$-distribution

For normally distributed data, the statistic $\cfrac{\bar{X} - \mu}{\sqrt{s^2 / n}}$ follows the $t$-distribution with $n - 1$ degrees of freedom
This distribution is similar to the normal, but not quite: it's a little wider and allows for more uncertainty

Use the $t$-distribution rather than the normal distribution when the variance is not known and has to be estimated from the sample data.

Shape of $t$ vs Normal:

When the sample size is large, e.g. $\geqslant$ 100, the $t$-distribution is very similar to $N(0,1)$
for smaller sizes, $t$ is "leptokurtic": it has relatively more probability mass in its tails than the normal distribution
- $\Rightarrow$ you have to extend farther from the mean to span a given proportion of the area.
for $N(0,1)$, 95% of the area is within 1.96 standard deviations of the mean
for $t$, if the sample size is only 5 (so $df = 4$), 95% of the area is within 2.78 standard deviations of the mean
$\Rightarrow$ the SE of the mean would be multiplied by 2.78 rather than 1.96.
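
The critical values quoted above can be reproduced with `qt` and `qnorm`. The particular degrees of freedom in `dfs` are just an illustrative selection.

```r
z_crit <- qnorm(0.975)             # normal 95% critical value

dfs    <- c(4, 9, 29, 99, 399)     # illustrative degrees of freedom
t_crit <- qt(0.975, df = dfs)      # matching t critical values

# t critical values shrink toward the normal one as df grows
data.frame(df = dfs,
           t_crit = round(t_crit, 3),
           z_crit = round(z_crit, 3))
```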

$t$-Statistic Confidence Intervals

Thus

we replace $\sigma^2$ with $s^2$ and
use the $t$-distribution with $n-1$ degrees of freedom
- i.e. replace the $z$-score with the critical value $t = t_{n - 1}$

95% CI becomes

$\left[\bar{X} - t \cdot \sqrt{s^2 / n}; \bar{X} + t \cdot \sqrt{s^2 / n}\right]$
e.g. when we have $n = 400$ (therefore $df=399$), $t \approx 1.966$

R-code

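In R:

```r
# d is a numeric vector holding the sample
n = length(d)
xbar = mean(d)
v = var(d)
t = qt(0.025, df=n-1, lower.tail=F)   # upper 2.5% critical value
ME = t * sqrt(v / n)                  # margin of error
xbar + c(-ME, ME)
```

or:

```r
t.test(d, conf.level=0.95)$conf.int
```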

The last chunk actually uses the [$t$-test](t-test) and returns its confidence interval
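
To see that the manual computation and `t.test` agree, here is a sketch on a simulated sample. The sample `d` (60 draws from a normal distribution with arbitrary parameters) and the seed are assumptions made for illustration; `conf.level` and `conf.int` are the argument and result names used by base R's `t.test`.

```r
set.seed(42)                         # arbitrary seed
d <- rnorm(60, mean = 10, sd = 2)    # hypothetical sample

# Manual 95% CI using the t critical value
n      <- length(d)
t_crit <- qt(0.975, df = n - 1)
manual <- mean(d) + c(-1, 1) * t_crit * sqrt(var(d) / n)

# CI reported by t.test; as.numeric() drops the conf.level attribute
builtin <- as.numeric(t.test(d, conf.level = 0.95)$conf.int)

all.equal(manual, builtin)
```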


### Examples
Example 1:
- The mean for 51 observations was 20
- The sample variance was 5
- Calculate the 99% CI for the mean

```r
n = 51
xbar = 20
v = 5
t = qt(0.005, df=n-1, lower.tail=F)   # df = 50
ME = t * sqrt(v / n)                  # divide by the sample size n, not by df
xbar + c(-ME, ME)
# [19.16, 20.84]
```
