Confidence Intervals
In Inferential Statistics we estimate a parameter of the population based on a sample
- a Point Estimate is just a single plausible value
- it’s a good idea to widen it and build a confidence interval (CI) around the point estimate
- the Standard Error measures the uncertainty in the Point Estimate and determines the width of this interval
Main idea - the CI should include the real parameter
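A minimal sketch of the idea for a sample mean: take the point estimate and widen it by a multiple of the Standard Error. The data here are hypothetical (simulated Normal draws), and the variable names are chosen only for illustration.

```r
# Hypothetical sample: 40 draws from a Normal population (illustration only)
set.seed(1)
x <- rnorm(40, mean = 10, sd = 3)

n <- length(x)
point_estimate <- mean(x)      # single plausible value
se <- sd(x) / sqrt(n)          # standard error of the sample mean
z <- qnorm(0.975)              # about 1.96 for a 95% interval

# interval = point estimate plus/minus z * SE
ci <- c(lower = point_estimate - z * se,
        upper = point_estimate + z * se)
ci
```

The interval is symmetric around the point estimate, and its width is driven entirely by the Standard Error and the chosen multiplier.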
Confidence Level
The ‘‘confidence level’’ is the degree of confidence we have that the interval will span the true parameter
- e.g. a 95% confidence interval is built by a procedure that captures the true parameter with probability 0.95 - i.e. in about 1 case out of 20 it will miss the real parameter
The idea of Sampling Distribution is important here
- we use it to calculate percentiles of the possible values, as if the sampling distribution were centered at our point estimate
- so the CI should span the true value
Example
- we want to estimate the mean
- suppose we happen to know the sampling distribution: it’s $N(\mu = 10, \sigma = 3.3)$
- it’s centered around the population mean $\mu$
- and the Standard Error is 3.3
- we draw a Point Estimate from the sampling distribution
- we get $\bar{X} = 5.5$
- assuming that the sampling distribution is centered around 5.5, we compute the 95% CI
- the $z$-value is 1.96, so the interval is (-0.97, 11.97)
- it includes the true value $\mu=10$
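The arithmetic behind the interval in this example, written out with the values given above (point estimate 5.5, Standard Error 3.3, $z$-value 1.96):

$$\bar{X} \pm z \cdot SE = 5.5 \pm 1.96 \times 3.3 = 5.5 \pm 6.468 \;\Rightarrow\; (-0.968,\ 11.968)$$

Rounded to two decimals, the bounds are -0.97 and 11.97, and $-0.97 < 10 < 11.97$.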
R code
```r
x = seq(-10, 25, 0.3)
m = 10     # true mean
se = 3.3   # standard error

# true sampling distribution (dashed)
plot(x, dnorm(x, mean=m, sd=se), type='l', bty='n', lty=2, ylab='')
abline(v=m, lty=2)

# sampling distribution re-centered at the observed point estimate (red)
m.observed = 5.5
abline(v=m.observed, col='red')
dy = dnorm(x, mean=m.observed, sd=se)
lines(x, y=dy, col='red')

# 95% CI bounds
lo = m.observed - 1.96 * se
hi = m.observed + 1.96 * se
c(lo, hi)

# shade the area under the red curve between the bounds
x1 = min(which(x >= lo)); x2 = max(which(x <= hi))
polygon(x[c(x1, x1:x2, x2)], c(0, dy[x1:x2], 0), col=adjustcolor('red', 0.4), border=NA)

# annotate; xpd=NA allows drawing outside the plot region
par(xpd=NA)
text(m, 0.13, m)
text(m.observed, 0.13, m.observed)
arrows(x0=lo, y0=0.02, x1=hi, y1=0.02, code=3, length=0.15)
text(m.observed, 0.02-0.005, 'confidence interval', cex=0.7)
par(xpd=FALSE)
```
A confidence interval consists of two parts
- left part - ‘‘lower bound’’
- right part - ‘‘upper bound’’
“95% confident” means that if we took many samples from the population and built a CI from each, then about 95% of these CIs would contain the actual parameter being estimated (e.g. $p$ for a binomial proportion, $\mu$ for a mean)
So we see that some CIs indeed don’t include the true value, but we’re 95% confident that a CI calculated from one sample will include it
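The repeated-sampling reading of “95% confident” can be checked with a small simulation. This is only a sketch under assumed conditions: a Normal population with a known mean (`mu`, `sigma`, `n` and `reps` are made-up values), so that we can count how often the interval captures the truth.

```r
# Assumed setup for illustration: Normal population with known mean
set.seed(42)
mu <- 10; sigma <- 3
n <- 50          # sample size
reps <- 10000    # number of repeated samples

covered <- replicate(reps, {
  x <- rnorm(n, mean = mu, sd = sigma)
  me <- qnorm(0.975) * sd(x) / sqrt(n)   # margin of error of this sample
  (mean(x) - me <= mu) && (mu <= mean(x) + me)
})

coverage <- mean(covered)   # fraction of intervals that contain mu
coverage
```

The resulting fraction should be close to 0.95; it is usually slightly lower because the $z$ score is used together with an estimated standard deviation.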
R code to produce the figure
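```r
# load the Ames housing data
load(url('http://s3.amazonaws.com/assets.datacamp.com/course/dasi/ames.RData'))
population = ames$Gr.Liv.Area

set.seed(1237)
n = 50
# each column is one sample of size n
sampl = replicate(51, sample(population, n))

sampl.sd = apply(sampl, MARGIN=2, sd)
sampl.m = apply(sampl, MARGIN=2, mean)

# margin of error of each sample's 95% CI
me = 1.96 * sampl.sd / sqrt(n)
# plot_ci is not a base R function; it is presumably loaded together with the data
plot_ci(sampl.m - me, sampl.m + me, mean(population))
```
Margin Of Error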
If the Sampling Distribution is symmetric (e.g. Normal Distribution or t-Distribution), the CI bounds are the point estimate minus and plus the ‘‘margin of error’’
- the ‘‘margin of error’’ is typically a critical value (a $z$ or $t$ score) multiplied by the Standard Error
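A short sketch of the margin of error computed with a $z$ score and with a $t$ score. The sample values are invented for illustration; `qnorm` and `qt` are the base R quantile functions for the Normal and $t$ distributions.

```r
# Hypothetical small sample (values invented for illustration)
x <- c(12.1, 9.8, 11.4, 10.3, 8.7, 10.9, 9.5, 11.8)
n <- length(x)
se <- sd(x) / sqrt(n)                 # standard error of the mean

me_z <- qnorm(0.975) * se             # margin of error with a z score
me_t <- qt(0.975, df = n - 1) * se    # margin of error with a t score

rbind(z = c(lower = mean(x) - me_z, upper = mean(x) + me_z),
      t = c(lower = mean(x) - me_t, upper = mean(x) + me_t))
```

For a small sample the $t$ score is larger than the $z$ score, so the $t$-based interval is wider. The $t$-based bounds should agree with `t.test(x)$conf.int`.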
Critical Value
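The Critical Value is the $z$ or $t$ score that corresponds to the chosen confidence level
- for a two-sided CI with confidence level $1 - \alpha$, each tail has area $\alpha / 2$
- e.g. a tail area of 0.025 ($\alpha = 0.05$) gives a 95% CI with $z \approx 1.96$, while a tail area of 0.05 ($\alpha = 0.10$) gives a 90% CI with $z \approx 1.645$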
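To make the link between confidence level, tail area and critical value concrete, here is a small sketch for two-sided Normal-based intervals. A confidence level of $1 - \alpha$ leaves $\alpha / 2$ in each tail; the variable names are chosen for illustration.

```r
conf_level <- c(0.90, 0.95, 0.99)
alpha <- 1 - conf_level           # total area outside the interval
z_star <- qnorm(1 - alpha / 2)    # alpha / 2 in each tail

data.frame(conf_level, tail_area = alpha / 2, z_star = round(z_star, 3))
```

The critical values come out as roughly 1.645, 1.960 and 2.576, so a higher confidence level means a wider interval.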
Types
Main types:
Statistical Simulation
It’s not always possible to calculate everything with traditional methods
- but when we know the truth and can control it, we can simulate and build the Sampling Distribution, and get the CIs from it
- also, Bootstrapping (a Resampling method) is a powerful strategy for calculating CIs
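As an illustration of the bootstrap approach, here is a sketch of a percentile bootstrap CI for a mean. The percentile method is only one of several bootstrap variants, and the sample is hypothetical (simulated exponential draws); `B` is an assumed number of resamples.

```r
# Hypothetical skewed sample (exponential draws, illustration only)
set.seed(7)
x <- rexp(60, rate = 1 / 5)

B <- 5000   # number of bootstrap resamples
# resample with replacement and record the mean of each resample
boot_means <- replicate(B, mean(sample(x, replace = TRUE)))

# percentile bootstrap: 2.5th and 97.5th percentiles of the resampled means
ci_boot <- quantile(boot_means, probs = c(0.025, 0.975))
ci_boot
```

No formula for the Standard Error is needed here: the spread of `boot_means` plays the role of the Sampling Distribution.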
Extra Stuff
Robustness
A method for constructing CIs is ‘‘robust’’ if
- the resulting CIs include the theoretical parameter in approximately the percentage of cases claimed by the confidence level
- even if not all necessary conditions for the CIs are satisfied
CIs based on the $t$-distribution are quite robust: they work well for Normal data and remain approximately valid for moderately skewed data, especially with larger samples
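Robustness can be explored by simulation: build many $t$-based CIs from a Normal population and from a skewed one, and compare the observed coverage with the claimed 95%. This is a rough sketch; `coverage_t` is a hypothetical helper, and the populations, sample size and number of repetitions are assumptions chosen for illustration.

```r
set.seed(123)

# hypothetical helper: fraction of 95% t-intervals that contain true_mean
coverage_t <- function(rdist, true_mean, n, reps = 5000) {
  hits <- replicate(reps, {
    ci <- t.test(rdist(n))$conf.int
    ci[1] <= true_mean && true_mean <= ci[2]
  })
  mean(hits)
}

# Normal population with mean 1 vs. exponential (right-skewed) with mean 1
cov_normal <- coverage_t(function(n) rnorm(n, mean = 1), true_mean = 1, n = 30)
cov_skewed <- coverage_t(function(n) rexp(n, rate = 1), true_mean = 1, n = 30)
c(normal = cov_normal, skewed = cov_skewed)
```

For the Normal population the coverage should be close to 0.95; for the skewed one it is typically somewhat lower, and the gap shrinks as `n` grows.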
Relationship with Hypothesis Testing
Additional Resources
- applet for simulating CIs
- another applet
See Also
Sources
- Statistics: Making Sense of Data (coursera)
- OpenIntro Statistics (book)
- https://en.wikipedia.org/wiki/Confidence_interval