Binomial Proportion Confidence Intervals

These are Confidence Intervals for estimating a population proportion

  • When we sample, we calculate a Point Estimate of the proportion
  • Due to the variance of the Sampling Distribution, each sample gives a different estimate
  • How can we expand the point estimate into an interval that's likely to include the true value?


Normal Approximation

This type of CI makes use of the Central Limit Theorem and the Normal Approximation of the Binomial Distribution

So, for any experiment, let

  • $p$ be the true probability
  • $n$ be the number of trials

Then

  • estimate $p$ as $\hat{p} = \cfrac{\text{# of successes}}{n}$
  • and the CI is $\left[\hat{p} - 1.96 \sqrt{p(1-p)/n}; \hat{p} + 1.96 \sqrt{p(1-p)/n}\right]$
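For example, a quick sketch in R with made-up numbers (80 successes out of 200 trials), plugging $\hat{p}$ in for the unknown $p$ under the square root (more on this below):

n = 200
phat = 80 / n  # point estimate: 0.4
phat + c(-1.96, 1.96) * sqrt(phat * (1 - phat) / n)  # 0.332, 0.468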


Building a Confidence Interval

  • We have a sample of $n$ observations: $X_1, ..., X_{n}$
  • let $\hat{p} = $ the fraction of successful $X_i$, i.e. $\hat{p} = \cfrac{\text{# of successes}}{n}$
    • so we calculate $\hat{p}$ based on the data in the sample
  • if all observations are independent and the probability of success $p$ is always the same, then the Sampling Distribution of the number of successes is the Binomial Distribution
    • i.e. each $X_i \sim \text{Bernoulli}(p)$, variance $\text{Var}[X_i] = p \cdot (1 - p) = pq$
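To see this sampling variability, a small simulation sketch in R (with a made-up true $p = 0.6$):

set.seed(42)
p = 0.6; n = 100
phat = replicate(10000, mean(rbinom(n, size=1, prob=p)))  # 10000 point estimates
hist(phat, breaks=40, main='Sampling distribution of phat')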


Parameters of the Sampling Distribution

  • $\hat{p}$ is an Unbiased Estimate of $p$: $E[\hat{p}] = p$
    • $E[X_i] = p, \hat{p} = \cfrac{1}{n} \sum_{i=1}^n X_i$
    • $E[\hat{p}] = E \left[ \cfrac{1}{n} \sum_{i=1}^n X_i \right] = \cfrac{1}{n} \sum_{i=1}^n E [ X_i ] = \cfrac{np}{n} = p$
  • $\text{var}[\hat{p}] = \cfrac{p(1-p)}{n}$
    • $\text{var}[\hat{p}] = \text{var} \left[ \cfrac{1}{n} \sum_{i=1}^n X_i \right] = \cfrac{1}{n^2} \sum_{i=1}^n \text{var}[X_i] = \cfrac{npq}{n^2} = \cfrac{pq}{n} = \cfrac{p(1-p)}{n}$
    • $\text{sd}[ \hat{p} ] = \sqrt{ \cfrac{p \cdot (1 - p)}{n} }$
  • Now we use the Normal Approximation, i.e. apply the C.L.T.: the Sampling Distribution of $\hat{p}$ is approximately Normal, $N \left( \mu=p, \sigma = \sqrt{ \cfrac{p(1-p)}{n} } \right)$
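Both facts are easy to check numerically in R (with made-up $p$ and $n$):

set.seed(1)
p = 0.3; n = 50
phat = replicate(100000, mean(rbinom(n, 1, p)))
mean(phat)  # close to p = 0.3
var(phat)   # close to p * (1 - p) / n = 0.0042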


We want to build a CI at confidence level $1 - \alpha$


E.g. 95% CI

  • $z = 1.96$ and we know that 95% of the values of $N(0, 1)$ lie in $(-z, +z)$
  • 343067151a98437b821fae10709a8e52.png
  • So in only 5 experiments out of 100 do we end up outside this interval
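A quick sanity check in R:

pnorm(1.96) - pnorm(-1.96)  # 0.9500042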


R code to produce the figure  
x = seq(-3, 3, 0.1)
y = dnorm(x)  # standard Normal density

plot(x, y, type='l', bty='n', main='95% CI on N(0,1)')

# shade the central 95% area between -1.96 and 1.96
x1 = min(which(x > -1.96)); x2 = max(which(x < 1.96))
polygon(x[c(x1, x1:x2, x2)],
        c(0, y[x1:x2], 0), col=adjustcolor('blue', 0.4))

# label the central area and the two 0.025 tails
text(x=0, y=0.2, labels='0.95', cex=1.5)
text(x=c(-2.07, 2.07), y=0.025, labels='0.025', cex=0.6)



This is called the 95% confidence interval for $p$:

  • $\left[\hat{p} - 1.96 \sqrt{p(1-p)/n}; \hat{p} + 1.96 \sqrt{p(1-p)/n}\right]$
  • the left endpoint is the lower bound
  • the right endpoint is the upper bound


We say that we're 95% confident that the true value of $p$ is somewhere in this interval.
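To make this concrete: if we repeated the experiment many times, about 95% of the intervals we build would contain the true $p$. A simulation sketch in R (with a made-up true $p$):

set.seed(7)
p = 0.4; n = 500
covered = replicate(10000, {
  phat = mean(rbinom(n, 1, p))
  ME = 1.96 * sqrt(phat * (1 - phat) / n)
  (phat - ME <= p) & (p <= phat + ME)
})
mean(covered)  # close to 0.95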


Margin of Error

Problem: $p$ (to use under the square root) is unknown!

Solutions:

  • use $\hat{p}$ instead of $p$ (we assume it should be close), or
  • use $p = q = 0.5$: $p(1-p)$ is largest at $p = 0.5$, so this is the most conservative choice, maximizing our margin of error
    • the margin of error is $\beta = 1.96 \sqrt{p(1-p)/n}$
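A quick comparison of the two choices in R, using the beer cap numbers from the example below:

n = 1000; phat = 0.576
1.96 * sqrt(phat * (1 - phat) / n)  # 0.0306: plug-in estimate
1.96 * sqrt(0.5 * 0.5 / n)          # 0.0310: conservative p = q = 0.5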


Critical Value

Why did we choose a 95% CI with $\alpha = 0.05$ and not another one?

We can compute a confidence interval using any $\alpha$

  • Compute the critical value $z_{\alpha/2}$ such that each of the two "not interesting" tail areas under the normal curve takes $\alpha / 2$
    • ci-critical-value.png
  • so the interval will be $\left[-z_{\alpha/2}; z_{\alpha/2}\right]$ and the "interesting" area under the bell curve is $1 - \alpha$

The margin of error in this case is

  • $\beta = z_{\alpha/2} \sqrt{p(1-p)/n}$; as we know, for $95\%$, $\alpha = 0.05$, so $\alpha/2 = 0.025$ and $z_{0.025} = 1.96$
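In R the critical value comes from the Normal quantile function:

qnorm(1 - 0.10/2)  # 1.645 for 90%
qnorm(1 - 0.05/2)  # 1.960 for 95%
qnorm(1 - 0.01/2)  # 2.576 for 99%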


As we see, the CI becomes wider as the critical value grows


Assumptions

Note the limitations of using the C.L.T. here:

The central limit theorem applies poorly to this distribution when the sample size is less than 30 or when the proportion is close to 0 or 1. The normal approximation fails totally when the sample proportion is exactly zero or exactly one. A frequently cited rule of thumb is that the normal approximation is reasonable as long as $np > 5$ and $n(1 - p) > 5$.
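A tiny helper sketch for the rule of thumb (the function name is made up):

np_rule_ok = function(n, p) (n * p > 5) & (n * (1 - p) > 5)
np_rule_ok(1000, 0.576)  # TRUE
np_rule_ok(30, 0.05)     # FALSE: n * p = 1.5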


Examples

Flipping a Beer Cap

Imagine an experiment where we flip a beer cap

  • it follows the Binomial Distribution, but we don't know the true parameter $p$
  • say we flipped a beer cap 1000 times and got 576 reds: $\hat{p} = 0.576$
  • what is its statistical model? What is $p$ in $\text{Binomial}(1000, p)$?

Let's build a Confidence Interval for that

  • so we estimate $\hat{p} = 0.576$ and $\text{Var}(\hat{p}) = \cfrac{p(1-p)}{n} = \cfrac{p(1-p)}{1000}$


Result:

  • 95% CI is $[0.545, 0.607]$


In R:

phat = 0.576
z = qnorm(0.025, mean=0, sd=1, lower.tail=F)  # the right tail rather than the left
ME = z * sqrt(phat * (1 - phat) / 1000)  # margin of error: we replace p with phat
CI = phat + c(-ME, ME)  # 0.545, 0.606


Example 2

  • Calculate the 90% CI for $p$
  • With 60 successes out of 100 trials
phat = 0.6
cl = 0.9
al = (1 - cl) / 2  # 0.05 in each tail
z = qnorm(al, mean=0, sd=1, lower.tail=F)  # 1.645
n = 100

ME = z * sqrt(0.5 * 0.5 / n)  # conservative choice: p = q = 0.5
ci = phat + c(-ME, ME)
# [0.52, 0.68]


Sample Size

see Sample Size Estimation
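The basic idea, as a sketch: solve the margin of error formula $\beta = z_{\alpha/2} \sqrt{p(1-p)/n}$ for $n$, with the conservative $p = q = 0.5$. E.g. the smallest $n$ for a 95% CI with margin of error at most 0.03:

z = qnorm(0.975)  # 1.96
ceiling(z^2 * 0.5 * 0.5 / 0.03^2)  # 1068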


See Also

Sources