Sample Size

When we need to estimate the size of a sample?


Confidence Intervals

Binomial Proportion Confidence Intervals

The best $n$?

  • CI for $p$
    • $\beta = z_{\alpha/2} \sqrt{p(1-p)/n}$,
  • we want margin error be $\beta = 0.03$
  • for 95% CI $\alpha = 0.025$

We plug everything in and calculate

  • $0.03 = 1.96 \sqrt{p(1-p)/n}$
  • use $p = 0.5$ (it's the worst-case scenario)
  • $n = \left(\cfrac{1.96 \cdot 0.5}{0.03}\right)^2 \approx 1067$

What if we want margin error 0.01?

  • $0.01 = 1.96 \sqrt{0.5 \cdot 0.5 / n}$ or
  • $n \approx 9604 = 9 \cdot 1067$ !

To cut the margin error in half we need 4 times bigger sample size!

    • So lovering the margin is expensive


What if we want 99% CI instead of 95%?

  • $z_{\alpha/2} = z_{0.005} \approx 2.576$
  • $0.03 = 2.576 \sqrt{0.5 \cdot 0.5 / n}$
  • $n \approx 1843$

For 90% CI $n \approx 753$



R

Example 1

p = 0.5
ME = 0.03
z = qnorm(0.025, mean=0, sd=1, lower.tail=F)
n = z^2 * p * (1 - p) / ME^2


Example 2:

  • What sample size is needed to attain a margin error of 0.5% for 99% CI?
p = 0.5 // worst-case estimate
ME = 0.005
cl = 0.99
al = (1 - cl) / 2

z = qnorm(al, mean=0, sd=1, lower.tail=F)
n = z^2 * p * (1 - p) / ME^2


Controlling False Negatives

Sample Size controls Type II Errors - False Negatives


$Z$ Statistics for Means

Suppose we want to have a 95% confidence interval

  • $Z = 1.96$
  • $\text{ME}_{0.95} = Z \cdot \text{SE} = 1.96 \cfrac{\sigma}{\sqrt{n}}$
  • we want $\text{ME} \leqslant 4$
  • so $1.96 \cdot \cfrac{\sigma}{\sqrt{n}} \leqslant 4$, and we want to get $n$ from this inequality
    • NOTE: need to know $\sigma$, otherwise we should use $T$ statistics instead of $Z$ and estimate $\sigma$ by $s$
  • e.g. suppose that we know that the whole country $\sigma$ is 25, so it might be a good estimate for $\sigma$ within a company


We get:

  • $1.96 \cdot \cfrac{\sigma}{\sqrt{n}} \approx 1.96 \cdot \cfrac{25}{\sqrt{n}}$
  • $1.96 \cdot \cfrac{25}{4} \leqslant \sqrt{n}$
  • $\left( 1.96 \cdot \cfrac{25}{4} \right)^2 \leqslant n$
  • $150.06 \leqslant n$

$\Rightarrow$ we need $n \geqslant 151$ to have ME of 4


Stuff

TODO: Interesting Links


Caveat, with such a large sample size, one would expect even tiny differences in occurrence to produce "strong" significance levels. See Why does frequentist hypothesis testing become biased towards rejecting the null hypothesis with sufficiently large samples?

http://stats.stackexchange.com/questions/108911/why-does-frequentist-hypothesis-testing-become-biased-towards-rejecting-the-null/

I've seen this claim before in Karl Friston's 2012 paper in NeuroImage, where he calls it the fallacy of classical inference.


Sources