Sampling Distribution

Parameter Estimation

Goal of Inferential Statistics - to draw conclusions about the whole population based on a sample

  • So we estimate the parameters based on sampled data
  • And with different samples (from the same population) we get different estimates of the same parameter - so there is variability (sampling variability) in the estimates
  • The probability distribution of the parameter estimate is called the Sampling Distribution

Sampling Distribution

The sampling distribution represents the distribution of point estimates based on samples of fixed size from the same population

  • we can think of a particular Point Estimate as a draw from the sampling distribution
  • and the Standard Error is the measure of its variability (i.e. how uncertain we are about our estimate)
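
As a minimal sketch of this idea (the sample size and number of repetitions are chosen arbitrarily), we can repeatedly draw an estimate of a coin's probability of heads and check that the spread of those estimates matches the theoretical standard error $\sqrt{p(1-p)/n}$:

```r
set.seed(1)

# draw 1000 point estimates of p, each from a sample of fixed size n = 100
phat = rbinom(1000, size=100, prob=0.5) / 100

sd(phat)               # empirical standard error of the estimates
sqrt(0.5 * 0.5 / 100)  # theoretical standard error sqrt(p(1-p)/n) = 0.05
```

The two numbers agree closely: the standard error is just the standard deviation of the sampling distribution.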


Example 1

  • Suppose we flip a coin 10 times and count the number of heads
  • Our parameter of interest is $p = p(\text{heads})$
  • $\hat{p}$ - estimate of $p$
    • $\hat{p} = \cfrac{\text{# of heads}}{\text{total # of flips}}$
    • i.e. $\hat{p}$ is calculated from data
rbinom(10, size=10, prob=0.5)  # simulate 10 trials of 10 coin flips each

We get different results each time:

Trial    1  2  3  4  5  6  7  8  9  10
Outcome  4  6  7  4  5  3  4  6  3   6
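
Each outcome in the table above gives one point estimate $\hat{p}$, and averaging them shows how close they center on the true $p = 0.5$:

```r
# outcomes from the table: number of heads in each of the 10 trials
outcomes = c(4, 6, 7, 4, 5, 3, 4, 6, 3, 6)

phat = outcomes / 10  # one estimate of p per trial
phat                  # 0.4 0.6 0.7 0.4 0.5 0.3 0.4 0.6 0.3 0.6
mean(phat)            # 0.48, close to the true p = 0.5
```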

Since we know that theoretically this Random Variable follows a Binomial Distribution, we can model the sampling distribution as

d = dbinom(0:10, size=10, prob=0.5)  # support is 0..10 heads, not 1..10
bp = barplot(d)
axis(side=1, at=bp, labels=0:10)


This sampling distribution is used for Binomial Proportion Confidence Intervals and for the Binomial Proportion Test
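
For instance, base R's exact binomial test uses exactly this distribution. As an illustration (the counts here are made up), suppose a coin showed 7 heads in 10 flips and we want to test whether it is fair:

```r
# exact two-sided test of H0: p = 0.5, given 7 heads in 10 flips
binom.test(x=7, n=10, p=0.5)
# p-value ~ 0.34, so we cannot reject the hypothesis that the coin is fair
```

The reported p-value is just the probability, under the Binomial(10, 0.5) sampling distribution, of outcomes at least as extreme as the one observed.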

Example 2

area = ames$Gr.Liv.Area
sample_means50 = rep(NA, 5000)
for (i in 1:5000) {
  samp = sample(area, 50)
  sample_means50[i] = mean(samp)
}

hist(sample_means50, breaks=13, probability=TRUE, col='orange',
     xlab='point estimates of mean', main='Sampling distribution of mean')
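
We can also check that the spread of these 5000 sample means matches the theoretical standard error $\sigma/\sqrt{n}$. Since the `ames` data may not be loaded, this sketch substitutes a synthetic population of the same size (2930 homes, with an assumed mean and standard deviation) for `ames$Gr.Liv.Area`:

```r
set.seed(42)

# synthetic stand-in for ames$Gr.Liv.Area (assumed mean/sd, for illustration only)
population = rnorm(2930, mean=1500, sd=500)

sample_means50 = replicate(5000, mean(sample(population, 50)))

sd(sample_means50)         # empirical standard error of the mean
sd(population) / sqrt(50)  # theoretical standard error sigma/sqrt(n)
```

The two values are close, which is what the histogram's width reflects.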


Example: Running Mean

Another example shows that the more data we have, the more accurate our point estimates are

  • A running mean (or "Moving Average") is a sequence of means, where each successive mean includes one extra observation
  • If we start the running mean at 1 data point and keep adding observations, it approaches the "true mean"


R code to produce the figure  
time = run10Samp$time
avg = sapply(X=1:100, FUN=function(x) { mean(time[1:x]) })
plot(x=1:100, y=avg, type='l', col='blue',
     ylab='running mean', xlab='sample size', bty='n')
abline(h=mean(time), lty=2, col='grey')

This illustrates that the larger the sample size, the better we can estimate the parameter

Typical Sampling Distributions