Sampling Distribution

Parameter Estimation

Goal of Inferential Statistics - to draw conclusions about the whole population based on a sample

  • So we estimate the parameters based on sampled data
  • With different samples (from the same population) we get different estimates of the same parameter - so there is variability (sampling variability) in the estimates
  • The probability distribution of a parameter estimate is called the Sampling Distribution


Sampling Distribution

The sampling distribution is the distribution of point estimates based on samples of a fixed size from the same population

  • we can think of a particular Point Estimate as a draw from the sampling distribution
  • the Standard Error is the measure of its variability (i.e. how uncertain we are about our estimate)

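For the sample mean, the standard error can be estimated from a single sample as sd(x) / sqrt(n). A minimal sketch with simulated data (the sample x and its parameters here are hypothetical):

```r
set.seed(42)
x = rnorm(100, mean = 10, sd = 2)   # a hypothetical sample of size 100
se = sd(x) / sqrt(length(x))        # estimated standard error of the sample mean
se                                  # roughly sd / sqrt(n) = 2 / 10 = 0.2
```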

Examples

Example 1

  • Suppose we flip a coin 10 times and count the number of heads
  • Our parameter of interest is $p = p(\text{heads})$
  • $\hat{p}$ - estimate of $p$
    • $\hat{p} = \cfrac{\text{# of heads}}{\text{total # of flips}}$
    • i.e. $\hat{p}$ is calculated from data
set.seed(134)
rbinom(10, size=10, prob=0.5)  # 10 trials of 10 flips each; returns the number of heads per trial

We get different results each time:

Trial    1  2  3  4  5  6  7  8  9  10
Outcome  4  6  7  4  5  3  4  6  3  6
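Each trial gives one point estimate $\hat{p}$ - the observed number of heads divided by the number of flips. A sketch using the outcomes from the table above:

```r
heads = c(4, 6, 7, 4, 5, 3, 4, 6, 3, 6)  # outcomes from the table above
p_hat = heads / 10                       # one point estimate of p per trial
p_hat                                    # the estimates vary from sample to sample
```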


Since we know that theoretically this Random Variable follows a Binomial Distribution, we can model the sampling distribution as

d = dbinom(0:10, size=10, prob=0.5)  # P(# of heads = k) for k = 0..10 (0 heads is possible too)
bp = barplot(d)
axis(side=1, at=bp, labels=0:10)

b3900183fe9f478fadf895deed1d0d56.png


This sampling distribution is used for Binomial Proportion Confidence Intervals and for the Binomial Proportion Test
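In R, both are available via binom.test. A sketch with hypothetical counts (7 heads out of 10 flips):

```r
# Binomial Proportion Test: is p = 0.5 consistent with 7 heads in 10 flips?
res = binom.test(x = 7, n = 10, p = 0.5)
res$p.value    # large p-value: no evidence against p = 0.5
res$conf.int   # exact Binomial Proportion Confidence Interval for p
```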


Example 2

load(url('http://s3.amazonaws.com/assets.datacamp.com/course/dasi/ames.RData'))
area = ames$Gr.Liv.Area
sample_means50 = rep(NA, 5000)
 
for (i in 1:5000) {
  samp = sample(area, 50)          # draw a sample of size 50 from the population
  sample_means50[i] = mean(samp)   # record its point estimate of the mean
}

hist(sample_means50, breaks=13, probability=T, col='orange',
     xlab='point estimates of mean', main='Sampling distribution of mean')

d9a06a02d0944fb495c81c29daa29047.png
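The spread of this simulated sampling distribution matches the theoretical standard error sd / sqrt(n). A self-contained sketch (using a simulated population in place of the ames dataset, which requires a download):

```r
set.seed(1)
population = rnorm(10000, mean = 1500, sd = 500)      # hypothetical stand-in for the areas
sample_means = replicate(5000, mean(sample(population, 50)))

sd(sample_means)            # empirical standard error of the sample mean
sd(population) / sqrt(50)   # theoretical standard error - close to the value above
```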


Example: Running Mean

Another example shows that the more data we have, the more accurate our point estimates are

  • A running mean (or 'Moving Average') is a sequence of means, where each subsequent mean uses one extra observation
  • If we start the running mean from 1 data point and keep including the next ones, it approaches the "true mean"

454073b0ac4149c789916b3dba2c61c6.png


R code to produce the figure  
library(openintro)
data(run10Samp)
time = run10Samp$time
avg = sapply(X=1:100, FUN=function(x) { mean(time[1:x]) })  # running mean over the first x observations
plot(x=1:100, y=avg, type='l', col='blue',
     ylab='running mean', xlab='sample size', bty='n')
abline(h=mean(time), lty=2, col='grey')  # the "true mean" of the full sample


This illustrates that the larger the sample size, the better we can estimate the parameter
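The same convergence can be checked numerically. A sketch with simulated data (the true mean of 5 is an assumption of the simulation):

```r
set.seed(7)
x = rnorm(1000, mean = 5)            # 1000 draws with true mean 5
running = cumsum(x) / seq_along(x)   # running mean after each observation

abs(running[10] - 5)     # estimation error after 10 observations
abs(running[1000] - 5)   # typically much smaller after 1000 observations
```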


Typical Sampling Distributions


Sources