Sampling Distribution
Parameter Estimation
Goal of Inferential Statistics: to make conclusions about the whole population based on a sample
- So we estimate the parameters based on sampled data
- If the estimate is just one number, we call it a Point Estimate
- With different samples (from the same population) we get different estimates of the same parameter, so we have *variability* (*sampling variability*) in the estimates (see the sketch after this list)
- The probability distribution of the parameter estimate is called the *Sampling Distribution*
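A minimal sketch of this variability, assuming a hypothetical population (the normally distributed heights below are an assumption made only for illustration):

```r
# hypothetical population, assumed only for this illustration
set.seed(1)
population = rnorm(100000, mean=170, sd=10)

# three samples of the same size give three different point estimates
mean(sample(population, 50))
mean(sample(population, 50))
mean(sample(population, 50))
```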
Sampling Distribution
The sampling distribution represents the distribution of point estimates based on samples of a fixed size from the same population
- We can think of a particular Point Estimate as one draw from the sampling distribution
- The Standard Error is the standard deviation of the sampling distribution, i.e. a measure of how uncertain we are about our estimate (see the sketch below)
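As a sketch (reusing the same hypothetical normal population as above), the standard error can be estimated as the standard deviation of many simulated point estimates:

```r
# hypothetical population, assumed only for this illustration
set.seed(1)
population = rnorm(100000, mean=170, sd=10)

# simulate the sampling distribution of the mean for samples of size 50
sample_means = replicate(5000, mean(sample(population, 50)))

sd(sample_means)          # empirical standard error
sd(population) / sqrt(50) # theoretical standard error of the mean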
Examples
Example 1
- Suppose we flip a coin 10 times and count the number of heads
- Our parameter of interest is $p = p(\text{heads})$
- $\hat{p}$ - estimate of $p$
- $\hat{p} = \cfrac{\text{# of heads}}{\text{total # of flips}}$
- i.e. $\hat{p}$ is calculated from data
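For instance, one simulated sample of 10 flips yields one point estimate (a sketch; the seed is arbitrary):

```r
set.seed(42)                          # arbitrary seed
flips = rbinom(10, size=1, prob=0.5)  # one sample: 10 coin flips (1 = heads)
p_hat = sum(flips) / length(flips)    # estimate of p from this sample
p_hat
```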
```r
set.seed(134)
rbinom(10, size=10, prob=0.5)
```
We get different results each time:
| Trial   | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 |
|---------|---|---|---|---|---|---|---|---|---|----|
| Outcome | 4 | 6 | 7 | 4 | 5 | 3 | 4 | 6 | 3 | 6  |
Since we know that theoretically this [Random Variable](Random_Variable) follows the [Binomial Distribution](Binomial_Distribution), we can model the sampling distribution as:
```r
# theoretical sampling distribution of the number of heads in 10 flips
# (0:10 includes the possibility of zero heads)
d = dbinom(0:10, size=10, prob=0.5)
bp = barplot(d)
axis(side=1, at=bp, labels=0:10)
```
This sampling distribution is used for Binomial Proportion Confidence Intervals and for the Binomial Proportion Test
- Note that as the sample size grows, it becomes more reasonable to use the Normal Approximation (see the sketch below)
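A sketch of that approximation (n = 100 here is an arbitrary choice): the normal density with mean $np$ and standard deviation $\sqrt{np(1-p)}$ closely tracks the binomial probabilities:

```r
n = 100; p = 0.5
x = 0:n
plot(x, dbinom(x, size=n, prob=p), type='h', col='orange',
     xlab='# of heads', ylab='probability')
curve(dnorm(x, mean=n*p, sd=sqrt(n*p*(1-p))), add=TRUE, col='blue')
```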
Example 2
```r
# load the Ames housing data and take the living area variable
load(url('http://s3.amazonaws.com/assets.datacamp.com/course/dasi/ames.RData'))
area = ames$Gr.Liv.Area

# draw 5000 samples of size 50 and record the mean of each
sample_means50 = rep(NA, 5000)
for (i in 1:5000) {
  samp = sample(area, 50)
  sample_means50[i] = mean(samp)
}

hist(sample_means50, breaks=13, probability=T, col='orange',
     xlab='point estimates of mean', main='Sampling distribution of mean')
```
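As a quick sanity check (a sketch, not part of the original lab), the spread of these sample means should be close to the theoretical standard error $\sigma/\sqrt{n}$:

```r
mean(sample_means50)      # should be close to mean(area)
sd(sample_means50)        # empirical standard error of the mean
sd(area) / sqrt(50)       # theoretical standard error
```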
Example: Running Mean
Another example shows that the more data we have, the more accurate our point estimate is
- A *running mean* (or *moving average*) is a sequence of means, where each subsequent mean includes one extra observation
- If we start the running mean from 1 data point and keep adding observations, it approaches the true mean
R code to produce the figure:
```r
library(openintro)
data(run10Samp)
time = run10Samp$time

# running mean over the first x observations, for x = 1..100
avg = sapply(X=1:100, FUN=function(x) { mean(time[1:x]) })

plot(x=1:100, y=avg, type='l', col='blue',
     ylab='running mean', xlab='sample size', bty='n')
abline(h=mean(time), lty=2, col='grey')
```
So it illustrates that the larger the sample size, the better we can estimate the parameter
Typical Sampling Distributions
- Normal Distribution: for the sample mean when the population variance is known or the sample size is large (by the Central Limit Theorem)
- t Distribution: for the sample mean when the population variance is estimated from the sample, especially for small samples
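A small sketch of the difference: the t distribution has heavier tails than the standard normal and approaches it as the degrees of freedom grow:

```r
curve(dnorm(x), from=-4, to=4, col='blue', ylab='density')
curve(dt(x, df=5), add=TRUE, col='red')        # heavier tails
curve(dt(x, df=30), add=TRUE, col='darkgreen') # close to the normal
legend('topright', legend=c('N(0, 1)', 't, df = 5', 't, df = 30'),
       col=c('blue', 'red', 'darkgreen'), lty=1)
```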
Sources
- OpenIntro Statistics (book)
- Statistics: Making Sense of Data (Coursera)
- DataCamp, Lab 3A: Sampling distributions, http://rpubs.com/agrigorev/21595
- https://en.wikipedia.org/wiki/Sampling_distribution