Goal of Inferential Statistics: to draw conclusions about a whole population based on a sample

- So we estimate the parameters based on sampled data
- If the estimate is a single number, we call it a Point Estimate

- With different samples (from the same population) we get different estimates of the same parameter, so there is *variability* (*sampling variability*) in the estimates
- The probability distribution of the parameter estimate is called the *Sampling Distribution*

The sampling distribution represents the distribution of point estimates based on samples of fixed size from the same population

- We can think of a particular Point Estimate as being drawn from the sampling distribution
- The **Standard Error** is a measure of this variability (i.e. how uncertain we are about our estimate)

- Suppose we flip a coin 10 times and count the number of heads
- Our parameter of interest is $p = p(\text{heads})$
- $\hat{p}$ - estimate of $p$
- $\hat{p} = \cfrac{\text{# of heads}}{\text{total # of flips}}$
- i.e. $\hat{p}$ is calculated from data

```r
set.seed(134)
rbinom(10, size=10, prob=0.5)
```

We get different results each time:

| Trial   | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 |
|---------|---|---|---|---|---|---|---|---|---|----|
| Outcome | 4 | 6 | 7 | 4 | 5 | 3 | 4 | 6 | 3 | 6  |
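The spread across these trials is exactly what the standard error measures. A minimal simulation sketch (the repetition count and seed here are my own illustrative choices, not from the original lab):

```r
set.seed(134)
# repeat the 10-flip experiment many times and estimate p each time
heads <- rbinom(10000, size = 10, prob = 0.5)
p_hat <- heads / 10
# the standard deviation of the estimates approximates the standard error
sd(p_hat)               # empirical standard error
sqrt(0.5 * 0.5 / 10)    # theoretical SE of a proportion: sqrt(p*(1-p)/n)
```

The empirical and theoretical values should agree closely with this many repetitions.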

Since we know that theoretically this Random Variable follows a Binomial Distribution, we can model the sampling distribution as:

```r
# theoretical sampling distribution of the number of heads in 10 flips
# (the support is 0:10 - zero heads is also possible)
d = dbinom(0:10, size=10, prob=0.5)
bp = barplot(d)
axis(side=1, at=bp, labels=0:10)
```

This sampling distribution is used for Binomial Proportion Confidence Intervals and for the Binomial Proportion Test
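For example, base R's `binom.test` gives an exact (Clopper-Pearson) confidence interval and test for a proportion; shown here for one of the trials above, 6 heads out of 10 flips:

```r
# exact CI and two-sided test of H0: p = 0.5, given 6 heads in 10 flips
res = binom.test(x = 6, n = 10, p = 0.5)
res$conf.int    # 95% CI for p - very wide with only 10 flips
res$p.value     # large: no evidence against p = 0.5
```

With so few flips the interval is wide and the test has little power, which ties back to the role of sample size below.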

- note that as the sample size grows it becomes more reasonable to use the Normal Approximation
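A quick check of this (the choice of $n = 100$ is just for illustration): the Binomial pmf is close to a Normal density with mean $np$ and standard deviation $\sqrt{np(1-p)}$:

```r
# compare the exact Binomial pmf with its Normal approximation
n = 100; p = 0.5
k = 0:n
exact  = dbinom(k, size = n, prob = p)
approx = dnorm(k, mean = n * p, sd = sqrt(n * p * (1 - p)))
max(abs(exact - approx))   # the two curves nearly coincide for large n
```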

```r
load(url('http://s3.amazonaws.com/assets.datacamp.com/course/dasi/ames.RData'))
area = ames$Gr.Liv.Area

# build the sampling distribution of the mean from 5000 samples of size 50
sample_means50 = rep(NA, 5000)
for (i in 1:5000) {
  samp = sample(area, 50)
  sample_means50[i] = mean(samp)
}

hist(sample_means50, breaks=13, probability=T, col='orange',
     xlab='point estimates of mean', main='Sampling distribution of mean')
```

Another example shows that the more data we have, the more accurate our point estimates become

- A *running mean* (or *moving average*) is a sequence of means, where each successive mean includes one extra observation
- If we start the running mean from 1 data point and keep including the next ones, it approaches the "true mean"

R code to produce the figure

```r
library(openintro)
data(run10Samp)
time = run10Samp$time

# running mean over the first x observations, for x = 1..100
avg = sapply(X=1:100, FUN=function(x) mean(time[1:x]))

plot(x=1:100, y=avg, type='l', col='blue',
     ylab='running mean', xlab='sample size', bty='n')
abline(h=mean(time), lty=2, col='grey')
```

This illustrates that the larger the sample size, the better we can estimate the parameter
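The reason is the standard error of the sample mean, which shrinks with the sample size (a standard result, not derived in these notes):

$$\text{SE}(\bar{x}) = \frac{\sigma}{\sqrt{n}} \approx \frac{s}{\sqrt{n}}$$

so, for example, quadrupling the sample size halves the standard error.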

- the Normal Distribution is used to model the sampling distribution when the sample is large (or the population standard deviation is known)
- the t Distribution is used when the standard deviation is estimated from a small sample

- OpenIntro Statistics (book)
- Statistics: Making Sense of Data (coursera)
- DataCamp, Lab 3A - Sampling distributions [1]
- https://en.wikipedia.org/wiki/Sampling_distribution