# ML Wiki

## Sampling Distribution

### Parameter Estimation

The goal of Inferential Statistics is to draw conclusions about the whole population based on a sample

• So we estimate the parameters based on sampled data
• With different samples (from the same population) we get different estimates of the same parameter, so the estimates have variability (sampling variability)
• The probability distribution of the parameter estimate is called the Sampling Distribution
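
The sampling variability described above can be illustrated with a small simulation. This is only a minimal sketch: the population (`population`, 10,000 draws from a normal distribution with mean 50 and standard deviation 10), the sample size of 30 and the number of samples are all assumptions chosen for illustration.

```r
set.seed(1)

# Hypothetical population: 10,000 values with a true mean close to 50
population = rnorm(10000, mean=50, sd=10)

# Draw 5 independent samples of size 30 and estimate the mean from each
estimates = replicate(5, mean(sample(population, 30)))

print(mean(population))  # the parameter we are trying to estimate
print(estimates)         # five different estimates of the same parameter
```

Each of the five values estimates the same population mean, yet they differ from one another, and this is the variability that the sampling distribution describes.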

### Sampling Distribution

The sampling distribution represents the distribution of point estimates based on samples of fixed size from the same population

• we can think of a particular Point Estimate as being drawn from the sampling distribution
• the Standard Error is the standard deviation of the sampling distribution, i.e. it measures how uncertain we are about our estimate
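
For the sample mean the standard error has a well-known closed form. The symbols below are introduced here for illustration and do not appear elsewhere on this page: $\hat{\theta}$ is a generic estimate, $\bar{x}$ is the mean of $n$ independent observations, and $\sigma$ is the population standard deviation.

$$\text{SE}(\hat{\theta}) = \sqrt{\text{Var}(\hat{\theta})}, \qquad \text{SE}(\bar{x}) = \frac{\sigma}{\sqrt{n}}$$

Under these assumptions, quadrupling the sample size (e.g. from $n = 50$ to $n = 200$) halves the standard error of the mean.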

## Examples

### Example 1

• Suppose we flip a coin 10 times and count the number of heads
• Our parameter of interest is $p = p(\text{heads})$
• $\hat{p}$ - estimate of $p$
• $\hat{p} = \cfrac{\text{# of heads}}{\text{total # of flips}}$
• i.e. $\hat{p}$ is calculated from data

    set.seed(134)
    # simulate 10 experiments of 10 flips each and count the heads in each one
    rbinom(10, size=10, prob=0.5)


We get different results each time:

| Trial | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 |
|---|---|---|---|---|---|---|---|---|---|---|
| Outcome | 4 | 6 | 7 | 4 | 5 | 3 | 4 | 6 | 3 | 6 |

Since we know that, theoretically, this Random Variable follows the Binomial Distribution, we can model the sampling distribution as

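    # probabilities of observing 0, 1, ..., 10 heads in 10 flips
    d = dbinom(0:10, size=10, prob=0.5)
    bp = barplot(d)
    # label each bar with the corresponding number of heads
    axis(side=1, at=bp, labels=0:10)

This sampling distribution is used for Binomial Proportion Confidence Intervals and for the Binomial Proportion Test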
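
As a rough check of the binomial model above, the simulated distribution of $\hat{p}$ can be compared with the theoretical probabilities. This is a sketch under stated assumptions: the number of repeated experiments (`n_reps = 10000`) and the variable names are chosen here for illustration.

```r
set.seed(42)

n_flips = 10      # flips per experiment, as in the example
n_reps  = 10000   # number of repeated experiments (an assumed value)

# Each experiment yields one estimate p_hat = (# of heads) / (# of flips)
heads = rbinom(n_reps, size=n_flips, prob=0.5)
p_hat = heads / n_flips

# Empirical relative frequency of each possible number of heads (0..10)
empirical = as.numeric(table(factor(heads, levels=0:n_flips))) / n_reps

# Theoretical probabilities from the Binomial Distribution
theoretical = dbinom(0:n_flips, size=n_flips, prob=0.5)

comparison = data.frame(p_hat=(0:n_flips) / n_flips, empirical, theoretical)
print(round(comparison, 3))

# The spread of the estimates should be close to sqrt(p * (1 - p) / n)
print(sd(p_hat))
print(sqrt(0.5 * 0.5 / n_flips))
```

The last two printed values compare the simulated standard error of $\hat{p}$ with the theoretical value $\sqrt{p(1-p)/n}$ for $p = 0.5$ and $n = 10$.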

### Example 2

    load(url('http://s3.amazonaws.com/assets.datacamp.com/course/dasi/ames.RData'))
    area = ames$Gr.Liv.Area

    # draw 5000 samples of size 50 and record the mean of each one
    sample_means50 = rep(NA, 5000)
    for (i in 1:5000) {
      samp = sample(area, 50)
      sample_means50[i] = mean(samp)
    }

    # the histogram of the sample means approximates the sampling distribution
    hist(sample_means50, breaks=13, probability=TRUE, col='orange',
         xlab='point estimates of mean', main='Sampling distribution of mean')

### Example: Running Mean

Here is another example, which shows that the more data we have, the more accurate our point estimates are

• A running mean (a cumulative 'Moving Average') is a sequence of means, where each following mean uses one extra observation
• If we start the running mean from 1 data point and keep including the next ones, it approaches the "true mean"

R code to produce the figure:

    library(openintro)
    data(run10Samp)
    time = run10Samp$time

    # running mean: the mean of the first x observations, for x = 1, ..., 100
    avg = sapply(X=1:100, FUN=function(x) { mean(time[1:x]) })
    plot(x=1:100, y=avg, type='l', col='blue',
         ylab='running mean', xlab='sample size', bty='n')
    # dashed reference line at the mean of all observations in time
    abline(h=mean(time), lty=2, col='grey')


This illustrates that the larger the sample size, the more precisely we can estimate the parameter
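
The effect of sample size can also be seen directly in the spread of the sampling distribution. The following is a minimal sketch, not taken from the examples above: the population (normal with mean 50 and standard deviation 10), the sample sizes and the number of replications are all assumptions made for illustration.

```r
set.seed(7)

# Hypothetical population with standard deviation close to 10
population = rnorm(100000, mean=50, sd=10)

sample_sizes = c(10, 100, 1000)

# For each sample size, draw 2000 samples and record the spread of their means
simulated_se = sapply(sample_sizes, function(n) {
  sd(replicate(2000, mean(sample(population, n))))
})

# Theoretical standard error of the mean: sd / sqrt(n)
theoretical_se = sd(population) / sqrt(sample_sizes)

print(data.frame(n=sample_sizes, simulated_se, theoretical_se))
```

In this setup the simulated standard error shrinks roughly in proportion to $1/\sqrt{n}$, so a 100 times larger sample gives an estimate that is about 10 times less variable.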