ML Wiki
Machine Learning Wiki - A collection of ML concepts, algorithms, and resources.

Standard Error


Parameter Estimation

Goal of Inferential Statistics - to draw conclusions about the whole population based on a sample

  • So we estimate the parameters based on sampled data
  • With different samples (from the same population) we get different estimates of the same parameter, i.e. there is "variability" ("sampling variability") in the estimates
  • The probability distribution of a parameter estimate is called the "Sampling Distribution"
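
To make this concrete, here is a minimal Python sketch (with a made-up normal "population"; all names and values are illustrative, not from the original) showing that repeated samples from the same population yield different estimates of the same parameter:

```python
import random
import statistics

random.seed(0)

# A made-up population: 100,000 values with true mean 170
population = [random.gauss(170, 10) for _ in range(100_000)]

# Each sample of size 50 gives a different point estimate of the mean --
# this spread across samples is the sampling variability
estimates = [statistics.mean(random.sample(population, 50)) for _ in range(5)]
print(estimates)
```

All five printed estimates hover around the true mean of 170, but no two are the same.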

Sampling Distribution

The sampling distribution represents the distribution of point estimates based on samples of fixed size from the same population

  • we can think of a particular Point Estimate as being drawn from the sampling distribution
  • and Standard Error is the measure of this variability (i.e. how uncertain we are about our estimate)
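
For the sample mean this can be checked by simulation: the standard deviation of many sample means (the empirical standard error) matches the theoretical value $\sigma / \sqrt{n}$. A Python sketch, assuming a normal population (the numbers here are illustrative):

```python
import random
import statistics

random.seed(1)
n = 50
sigma = 10

# Simulate the sampling distribution of the mean: 2000 samples of size n
means = [statistics.mean(random.gauss(0, sigma) for _ in range(n))
         for _ in range(2000)]

empirical_se = statistics.stdev(means)   # spread of the point estimates
theoretical_se = sigma / n ** 0.5        # sigma / sqrt(n), about 1.414
print(empirical_se, theoretical_se)
```

The two values agree closely, which is exactly what "standard error" means: the standard deviation of the sampling distribution.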

Examples

Example 1

  • Suppose we flip a coin 10 times and count the number of heads
  • Our parameter of interest is $p = p(\text{heads})$
  • $\hat{p}$ - estimate of $p$
    • $\hat{p} = \cfrac{\text{# of heads}}{\text{total # of flips}}$
    • i.e. $\hat{p}$ is calculated from data

```text only
set.seed(134)
rbinom(10, size=10, prob=0.5)
```

We get different results each time:

|  Trial  |  1  |  2  |  3  |  4  |  5  |  6  |  7  |  8  |  9  |  10  |
|---------|-----|-----|-----|-----|-----|-----|-----|-----|-----|------|
| Outcome |  4  |  6  |  7  |  4  |  5  |  3  |  4  |  6  |  3  |  6   |

Since we know that theoretically this [Random Variable](Random_Variable) follows [Binomial Distribution](Binomial_Distribution), we can model the sampling distribution as

```text only
d = dbinom(1:10, size=10, prob=0.5)
bp = barplot(d)
axis(side=1, at=bp[1:10], labels=1:10)
```

(figure: barplot of the Binomial(10, 0.5) sampling distribution)

This sampling distribution is used for Binomial Proportion Confidence Intervals and for the Binomial Proportion Test.
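
As a rough sketch of how the sampling distribution feeds into an interval estimate, here is the normal-approximation (Wald) 95% confidence interval for a proportion, with made-up counts (62 heads in 100 flips; the numbers are illustrative):

```python
import math

# Made-up data: 62 heads in 100 flips
heads, n = 62, 100
p_hat = heads / n

# Wald 95% interval: p_hat +/- 1.96 * sqrt(p_hat * (1 - p_hat) / n)
se = math.sqrt(p_hat * (1 - p_hat) / n)
lower, upper = p_hat - 1.96 * se, p_hat + 1.96 * se
print(f"p_hat = {p_hat:.2f}, 95% CI = ({lower:.3f}, {upper:.3f})")
```

Here `se` is the estimated standard error of $\hat{p}$; the approximation works poorly for very small $n$ or for $\hat{p}$ near 0 or 1.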

Example 2

```text only
load(url('http://s3.amazonaws.com/assets.datacamp.com/course/dasi/ames.RData'))
area = ames$Gr.Liv.Area
sample_means50 = rep(NA, 5000)

for (i in 1:5000) {
  samp = sample(area, 50)
  sample_means50[i] = mean(samp)
}

hist(sample_means50, breaks=13, probability=T, col='orange',
     xlab='point estimates of mean', main='Sampling distribution of mean')
```

(figure: histogram of 5,000 sample means of size 50)
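
Running the same loop at several sample sizes shows the spread of the sampling distribution shrinking as $n$ grows. A Python sketch with a synthetic stand-in for the `area` variable (the original downloads the Ames data; the population parameters below are made up):

```python
import random
import statistics

random.seed(3)

# Synthetic stand-in for `area`: 10,000 values, mean ~1500, sd ~500
area = [random.gauss(1500, 500) for _ in range(10_000)]

def sampling_sd(sample_size, reps=2000):
    """Std. dev. of the simulated sampling distribution of the mean."""
    means = [statistics.mean(random.sample(area, sample_size))
             for _ in range(reps)]
    return statistics.stdev(means)

# The spread shrinks roughly like 1 / sqrt(n)
print(sampling_sd(10), sampling_sd(50), sampling_sd(100))
```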

Example: Running Mean

There’s another example showing that the more data we have, the more accurate our point estimates are:

  • A "running mean" (or "moving average") is a sequence of means, where each successive mean includes one extra observation
  • If we start the running mean from 1 data point and keep including the next ones, it approaches the "true mean"
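
The steps above can be sketched in Python (with a made-up normal data stream whose true mean is 5): the k-th running mean is just the cumulative sum of the first k observations divided by k.

```python
import itertools
import random

random.seed(2)
data = [random.gauss(5, 2) for _ in range(100)]

# k-th running mean = (x_1 + ... + x_k) / k
running = [s / k for k, s in enumerate(itertools.accumulate(data), start=1)]

# Early values are noisy; later ones settle near the true mean of 5
print(running[0], running[9], running[-1])
```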

(figure: running mean converging to the true mean as sample size grows)

R code to produce the figure:

```text only
library(openintro)
data(run10Samp)
time = run10Samp$time
avg = sapply(X=1:100, FUN=function(x) { mean(time[1:x]) })
plot(x=1:100, y=avg, type='l', col='blue',
     ylab='running mean', xlab='sample size', bty='n')
abline(h=mean(time), lty=2, col='grey')
```

This illustrates that the larger the sample size, the better we can estimate the parameter.

Typical Sampling Distributions
