ML Wiki
Machine Learning Wiki - A collection of ML concepts, algorithms, and resources.

Simulation For Proportions

Sometimes Statistical Inference can be done without a theoretical model, using brute force instead: we generate the data ourselves.

Consider the test for Proportions.

One-Sample Test

The setup is the same as the One-Sample test in the normal approximation models:

  • we have a sample and want to check whether the true proportion agrees with some hypothesized value $p_0$
  • that is, we check whether the data we observed are consistent with this hypothesis

Test

  • $H_0: p = p_0$
  • $H_A: p \ne p_0$ or $H_A: p < p_0$ or $H_A: p > p_0$
  • $p$ - the true proportion, $p_0$ - the null value

But instead of using some theoretical model,

  • we ourselves generate the null distribution
  • and then see how unusual the observed value is w.r.t. the generated null distr.
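The two steps above can be sketched in R. This is only an illustration under stated assumptions: `sim.prop.test` is a hypothetical helper (not a base R function), the default `m = 10000` is an arbitrary choice, and the two-sided branch uses one possible convention (simulated counts at least as far from the expected count $n p_0$ as the observed one).

```r
# Hypothetical helper: simulation-based one-sample test for a proportion.
# x - observed number of successes, n - sample size, p0 - null value,
# m - number of simulated samples (arbitrary default).
sim.prop.test = function(x, n, p0,
                         alternative = c("less", "greater", "two.sided"),
                         m = 10000) {
  alternative = match.arg(alternative)
  # generated null distribution: m success counts drawn from Binomial(n, p0)
  sim = rbinom(n = m, size = n, prob = p0)
  if (alternative == "less") {
    mean(sim <= x)
  } else if (alternative == "greater") {
    mean(sim >= x)
  } else {
    # assumption: "as unusual" means at least as far from n * p0 as x is
    # (small tolerance guards against floating-point ties)
    mean(abs(sim - n * p0) >= abs(x - n * p0) - 1e-9)
  }
}

set.seed(1)  # arbitrary seed, only for reproducibility
# made-up data: 7 successes in 40 trials, tested against p0 = 0.3
sim.prop.test(x = 7, n = 40, p0 = 0.3, alternative = "less")
```

Comparing success counts rather than proportions avoids rounding problems when checking whether a simulated value is "at or below" the observed one.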

Example

Consider the following example:

  • a medical consultant helps patients
  • he claims that with his help the rate of complications is lower than usual
    • (i.e. lower than 0.10)
  • is it true?

We want to test a hypothesis:

  • $H_0: p_A = 0.10$ - the complication rate with the specialist equals the usual rate without one
  • $H_A: p_A < 0.10$ - the specialist helps, the complication rate is lower than usual
  • note that we cannot really verify the causal claim, because the data come from an Observational Study - for that we would need to conduct a Statistical Experiment

Observed data:

  • 3 complications in 62 cases
  • $\hat{p} = 3/62 \approx 0.048$
  • is it only due to chance?

Normal Model

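The normal approximation is questionable here: under $H_0$ we expect only $62 \cdot 0.10 = 6.2$ complications, below the usual rule of thumb of at least 10 expected successes.

What can we do?

  • There is still a way to evaluate the $p$-value under $H_0: p_A = 0.10$ - via simulations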
  • Simulate many draws from the population and build a Sampling Distribution (under $H_0$)
  • then compute the probability of observing such $\hat{p}$ in this distribution

Test

  • Assume that the help of the specialist gives nothing
  • i.e. 10% of cases will still have complications
  • under this assumption we simulate 62 clients
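A single simulated set of 62 clients under this assumption might look as follows. This is a minimal sketch; the seed and the variable names `clients` and `p.hat.sim` are chosen here for illustration only.

```r
set.seed(2024)  # arbitrary seed, only for reproducibility
n = 62
p0 = 0.10

# one simulated client per entry: 1 = complication, 0 = no complication
clients = rbinom(n = n, size = 1, prob = p0)

# simulated proportion of complications in this one sample
p.hat.sim = mean(clients)
p.hat.sim

# equivalent shortcut: draw the number of complications directly
rbinom(n = 1, size = n, prob = p0) / n
```

Each run of this kind yields one value of $\hat{p}_\text{sim}$; repeating it many times gives the generated null distribution.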

Simulation

  • repeat many times (e.g. 5-10k) to build a Sampling Distribution
    • draw a sample from the Binomial Distribution with $p=0.10$ and $n=62$
    • calculate $\hat{p}_\text{sim}$ from this sample
  • draw a histogram
  • and shade the bars that support $H_A$ - the ones with $\hat{p}_\text{sim} \le 0.048$
  • the shaded area represents the $p$-value - the probability of observing a $\hat{p}$ at least this small due to chance alone
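Because the number of complications under $H_0$ follows a Binomial Distribution exactly, the simulated tail area can be cross-checked against the exact tail probability. The sketch below assumes base R's `pbinom`; the names `p.le3` and `p.le2` are illustrative. The answer depends on whether samples with exactly 3 complications are counted, since $3/62 \approx 0.0484$ lies slightly above the rounded value 0.048.

```r
n = 62
p0 = 0.10

# P(X <= 3): at most 3 complications, i.e. p.hat.sim <= 3/62
p.le3 = pbinom(3, size = n, prob = p0)
p.le3

# P(X <= 2): what the threshold p.hat.sim <= 0.048 counts,
# because 3/62 (about 0.0484) is above 0.048
p.le2 = pbinom(2, size = n, prob = p0)
p.le2
```

The two probabilities are roughly 0.12 and 0.045, so the choice of threshold matters for the conclusion at the 0.05 level.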

[Figure: histogram of the Sampling Distribution we obtained]

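Of the 10k draws, 487 turned out to be at or below the threshold 0.048

  • which means the $p$-value is $487/10000 = 0.0487 < 0.05$
  • so we reject $H_0$ in favor of $H_A$ and conclude that there is indeed some relation between the participation of the consultant and the complication rate
  • caveat: 0.048 is $3/62 \approx 0.0484$ rounded down, so draws with exactly 3 complications are not counted here; counting them gives a much larger $p$-value (the exact binomial probability $P(X \le 3)$ is about 0.12), in which case $H_0$ would not be rejected at the 0.05 level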

R code:

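```r
n = 62      # sample size (number of cases)
p = 0.10    # complication rate under H0
m = 10000   # number of simulated samples

set.seed(31313)
# each draw is a number of complications among n cases, turned into a proportion
samp.dist = rbinom(n=m, size=n, prob=p) / n

# note: 0.048 is 3/62 rounded down, so `<=` skips draws with exactly
# 3 complications; use p.hat = 3/62 to count them as well
p.hat = 0.048
sum(samp.dist <= p.hat)
p.val = sum(samp.dist <= p.hat) / length(samp.dist)
p.val

# bin the simulated proportions and label each bin by its mean value
ac = cut(samp.dist, breaks=18)
means = tapply(samp.dist, ac, mean)
levels(ac) = round(means, digits=3)

tbl = table(ac) / length(samp.dist)
tbl
cl = rep('grey', length(tbl))
cl[1:4] = 'black'  # shade the first four (leftmost) bars

barplot(tbl, col=cl, las=2)
```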

Sources