Chi-Squared Goodness of Fit Test
This is one of the $\chi^2$ tests:
- one-way table tests - for testing Frequency Tables (this one)
- two-way table tests - for testing Contingency Tables, see the Chi-Squared Test of Independence
$\chi^2$ One-Way Table Test
This is a method for assessing a null model when the data are binned
Used when:
- given a sample of cases that can be classified into several groups, we want to determine if the sample is representative of the general population
- we want to evaluate if the data resemble some distribution, e.g. normal or geometric ("Goodness of Fit")
Idea
Goodness of Fit test:
- suppose we have a variable with $n$ modalities
- it can be a categorical variable with $n$ groups
- or numerical data binned into $n$ bins
- or even discrete numerical data with only a few distinct values
- suppose that we have some observed values $O_i$, for each bin $i$
- also for each bin $i$, values $E_i$ represent values expected under $H_0$
- are the observed values statistically different from expected?
Test
- $H_0$: the observed values do not differ from the given distribution, any deviation is due to chance
- $H_A$: the observed values are statistically different from the expected ones
Test Statistic $X^2$
- for each group $i$ we calculate the squared difference between the observed and expected counts
- this difference is normalized by the standard error for each group
Values:
- $O_i$ - observed count
- $E_i$ - count expected under $H_0$
Test statistic
- we can think of it as calculating $k$ $Z$ statistics (standardized differences) and summing their squares:
- $Z_i = \cfrac{O_i - E_i}{\text{SE}_i}$, where each $Z_i$ approximately follows the Normal Model
- note that $\text{SE}_i$ is the standard error of the count under $H_0$, i.e.
- $\text{SE}_i = \sqrt{E_i}$
- since deviations in either direction are evidence against $H_0$, we sum the squared $Z_i$:
- $X^2 = \sum_{i=1}^{k} Z_i^2 = \sum_{i=1}^{k} \cfrac{(O_i - E_i)^2}{E_i}$
- it's called Pearson's cumulative test statistic
- it follows $\chi^2$ distribution with $k - 1$ degrees of freedom, where $k$ is the number of categories
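As an illustration, here's a minimal R sketch of the calculation (the counts and the $H_0$ proportions are made up):

```r
obs = c(30, 50, 20)            # hypothetical observed counts
p0 = c(0.25, 0.50, 0.25)       # hypothetical proportions under H0
exp = sum(obs) * p0            # expected counts: 25 50 25
x2 = sum((obs - exp)^2 / exp)  # Pearson's X^2, here 2
```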
$p$-values
- we typically need only the upper-tail probability
- because larger values of $X^2$ correspond to stronger evidence against $H_0$
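In R the upper-tail $p$-value comes from pchisq, continuing the sketch above:

```r
pchisq(x2, df=length(obs) - 1, lower.tail=FALSE)  # P(chi^2_2 >= 2), about 0.37
```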
Conditions
- The observations must be independent
- the sample size should be big enough: every cell of the expected count table should have at least 5 cases
- $\text{df} \geqslant 2$
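A quick sanity check of the sample-size condition in R (with exp as in the sketch above):

```r
all(exp >= 5)  # TRUE when every expected count is at least 5
```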
Examples
Example: County Jurors
- suppose we have a set of 275 jurors from a small county
- they are categorized by their racial group
- do they represent the population of eligible jurors, or is there some racial bias?
- (source: OpenIntro, table 6.5)
| Race | White | Black | Hispanic | Other | Total |
|------|-------|-------|----------|-------|-------|
| County | 205 | 26 | 25 | 19 | 275 |
| Ratio | 0.75 | 0.09 | 0.09 | 0.07 | 1.00 |
| Population Ratio | 0.72 | 0.07 | 0.12 | 0.09 | 1.00 |
- it doesn't look like it's precisely representative
- might this be solely due to chance, or is there some bias?
Expected values
- we create another table where we add the expected counts
- the expected counts are the values we'd expect to see if the sample were entirely representative: $E_i = n \cdot p_i$, e.g. $E_{\text{White}} = 275 \times 0.72 = 198$
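In R, the expected counts are just the total count times the population proportions:

```r
n = 275
p = c(0.72, 0.07, 0.12, 0.09)  # population proportions: White, Black, Hispanic, Other
n * p                          # 198.00 19.25 33.00 24.75
```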
| Race | White | Black | Hispanic | Other | Total |
|------|-------|-------|----------|-------|-------|
| Observed | 205 | 26 | 25 | 19 | 275 |
| Expected | 198 | 19.25 | 33 | 24.75 | 275 |

And now we calculate the squared difference between the observed and expected values for each category:
Test:
- $H_0$: the jurors are a random sample: there is no racial bias and the observed counts reflect natural sampling variability
- $H_A$: there's racial bias in the selection
Calculation:
- $X^2 = \cfrac{(205 - 198)^2}{198} + \cfrac{(26 - 19.25)^2}{19.25} + \cfrac{(25 - 33)^2}{33} + \cfrac{(19 - 24.75)^2}{24.75} \approx 5.89$
- with $\text{df} = 4 - 1 = 3$ the $p$-value is quite big, $\approx 0.12$, so we can't reject $H_0$
R code, manual calculation:

```r
obs = c(205, 26, 25, 19)
exp = c(198, 19.25, 33, 24.75)

# Pearson's X^2 statistic
x2 = sum( (obs - exp)^2 / exp )

# plot the chi^2 density and shade the upper tail
x = seq(0, 10, 0.01)
y = dchisq(x, df=length(obs) - 1)
plot(x, y, type='l', bty='n', ylab='probability', xlab='x value')

x.min = min(which(x >= x2))
x.max = length(x)
polygon(x=x[c(x.min, x.min:x.max, x.max)], y=c(0, y[x.min:x.max], 0), col='orange')

# upper-tail p-value
pchisq(x2, df=length(obs) - 1, lower.tail=F)
```
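The built-in chisq.test gives the same result in one call, taking the observed counts and the hypothesized proportions:

```r
chisq.test(c(205, 26, 25, 19), p=c(0.72, 0.07, 0.12, 0.09))
# X-squared = 5.89, df = 3, p-value = 0.12
```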
Example: Trading
- suppose we have some data from a stock exchange
- we want to test if the stock activity on one day is independent of the previous day
- the data is taken from http://research.stlouisfed.org/fred2/series/SP500/downloaddata for 2004-08-04 to 2014-07-01
- the example is motivated by an example from [OpenIntro](OpenIntro_Statistics_(book))
Idea
- if the change in the price was positive, we say that the stock was up ($U$), otherwise we say it was down ($D$)
- if the days are really independent, then the number of days we wait before seeing $U$ should follow the [Geometric Distribution](Geometric_Distribution)
- how many days should we wait until seeing $U$? Under $H_0$ each day is up with probability $1/2$, so $P(\text{wait } k \text{ days}) = (1/2)^{k+1}$
| Days | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7+ |
|------|---|---|---|---|---|---|---|----|
| Expected | 540.5 | 270.25 | 135.13 | 67.56 | 33.78 | 16.89 | 8.45 | 7.39 |
| Actual | 450 | 298 | 150 | 85 | 53 | 22 | 13 | 10 |
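The expected row comes straight from the geometric model: the actual counts sum to $n = 1081$ streaks, and the last cell merges waits of 7, 8 and 9 days (the longest observed); a quick check in R:

```r
n = 1081                 # total number of observed streaks
e = n / 2^(1:10)         # geometric(1/2) expected counts for waits of 0..9 days
c(e[1:7], sum(e[8:10]))  # 540.50 270.25 135.13 67.56 33.78 16.89 8.45 7.39
```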
<img src="http://habrastorage.org/files/6a9/7b6/ace/6a97b6acefca4480a96660a7dd0ac6f4.png" alt="Image">
Test:
- $H_0$: stock market days are independent of each other
- i.e. we assume that the number of days before seeing $U$ follows geometric distribution
- $H_A$: not independent
Calculations:
- calculate $X^2 = \sum_{i=0}^{7} \cfrac{(O_i - E_i)^2}{E_i} \approx 43.04$
- $k = 8$, $\text{df} = 8 - 1 = 7$
- calculate the $p$-value: $p \approx 10^{-6}$
- so we reject $H_0$ and conclude that the market days are not independent of each other
<img src="http://habrastorage.org/files/bb9/c3c/ba9/bb9c3cba91844b3facdf9e0d37e35bc2.png" alt="Image">
<details>
<summary>R code</summary>
```r
# daily S&P 500 changes
sp500 = read.csv('http://goo.gl/lv268V')
values = as.numeric( as.character(sp500$VALUE) )

# 'U' when the change is positive, 'D' otherwise
change = as.factor(values > 0)
levels(change) = c('D', 'U')
change = change[complete.cases(change)]
# encode 'U' days as 1 and pad with zeros so streaks at both ends are counted
y = rep(0, length(change))
y[change == 'U'] = 1
y = c(0, y, 0)

# streak lengths: the number of 'U' days between consecutive 'D' days
wz = which(y == 0)
streak = diff(wz) - 1
# 1. chi^2 test
act = table(streak)    # observed counts for each streak length
n = length(streak)
k = length(act)
exp = n / (2 ^ (1:k))  # expected counts under the geometric model
barplot(rbind(exp, act), beside=T, col=c('skyblue', 'orange'))
legend('topright', c('expected', 'actual'), bty='n', pch=15,
col=c('skyblue', 'orange'))
x2 = sum( (act - exp)^2 / exp )
pchisq(x2, df=k - 1, lower.tail=F)
c(x2=x2, theoretic=qchisq(0.95, df=k - 1))
# 2. merge the data for 7, 8 and 9 days (their expected counts are below 5)
streak[streak >= 7] = 7
streaks = as.factor(streak)
levels(streaks)[8] = '7+'
act = table(streaks)
exp.n = c(exp[1:7], sum(exp[8:10]))  # merge the tail of the expected counts too
barplot(rbind(exp.n, act), beside=T, col=c('skyblue', 'orange'))
legend('topright', c('expected', 'actual'), bty='n', pch=15,
col=c('skyblue', 'orange'))
k = length(act)
x2 = sum( (act - exp.n)^2 / exp.n )
pchisq(x2, df=k - 1, lower.tail=F)
c(x2=x2, theoretic=qchisq(0.95, df=k - 1))
```
</details>
Links
- http://en.wikipedia.org/wiki/Goodness_of_fit#Pearson.27s_chi-squared_test
Sources
- OpenIntro Statistics (book)
- https://onlinecourses.science.psu.edu/stat504/node/61