# ML Wiki

## Chi-Squared Goodness of Fit Test

This is one of $\chi^2$ tests

### $\chi^2$ One-Way Table Test

This is a method for assessing a null model when the data is binned

Used when:

• given a sample of cases that can be classified into several groups, determine if the sample is representative of the general population
• evaluate is the data resemble some distribution, e.g. normal or geometric ("Goodness Of Fit")

### Idea

Goodness of Fit test:

• suppose we have a variable with $n$ modalities
• it can be a categorical variable with $n$ groups
• or numerical data binned into $n$ bins
• or even discrete numerical data with not many distinct values
• suppose that we have some observed values $O_i$, for each bin $i$
• also for each bin $i$, values $E_i$ represent values expected under $H_0$
• are the observed values statistically different from expected?

### Test

• $H_0$: the observed values do not differ from given distribution
• $H_A$: the observed values are statistically different from expected

### Test Statistics $X^2$

• for each group $i$ we calculate the squared difference between observed and expected
• this difference is normalized with standard error for each group

Values:

• $O_i$ - observed count
• $E_i$ - count expected under $H_0$

Test statistics

• we can think of it as calculating $n$ $Z$ statistics (standardized differences) and sum them up:
• $Z_i = \cfrac{O_i - E_i}{\text{SE}_i}$, each $Z_i$ follows the Normal Model
• note that $\text{SE}_i$ is a sampling distribution under $H_0$, i.e.
• $\text{SE}_i = \sqrt{ E_i }$
• Since we want to minimize the squared error, we calculate
• $X^2 = \sum_{i=1}^{k} Z^2_i = \sum_{i=1}^{k} \cfrac{(O_i - E_i)^2}{ E_i }$
• it's called Pearson's cumulative test stat
• it follows $\chi^2$ distribution with $k - 1$ degrees of freedom, where $k$ is the number of categories

### $p$-values

• typically need only upper-tail values
• because the larger values correspond to stronger evidence against $H_0$

### Conditions

• The observations must be independent
• Sample size should be big enough, so we should have at least 5 at each cell of the expected count table
• $\text{df} \geqslant 2$

## Examples

### Example: County Jurors

• suppose we have a set of 275 jurors from a small county
• they are categorized with their racial group
• are they representing the population of eligible jurors or there's some racial bias?
• (source: OpenIntro, table 6.5)

Race White Black Hispanic Other Total
County 205 26 25 19 275
Ratio 0.75 0.09 0.09 0.07 1.00
Population Ratio 0.72 0.07 0.12 1.00

• It doesn't look like it's precisely representative
• might it be solely due to chance or there's some bias?

Expected values

• What we do is to create another table, where we add expected
• Expected numbers represent the values we expect to see if the sample set was entirely representative

Race White Black Hispanic Other Total
Observed 205 26 25 19 275
Expected 198 19.25 33 24.75 275

And now we calculate the squared difference between observed and expected values for each category:

Test:

• $H_0$: the jurors are random sample, there is no racial bias and the observed counts reflect natural sampling variability
• $H_A$: there's racial bias in the selection

Calculation:

• $X^2 = \cfrac{(205 - 198)^2}{198} + \cfrac{(26 - 19.25)^2}{19.25} + \cfrac{(25 - 33)^2}{33} + \cfrac{(19 - 24.75)^2}{24.75} \approx 5.8$
• $p$-values is quite big: 0.11 - so we can't reject $H_0$

R code Manual:

obs = c(205, 26, 25, 19)
exp = c(198, 19.25, 33, 24.75)

x2 = sum( (obs-exp)^2 / exp )

x = seq(0, 10, 0.01)
y = dchisq(x, df=length(obs) - 1)
plot(x, y, type='l', bty='n', ylab='probability', xlab='x value')

x.min = min(which(x >= x2))
x.max = length(x)
polygon(x=x[c(x.min, x.min:x.max, x.max)],
y=c(0, y[x.min:x.max], 0),
col='orange')

pchisq(x2, df=length(obs) - 1, lower.tail=F)


• Suppose that we have some data from some stock exchange
• we want to test if stock activity on one day is independent from previous day
• the data is taken [1] for 2004-08-04 to 2014-07-01
• example motivated by an example from OpenIntro

Idea

• If the change in the price was positive, we say that that stock was up ($U$), otherwise we say it was down $D$)
• if the days are really independent, then the number of days before seeing $U$ should follow Geometric Distribution.
• How many days should we wait until seeing $U$?

Days 0 1 2 3 4 5 6 7+
Expected 540.5 270.25 135.13 67.56 33.78 16.89 8.45 7.39
450.0 298.00 150.00 85.00 53.00 22.00 13.00 10.00

Test:

• $H_0$: stock marked days are independent from each other
• i.e. we assume that the number of days before seeing $U$ follows geometric distribution
• $H_A$: not independent

Calculations:

• calculate $X^2 = \sum_{i=0}^{7} \cfrac{(O_i - E_i)^2}{E_i} \approx 43.04$
• $k = 8$, $\text{df} = 8 - 1 = 7$,
• calculate the $p$ value: $p \approx 10^{-6}$
• so we reject $H_0$ and conclude that the market days are not independent from each other

R code
sp500 = read.csv('http://goo.gl/lv268V')
values = as.numeric( as.character(sp500\$VALUE) )
change = as.factor(values > 0)
levels(change) = c('D', 'U')

change = change[complete.cases(change)]

y = rep(0, length(change))
y[change == 'U'] = 1
y = c(0, y, 0)
wz = which(y == 0)
streak = diff(wz) - 1

# chi^2 test
act = table(streak)

n = length(streak)
k = length(act)
exp = n / (2 ^ (1:k))

barplot(rbind(exp, act), beside=T, col=c('skyblue', 'orange'))
legend('topright', c('expected', 'actual'), bty='n', pch=15,
col=c('skyblue', 'orange'))

x2 = sum( (act - exp)^2 / exp )

pchisq(x2, df=k - 1, lower.tail=F)
c(x2=x2, theoretic=qchisq(0.95, df=k - 1))

# let's merge the data for 7,8 and 9 days
streak[streak >= 7] = 7
streaks = as.factor(streak)
levels(streaks)[8] = '7+'

act = table(streaks)
exp.n = c(exp[1:7], sum(exp[8:10]))
barplot(rbind(exp.n, act), beside=T, col=c('skyblue', 'orange'))
legend('topright', c('expected', 'actual'), bty='n', pch=15,
col=c('skyblue', 'orange'))

k = length(act)
x2 = sum( (act - exp.n)^2 / exp.n )

pchisq(x2, df=k - 1, lower.tail=F)
c(x2=x2, theoretic=qchisq(0.95, df=k - 1))