Chi-Squared Test of Independence
This is one of the $\chi^2$ tests:
- one-way table tests - for testing Frequency Tables (Chi-Squared Goodness of Fit Test)
- two-way table tests - for testing Contingency Tables (this one)
$\chi^2$ Test of Independence
This is a Statistical Test for checking whether two attributes are dependent or not
- it is used only for categorical (descriptive) attributes
Setup
- sample of size $N$
- two categorical variables $A$ with $n$ modalities and $B$ with $m$ modalities
- $\text{dom}(A) = \{ a_1, \ldots, a_n \}$ and $\text{dom}(B) = \{ b_1, \ldots, b_m \}$
- we can represent the counts as a Contingency Table
- at each cell $(i, j)$ we denote the observed count as $O_{ij}$
- also, for each row $i$ we calculate the “row total” $r_i = \sum_{j=1}^{m} O_{ij}$
- and for each column $j$ - “column total” $c_j = \sum_{i=1}^{n} O_{ij}$
- $E_{ij}$ are values that we expect to see if $A$ and $B$ are independent
Observed Values:

|           | $b_1$    | $b_2$    | ... | $b_m$    | row total |
|-----------|----------|----------|-----|----------|-----------|
| $a_1$     | $O_{11}$ | $O_{12}$ | ... | $O_{1m}$ | $r_1$     |
| $a_2$     | $O_{21}$ | $O_{22}$ | ... | $O_{2m}$ | $r_2$     |
| ...       | ...      | ...      | ... | ...      | ...       |
| $a_n$     | $O_{n1}$ | $O_{n2}$ | ... | $O_{nm}$ | $r_n$     |
| col total | $c_1$    | $c_2$    | ... | $c_m$    | $N$       |

Expected Values:

|           | $b_1$    | $b_2$    | ... | $b_m$    | row total |
|-----------|----------|----------|-----|----------|-----------|
| $a_1$     | $E_{11}$ | $E_{12}$ | ... | $E_{1m}$ | $r_1$     |
| $a_2$     | $E_{21}$ | $E_{22}$ | ... | $E_{2m}$ | $r_2$     |
| ...       | ...      | ...      | ... | ...      | ...       |
| $a_n$     | $E_{n1}$ | $E_{n2}$ | ... | $E_{nm}$ | $r_n$     |
| col total | $c_1$    | $c_2$    | ... | $c_m$    | $N$       |
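In R, such a table with its margins can be built directly from raw data; a minimal sketch (the toy vectors are made up for illustration):

```r
# two categorical variables observed on the same sample
gender = c('f', 'f', 'm', 'm', 'f', 'm')
city = c('B', 'T', 'B', 'B', 'T', 'T')

tbl = table(gender, city)  # contingency table of observed counts O_ij
addmargins(tbl)            # adds the row totals, column totals and N
```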
Test
We want to check whether $A$ and $B$ are independent, and we perform a test for that
- $H_0$: $A$ and $B$ are independent
- $H_A$: $A$ and $B$ are not independent
We conclude that $A$ and $B$ are not independent (i.e. reject $H_0$) if we observe very large differences from the expected values
Expected Counts Calculation
Calculate
- $E_{ij}$ for a cell $(i, j)$ as
- $E_{ij} = \cfrac{\text{row $i$ total} \cdot \text{column $j$ total}}{\text{table total}} = \cfrac{r_i \cdot c_j}{N}$
or, in vectorized form,
- $E = \cfrac{1}{N} \left[\begin{matrix} r_1 \\ \vdots \\ r_n \end{matrix}\right] \left[\begin{matrix} c_1 & \cdots & c_m \end{matrix}\right]$
- with $n$ rows and $m$ columns
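A minimal sketch of this computation in R, on a toy $2 \times 2$ table (the variable names are mine):

```r
O = matrix(c(55, 45, 20, 30), nrow=2, byrow=T)  # observed counts
r.tot = rowSums(O)  # row totals r_i
c.tot = colSums(O)  # column totals c_j

E = r.tot %*% t(c.tot) / sum(O)  # expected counts E_ij = r_i * c_j / N
E
```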
$X^2$-statistic Calculation
Statistics
- assuming independence, we would expect the observed counts to be close to the expected ones, with only small deviations due to sampling variability
- so we calculate the expected values under $H_0$ and check how far the observed values are from them
- we use the standardized squared differences for that and compute the $X^2$ statistic, which under $H_0$ follows the $\chi^2$ distribution with $\text{df} = (n - 1) \cdot (m - 1)$
$X^2 = \sum_i \sum_j \cfrac{ (O_{ij} - E_{ij})^2 }{ E_{ij} }$
Apart from checking the $p$-value, we typically also compare $X^2$ with the $1-\alpha$ quantile of the $\chi^2$ distribution with $\text{df} = (n - 1) \cdot (m - 1)$
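In R, both checks are one-liners; a small sketch (6.12 is just an example value of the statistic):

```r
x2 = 6.12               # an example X^2 value
df = (3 - 1) * (2 - 1)  # for a 3x2 table

pchisq(x2, df=df, lower.tail=F)  # p-value: P(chi^2 >= x2) under H0
qchisq(p=0.95, df=df)            # the 1-alpha quantile for alpha = 0.05
```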
Size Matters
In the examples below we can see that as the sample size increases, $H_0$ gets rejected
- so the test is sensitive to the sample size
- see also here on the sample size [http://stats.stackexchange.com/questions/108911/why-does-frequentist-hypothesis-testing-become-biased-towards-rejecting-the-null/]
Cramér's $V$
Cramér's Coefficient is a Correlation measure for two categorical variables that, unlike this test, does not depend on the sample size
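For reference, for a table with $n$ rows and $m$ columns it is defined as
- $V = \sqrt{\cfrac{X^2}{N \cdot (\min(n, m) - 1)}}$
- so the $X^2$ statistic is normalized by the sample size $N$ (see the worked example below)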
Examples
Example: Gender vs City
Consider this dataset
- $\text{Dom}(X) = \{ x_1 = \text{female}, x_2 = \text{male} \}$ (Gender)
- $\text{Dom}(Y) = \{ y_1 = \text{Blois}, y_2 = \text{Tours} \}$ (City)
- $O_{12}$ - # of examples that are $x_1$ (female) and $y_2$ (Tours)
- $E_{12}$ - # of customers that are $x_1$ (female) times # of customers that are $y_2$ (live in Tours), divided by the total # of customers
If $X$ and $Y$ are independent
- $\forall i, j : O_{ij} \approx E_{ij}$ should hold
- and $X^2 \approx 0$
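A quick way to see this is to simulate two independent variables and run the test; a minimal sketch (the sample size of 1000 is arbitrary):

```r
set.seed(1)
x = sample(c('female', 'male'), size=1000, replace=T)  # independent draws
y = sample(c('Blois', 'Tours'), size=1000, replace=T)

chisq.test(table(x, y), correct=F)  # expect a small X^2 and a large p-value
```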
Small Data Set
Suppose we have the following data set
- these are our observed values
And let us also build an ideal independent data set
- here we're assuming that the two variables are totally independent
- idea: if independent, the male/female split in Blois should match the overall 50/50 split
- and likewise in Tours
Observed Counts:

|       | Male | Female | Total |
|-------|------|--------|-------|
| Blois | 55   | 45     | 100   |
| Tours | 20   | 30     | 50    |
| Total | 75   | 75     | 150   |

Expected Counts:

|       | Male | Female | Total |
|-------|------|--------|-------|
| Blois | 50   | 50     | 100   |
| Tours | 25   | 25     | 50    |
| Total | 75   | 75     | 150   |
Test
- to compute the statistic, we take the squared differences between the observed and expected counts, scaled by the expected counts
- $X^2 = \cfrac{(55-50)^2}{50} + \cfrac{(45-50)^2}{50}+\cfrac{(20-25)^2}{25}+\cfrac{(30-25)^2}{25} = 3$
- with $\text{df} = (2-1) \cdot (2-1) = 1$, the 95th percentile is 3.84, which is bigger than 3
- also, the $p$-value is 0.08 > 0.05
- $\Rightarrow$ the independence hypothesis $H_0$ is not rejected with confidence of 95% (the data is consistent with independence)
R:
```r
tbl = matrix(data=c(55, 45, 20, 30), nrow=2, ncol=2, byrow=T)
dimnames(tbl) = list(City=c('B', 'T'), Gender=c('M', 'F'))

chi2 = chisq.test(tbl, correct=F)
c(chi2$statistic, chi2$p.value)  # X-squared = 3, p-value = 0.083
```
Bigger Data Set
Now assume that we have the same dataset
- but everything is multiplied by 10
Observed values:

|       | Male | Female | Total |
|-------|------|--------|-------|
| Blois | 550  | 450    | 1000  |
| Tours | 200  | 300    | 500   |
| Total | 750  | 750    | 1500  |

Values if independent:

|       | Male | Female | Total |
|-------|------|--------|-------|
| Blois | 500  | 500    | 1000  |
| Tours | 250  | 250    | 500   |
| Total | 750  | 750    | 1500  |
Test
- since values grow, the differences between actual and ideal also grow
- and therefore the square of differences also gets bigger
- $X^2 = \cfrac{(550-500)^2}{500} + \cfrac{(450-500)^2}{500}+\cfrac{(200-250)^2}{250}+\cfrac{(300-250)^2}{250} = 30$
- with $\text{df} = 1$, the 95th percentile is 3.84
- it's less than 30
- and the $p$-value is $\approx 4 \cdot 10^{-8}$
- $\Rightarrow$ the independence hypothesis is rejected with a confidence of 95%
R:
```r
tbl = matrix(data=c(55, 45, 20, 30) * 10, nrow=2, ncol=2, byrow=T)
dimnames(tbl) = list(City=c('B', 'T'), Gender=c('M', 'F'))

chi2 = chisq.test(tbl, correct=F)
c(chi2$statistic, chi2$p.value)  # X-squared = 30, p-value = 4.3e-08
```
So we see that the sample size matters
- a possible solution is to use Cramér's Coefficient, which measures how strongly two variables are associated regardless of the sample size, as the sketch below shows
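A minimal sketch of the computation, using the definition from the Cramér's $V$ section above (the helper name `cramers.v` is mine):

```r
cramers.v = function(tbl) {
  x2 = chisq.test(tbl, correct=F)$statistic
  sqrt(x2 / (sum(tbl) * (min(dim(tbl)) - 1)))
}

tbl = matrix(c(55, 45, 20, 30), nrow=2, ncol=2, byrow=T)
c(cramers.v(tbl), cramers.v(tbl * 10))  # both 0.141: unaffected by the 10x scaling
```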
Example: Search Algorithm
Suppose a search engine wants to test new search algorithms
- e.g. sample of 10k queries
- 5k are served with the old algorithm
- 2.5k are served with the test1 algorithm
- 2.5k are served with the test2 algorithm
Test:
- the goal is to see if there's any difference in the performance
- $H_0$: algorithms perform equally well
- $H_A$: they perform differently
How do we quantify the quality?
- we can view it as the user's interaction with the system in the following way
- success: user clicked on at least one of the provided links and didn’t try a new search
- failure: user performed a new search
So we record the outcomes
The combinations are binned into a two-way table:

Observed outcomes:

|         | current | test 1 | test 2 | total |
|---------|---------|--------|--------|-------|
| success | 3511    | 1749   | 1818   | 7078  |
| failure | 1489    | 751    | 682    | 2922  |
| total   | 5000    | 2500   | 2500   | 10000 |
Expected counts
- the proportion of users who are satisfied with the search is $7078/10000 = 0.7078$
- so we expect that 70.78% of the 5000 queries served by the current algorithm will also be successful
- which gives us an expected count of $0.7078 \cdot 5000 = 3539$
- i.e. if there is no difference between the groups, 3539 users of the current algorithm group will not perform a new search
Observed and (expected) outcomes:

|         | current     | test 1        | test 2        | total |
|---------|-------------|---------------|---------------|-------|
| success | 3511 (3539) | 1749 (1769.5) | 1818 (1769.5) | 7078  |
| failure | 1489 (1461) | 751 (730.5)   | 682 (730.5)   | 2922  |
| total   | 5000        | 2500          | 2500          | 10000 |
Now we can compute the $X^2$ test statistic
- $X^2 = \cfrac{( 3511 - 3539 )^2}{ 3539 } + \cfrac{( 1489 - 1461 )^2}{ 1461 } + \cfrac{( 1749 - 1769.5 )^2}{ 1769.5 } + \cfrac{( 751 - 730.5 )^2}{ 730.5 } + \cfrac{( 1818 - 1769.5 )^2}{ 1769.5 } + \cfrac{( 682 - 730.5 )^2}{ 730.5 } = 6.12$
- under $H_0$ it follows the $\chi^2$ distribution with $\text{df} = (3 - 1) \cdot (2 - 1) = 2$
- the $p$ value is $p=0.047$, which is less than $\alpha = 0.05$ so we can reject $H_0$
- also, it makes sense to have a look at the expected $X^2$ for $\alpha = 0.05$, which is $X^2_{\text{exp}} = 5.99$, and $X^2_{\text{exp}} < X^2$
R:
```r
obs = matrix(c(3511, 1749, 1818, 1489, 751, 682), nrow=2, ncol=3, byrow=T)
dimnames(obs) = list(outcome=c('click', 'new search'), algorithm=c('current', 'test 1', 'test 2'))

tot = sum(obs)
row.tot = rowSums(obs)
col.tot = colSums(obs)

# expected counts: outer product of the margins divided by the total
exp = row.tot %*% t(col.tot) / tot
dimnames(exp) = dimnames(obs)

x2 = sum((obs - exp)^2 / exp)

df = prod(dim(obs) - 1)
pchisq(x2, df=df, lower.tail=F)  # p-value: 0.047
qchisq(p=0.95, df=df)            # critical value: 5.99
```
Or we can use the `chisq.test` function:
```r
test = chisq.test(obs, correct=F)
test$expected
c('p-value'=test$p.value, test$statistic)
```
Links
- http://en.wikipedia.org/wiki/Pearson%27s_chi-squared_test#Test_of_independence