
Chi-Squared Test of Independence

This is one of the χ² tests


χ² Test of Independence

This is a Statistical Test that tells whether two attributes are dependent or not

  • it is used only for categorical (descriptive) attributes


Setup

  • sample of size N
  • two categorical variables: A with n modalities and B with m modalities
  • dom(A) = {a1, ..., an} and dom(B) = {b1, ..., bm}
  • we can represent the counts as a Contingency Table
  • in each cell (i,j) we denote the observed count by $O_{ij}$
  • for each row j we calculate the "row total" $r_j = \sum_{i=1}^{n} O_{ij}$
  • and for each column i, the "column total" $c_i = \sum_{j=1}^{m} O_{ij}$
  • $E_{ij}$ are the values that we expect to see if A and B are independent


Observed Values

           a1    a2    ...   an    row total
b1         O11   O21   ...   On1   r1
b2         O12   O22   ...   On2   r2
...        ...   ...   ...   ...   ...
bm         O1m   O2m   ...   Onm   rm
col total  c1    c2    ...   cn    N

Expected Values

           a1    a2    ...   an    row total
b1         E11   E21   ...   En1   r1
b2         E12   E22   ...   En2   r2
...        ...   ...   ...   ...   ...
bm         E1m   E2m   ...   Enm   rm
col total  c1    c2    ...   cn    N
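
A minimal R sketch of this setup, with hypothetical data for A and B (table builds the contingency table, rowSums and colSums give the totals):

# hypothetical categorical data
A = c('a1', 'a1', 'a2', 'a2', 'a2', 'a1', 'a2', 'a1')
B = c('b1', 'b2', 'b1', 'b1', 'b2', 'b1', 'b2', 'b2')

O = table(B, A)        # contingency table: rows = modalities of B, cols = modalities of A
row.tot = rowSums(O)   # row totals r_j
col.tot = colSums(O)   # column totals c_i
N = sum(O)             # table total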


Test

We want to check whether the two variables are independent, and we perform a test for that

  • H0: A and B are independent
  • HA: A and B are not independent


We conclude that A and B are not independent (i.e. we reject H0) if we observe very large differences from the expected values


Expected Counts Calculation

Calculate

  • $E_{ij}$ for a cell (i,j) as
  • $E_{ij} = \cfrac{\text{row } j \text{ total} \times \text{column } i \text{ total}}{\text{table total}} = \cfrac{r_j \, c_i}{N}$

or, in vectorized form,

  • $E = \begin{bmatrix} r_1 \\ r_2 \\ \vdots \\ r_m \end{bmatrix} \times \begin{bmatrix} c_1 & c_2 & \dots & c_n \end{bmatrix} \times \cfrac{1}{N}$
  • an outer product of the row and column totals: a matrix with m rows and n columns, like the table itself
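
A minimal R sketch of this outer product, using the row and column totals of the small data set below (rows Blois/Tours, columns Male/Female):

row.tot = c(100, 50)  # Blois, Tours
col.tot = c(75, 75)   # Male, Female
N = sum(row.tot)      # table total

# outer product of the totals, scaled by 1/N
E = row.tot %*% t(col.tot) / N
E   # expected counts: 50 50 / 25 25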


X² Statistic Calculation

Statistic

  • assuming independence, we would expect the observed counts to be close to the expected counts, with small deviations due to sampling variability
  • so we calculate the expected values under H0 and check how far the observed values are from them
  • we use the standardized squared differences for that and calculate the X² statistic, which under H0 follows the χ² distribution with df = (n−1)(m−1)


$$X^2 = \sum_i \sum_j \cfrac{(O_{ij} - E_{ij})^2}{E_{ij}}$$


Apart from checking the p-value, we typically also check the 1−α percentile of χ² with df = (n−1)(m−1): if X² exceeds it, we reject H0
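
For instance, a minimal sketch of this check in R for a 2×2 table, using the statistic X² = 3 computed in the first example below:

X2 = 3                   # statistic from the small Gender vs City table
df = (2 - 1) * (2 - 1)   # df for a 2x2 table

pchisq(X2, df=df, lower.tail=F)  # p-value, ~0.083
qchisq(p=0.95, df=df)            # 1 - alpha percentile, ~3.84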


Size Matters

In the examples below we can see that as the sample size increases, H0 gets rejected

  • so the test is sensitive to the sample size
  • see also here on the sample size [1]
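
A quick sketch of this sensitivity: scaling the same observed table by a growing factor keeps the proportions unchanged but drives the p-value down.

tbl = matrix(c(55, 45, 20, 30), nrow=2, byrow=T)
for (k in c(1, 5, 10)) {
  print(chisq.test(tbl * k, correct=F)$p.value)  # ~0.083, ~1e-4, ~4e-8
}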


Cramer's V

Cramér's Coefficient (Cramér's V) is a Correlation measure for two categorical variables that, unlike this test, does not depend on the sample size
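
A minimal sketch, using the standard formula $V = \sqrt{X^2 / (N \cdot \min(n-1, m-1))}$; note that scaling the table does not change the value:

cramers.v = function(tbl) {
  x2 = chisq.test(tbl, correct=F)$statistic
  N = sum(tbl)
  k = min(nrow(tbl), ncol(tbl)) - 1
  as.numeric(sqrt(x2 / (N * k)))
}

tbl = matrix(c(55, 45, 20, 30), nrow=2, byrow=T)
cramers.v(tbl)       # ~0.14
cramers.v(tbl * 10)  # the same ~0.14, unaffected by the sample size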


Examples

Example: Gender vs City

Consider this dataset

  • Dom(X) = {x1 = female, x2 = male} (Gender)
  • Dom(Y) = {y1 = Blois, y2 = Tours} (City)
  • $O_{12}$ is the # of customers that are x1 (female) and y2 (Tours)
  • $E_{12}$ is the # of customers that are x1 (female), times the # of customers that are y2 (live in Tours), divided by the total # of customers

If X and Y are independent

  • $\forall i,j: O_{ij} \approx E_{ij}$ should hold
  • and $X^2 \approx 0$


Small Data Set

Suppose we have the following data set

  • these are our observed values

And let us also build an ideal, independent data set

  • here we're assuming that the values are totally independent
  • idea: if independent, we should have exactly the same # of males and females in Blois
  • and the same # of males and females in Tours
Observed Counts

        Male  Female  Total
Blois   55    45      100
Tours   20    30      50
Total   75    75      150

Expected Counts

        Male  Female  Total
Blois   50    50      100
Tours   25    25      50
Total   75    75      150


Test

  • to compute the statistic, we take the squared differences between the observed and expected counts, each scaled by the expected count
  • $X^2 = \cfrac{(55-50)^2}{50} + \cfrac{(45-50)^2}{50} + \cfrac{(20-25)^2}{25} + \cfrac{(30-25)^2}{25} = 3$
  • with df = (2−1)(2−1) = 1, the 95th percentile is 3.84, which is bigger than 3
  • also, the p-value is 0.08 > 0.05
  • so the independence hypothesis H0 is not rejected at the 95% confidence level (they're probably independent)


R:

# observed counts: rows = City, columns = Gender
tbl = matrix(data=c(55, 45, 20, 30), nrow=2, ncol=2, byrow=T)
dimnames(tbl) = list(City=c('B', 'T'), Gender=c('M', 'F'))

# correct=F turns off Yates' continuity correction
chi2 = chisq.test(tbl, correct=F)
c(chi2$statistic, chi2$p.value)


Bigger Data Set

Now assume that we have the same dataset

  • but everything is multiplied by 10
Observed Values

        Male  Female  Total
Blois   550   450     1000
Tours   200   300     500
Total   750   750     1500

Expected Values (if independent)

        Male  Female  Total
Blois   500   500     1000
Tours   250   250     500
Total   750   750     1500


Test

  • since the counts grow, the differences between the observed and expected values also grow
  • and therefore the squared differences get bigger as well
  • $X^2 = \cfrac{(550-500)^2}{500} + \cfrac{(450-500)^2}{500} + \cfrac{(200-250)^2}{250} + \cfrac{(300-250)^2}{250} = 30$
  • with df = 1, the 95th percentile is 3.84
  • it's less than 30
  • and the p-value is on the order of $10^{-8}$
  • so the independence hypothesis is rejected at the 95% confidence level


# same table as before, scaled by 10
tbl = matrix(data=c(55, 45, 20, 30) * 10, nrow=2, ncol=2, byrow=T)
dimnames(tbl) = list(City=c('B', 'T'), Gender=c('M', 'F'))

chi2 = chisq.test(tbl, correct=F)
c(chi2$statistic, chi2$p.value)

So we see that the sample size matters


Example: Search Algorithm

Suppose a search engine wants to test new search algorithms

  • e.g. sample of 10k queries
  • 5k are served with the old algorithm
  • 2.5k are served with test1 algorithm
  • 2.5k are served with test2 algorithm


Test:

  • the goal is to see if there's any difference in performance
  • H0: algorithms perform equally well
  • HA: they perform differently


How do we quantify the quality?

  • we can view quality through the user's interaction with the system in the following way
  • success: user clicked on at least one of the provided links and didn't try a new search
  • failure: user performed a new search


So we record the outcomes

Observed Outcomes

          current  test 1  test 2  total
success   3511     1749    1818    7078
failure   1489     751     682     2922
total     5000     2500    2500    10000


The combinations are binned into a two-way table

Expected counts

  • the proportion of users who are satisfied with the search is 7078/10000 = 0.7078
  • so we expect that 70.78% of the 5000 users served by the current algorithm will also be satisfied
  • which gives us an expected count of 0.7078 × 5000 = 3539
  • i.e. if there is no difference between the groups, 3539 users of the current algorithm group will not perform a new search


Observed and (Expected) Outcomes

          current       test 1          test 2          total
success   3511 (3539)   1749 (1769.5)   1818 (1769.5)   7078
failure   1489 (1461)   751 (730.5)     682 (730.5)     2922
total     5000          2500            2500            10000


Now we can compute the X² test statistic

  • $X^2 = \cfrac{(3511-3539)^2}{3539} + \cfrac{(1489-1461)^2}{1461} + \cfrac{(1749-1769.5)^2}{1769.5} + \cfrac{(751-730.5)^2}{730.5} + \cfrac{(1818-1769.5)^2}{1769.5} + \cfrac{(682-730.5)^2}{730.5} = 6.12$
  • under H0 it follows the χ² distribution with df = (3−1)(2−1) = 2
  • the p-value is p = 0.047, which is less than α = 0.05, so we can reject H0
  • it also makes sense to have a look at the critical value of χ² for α = 0.05, which is 5.99, and 5.99 < 6.12


R:

# observed counts: rows = outcome, columns = algorithm
obs = matrix(c(3511, 1749, 1818, 1489, 751, 682), nrow=2, ncol=3, byrow=T)
dimnames(obs) = list(outcome=c('click', 'new search'),
                     algorithm=c('current', 'test 1', 'test 2'))

tot = sum(obs)
row.tot = rowSums(obs)
col.tot = colSums(obs)

# expected counts under H0: outer product of the totals divided by the table total
exp = row.tot %*% t(col.tot) / tot
dimnames(exp) = dimnames(obs)

# X^2 statistic: standardized squared differences
x2 = sum( (obs - exp)^2 / exp )

df = prod(dim(obs) - 1)          # df = (rows - 1) * (columns - 1)
pchisq(x2, df=df, lower.tail=F)  # p-value
qchisq(p=0.95, df=df)            # critical value at alpha = 0.05

Or we can use the chisq.test function

test = chisq.test(obs, correct=F)
test$expected
c('p-value'=test$p.value, test$statistic)


Links

Sources