Cramer's Coefficient

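Note about the $\chi^2$ Test of Independence:

  • when the size of a data set increases, the gaps between observed and expected counts also increase, and the $\chi^2$ statistic grows with them
  • this happens even if the distribution (the cell proportions) remains unchanged
  • thus, as long as the proportions deviate even slightly from independence, we end up rejecting the independence hypothesis once the data set is large enough
  • Cramer's Coefficient provides a solution for that by normalizing $\chi^2$ by its maximal possible value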


Definition

Cramer's coefficient $V$ is defined as

  •  $V = \sqrt{ \cfrac{\chi^2}{\chi^2_\text{max} } }$
  • with $\chi^2_\text{max} = N \times ( \min(R, C) - 1 )$ where
    • $N$ is the number of tuples (observations), and $R$ and $C$ are the numbers of rows and columns of the contingency table, i.e. the numbers of distinct values of the two attributes
  • $V \in [0, 1]$
  • 0 means independence, and 1 means maximal association
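
To see why this normalization removes the dependence on the data set size, the following short derivation may help. It is a sketch in notation that is assumed here and not used elsewhere in these notes: $O_{ij}$ are the observed counts, $E_{ij}$ the expected counts under independence, $k$ a factor by which every count is multiplied, and $m$ the factor of $\chi^2_\text{max}$ that depends only on the dimensions of the table, so that $\chi^2_\text{max} = N \times m$.

```latex
% Multiply every cell count by k: O_{ij} -> k O_{ij}, hence N -> k N.
\begin{align*}
E_{ij} &= \frac{O_{i\cdot}\, O_{\cdot j}}{N}
  \;\to\; \frac{(k\,O_{i\cdot})(k\,O_{\cdot j})}{kN} = k\,E_{ij} \\
\chi^2 &= \sum_{i,j} \frac{(O_{ij} - E_{ij})^2}{E_{ij}}
  \;\to\; \sum_{i,j} \frac{k^2\,(O_{ij} - E_{ij})^2}{k\,E_{ij}} = k\,\chi^2 \\
V &= \sqrt{\frac{\chi^2}{N \times m}}
  \;\to\; \sqrt{\frac{k\,\chi^2}{kN \times m}} = V
\end{align*}
```

So $\chi^2$ scales linearly with the size of the data set, while $V$ stays the same as long as the proportions do not change.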


Example

Consider the same example as for $\chi^2$ Test

Small Dataset

City    Male  Female  Total
Blois     55      45    100
Tours     20      30     50
Total     75      75    150

Bigger Dataset

City    Male  Female  Total
Blois    550     450   1000
Tours    200     300    500
Total    750     750   1500


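Both tables have 2 rows and 2 columns, so $\chi^2_\text{max} = N$. The $\chi^2$ statistic is 3 for the small dataset ($N = 150$) and 30 for the bigger one ($N = 1500$), so in both cases

$V = \sqrt{ 3 / 150 } = \sqrt{ 30 / 1500 } \approx 0.14$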
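
The numbers above can be checked with a few lines of R. This is a minimal sketch, not a reference implementation: the helper name `cramers_v` and the variable names are made up for this illustration, and only `chisq.test` from base R is used.

```r
# Cramer's V for a contingency table of counts (hypothetical helper name).
cramers_v <- function(tab) {
  # Pearson chi-square statistic, without the Yates continuity correction
  chi2 <- unname(chisq.test(tab, correct = FALSE)$statistic)
  n <- sum(tab)            # number of observations
  k <- min(dim(tab)) - 1   # min(rows, columns) - 1
  sqrt(chi2 / (n * k))
}

# The small dataset from the example; matrix() fills column by column.
small <- matrix(c(55, 20, 45, 30), nrow = 2,
                dimnames = list(City = c("Blois", "Tours"),
                                Sex = c("Male", "Female")))
# The bigger dataset has the same proportions with ten times the counts.
big <- small * 10

chi2_small <- unname(chisq.test(small, correct = FALSE)$statistic)
chi2_big <- unname(chisq.test(big, correct = FALSE)$statistic)
v_small <- cramers_v(small)
v_big <- cramers_v(big)

print(c(chi2_small, chi2_big))      # the statistic grows with the data set
print(round(c(v_small, v_big), 4))  # V is the same for both tables
```

The two $\chi^2$ values differ by a factor of ten, while both values of $V$ are about 0.14.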


R

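# Cramér's V from two raw categorical vectors of equal length.
# Assumes x and y contain no missing values, since length(x) is used as N.
cv.test <- function(x, y) {
  # Pearson chi-square statistic, without the Yates continuity correction
  chi2 <- chisq.test(x, y, correct = FALSE)$statistic
  # Normalize by N * (min(number of distinct values of x and of y) - 1)
  CV <- sqrt(chi2 / (length(x) * (min(length(unique(x)), length(unique(y))) - 1)))
  print(noquote("Cramér V / Phi:"))
  return(as.numeric(CV))
}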

So we can get Cramer's V as

helpdata = read.csv("http://www.math.smith.edu/r/data/help.csv")
with(helpdata, cv.test(female, homeless))

or, when the data is already a contingency table of counts:

cv.test <- function(x) {
  # x is a contingency table (a matrix or table of counts).
  # The Pearson chi-square statistic (without the Yates correction) is divided by
  # the sum of the table cells multiplied by
  # (the smaller of the number of rows and the number of columns, minus 1).
  # $statistic passes only the X^2 value into the sqrt function.
  CV <- sqrt(chisq.test(x, correct = FALSE)$statistic / (sum(x) * (min(dim(x)) - 1)))

  print(noquote("Cramér V / Phi:"))
  return(as.numeric(CV))
}
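
The vector-based and the table-based variants should agree whenever the table is the cross-tabulation of the raw vectors. The sketch below illustrates this on the small Blois/Tours dataset, expanded into one row per person. The function names `cv_from_vectors` and `cv_from_table` and the vectors `city` and `sex` are hypothetical, chosen only to keep the two variants apart in one script.

```r
# Variant 1: two raw categorical vectors (assumes no missing values).
cv_from_vectors <- function(x, y) {
  chi2 <- unname(chisq.test(x, y, correct = FALSE)$statistic)
  sqrt(chi2 / (length(x) * (min(length(unique(x)), length(unique(y))) - 1)))
}

# Variant 2: a contingency table of counts.
cv_from_table <- function(tab) {
  chi2 <- unname(chisq.test(tab, correct = FALSE)$statistic)
  sqrt(chi2 / (sum(tab) * (min(dim(tab)) - 1)))
}

# Expand the small dataset into one observation per person.
counts <- c(55, 45, 20, 30)
city <- rep(c("Blois", "Blois", "Tours", "Tours"), times = counts)
sex <- rep(c("Male", "Female", "Male", "Female"), times = counts)

# Cross-tabulate the raw vectors to recover the contingency table.
tab <- table(city, sex)

v1 <- cv_from_vectors(city, sex)
v2 <- cv_from_table(tab)
print(all.equal(v1, v2))  # both variants give the same coefficient
```

Which variant is more convenient depends on whether the data arrives as raw observations or as an already aggregated table.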

