Solved – What do these chi-square results mean

chi-squared-test, independence, r

I have some items described by 43 categories like this:

Dataset Item Category1 Category2...               Category43
D1      Item1 1        0              0    ...    1 ...
D1      Item3 1        0              0    ...    1 ...
D2      Item4 1        0              0    ...    1 ...
..

What I did was to create a frequency table like this

Dataset Category1 Category2...               Category43
D1        617       388    ...                 827
D2        1234      7272   ...                 1237

I am testing to see if there is a relationship between the dataset type and the category frequency counts.

I have the following data as the output of dput:

structure(list(data.OldFrequency = c(617L, 388L, 6L, 9L, 1344L, 
857L, 30L, 63L, 60L, 22L, 23L, 107L, 9L, 16L, 9L, 10L, 14L, 28L, 
9L, 174L, 245L, 103L, 4096L, 121L, 6L, 48L, 189L, 33L, 1426L, 
64L, 16L, 135L, 77L, 26L, 110L, 44L, 75L, 1610L, 1022L, 38L, 
1578L, 242L, 67L), data.NewFrequency = c(1220L, 959L, 307L, 29L, 
5093L, 771L, 65L, 125L, 120L, 41L, 187L, 203L, 11L, 87L, 20L, 
159L, 45L, 68L, 60L, 11L, 644L, 51L, 7053L, 159L, 6L, 162L, 208L, 
52L, 3277L, 27L, 594L, 79L, 95L, 119L, 96L, 84L, 180L, 2991L, 
2227L, 34L, 2249L, 37L, 29L)), .Names = c("data.OldFrequency", 
"data.NewFrequency"), row.names = c(NA, -43L), class = "data.frame")

Running chisq.test using this gives me the following:

Pearson's Chi-squared test

data:  d 
X-squared = 2551.405, df = 42, p-value < 2.2e-16

Warning message:
In chisq.test(d) : Chi-squared approximation may be incorrect

I am confused about which null hypothesis this is testing and what the implications are. Can someone please help me understand how to interpret this? I am not a statistician and would love it if someone could explain this in simple words. Also, how would I fix the warning message?

Best Answer

The $\chi^2$ test you have run tests whether there is a relationship between dataset (old/new) and category. The null hypothesis is that there is no relationship, meaning that items are spread across the 43 categories in the same proportions in the old dataset as in the new one.
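To make the null hypothesis concrete, here is a minimal sketch that puts the per-dataset category shares side by side. It uses only the first five categories from the posted data, so the shares are relative to those five categories rather than to all 43; the names `counts` and `shares` are illustrative and not part of the original post.

```r
# First five categories from the posted data (a subset, for illustration only)
old <- c(617, 388, 6, 9, 1344)
new <- c(1220, 959, 307, 29, 5093)
counts <- cbind(Old = old, New = new)
rownames(counts) <- paste0("Category", 1:5)

# Share of each dataset's items falling in each category (each column sums to 1)
shares <- prop.table(counts, margin = 2)
round(shares, 3)
```

If the null hypothesis held, the Old and New columns would differ only by sampling noise. Large gaps between them are what the test picks up.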

The $\chi^2$ test is constructed by comparing the observed counts to the counts that would be expected under that null hypothesis. The code below manually recreates what the chisq.test function does to compute the test statistic. This statistic is then compared to the $\chi^2$ distribution with $(43-1)\times(2-1)=42$ degrees of freedom, which is the distribution it would have under the null hypothesis. Any suspiciously large value (i.e. one unlikely to be that large under the null hypothesis) casts doubt on the null hypothesis of no relationship.
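The comparison against the reference distribution can be sketched directly with base R's `qchisq` and `pchisq`. The statistic is taken from the output in the question; the 5% significance level is an assumption chosen for illustration, and the variable names are hypothetical.

```r
x_squared <- 2551.405            # statistic reported by chisq.test in the question
df <- (43 - 1) * (2 - 1)         # (categories - 1) * (datasets - 1)

# Value the statistic would exceed only 5% of the time under the null hypothesis
crit <- qchisq(0.95, df = df)

# Upper-tail probability of seeing a statistic at least this large under the null
p_value <- pchisq(x_squared, df = df, lower.tail = FALSE)

x_squared > crit                 # is the statistic beyond the 5% cutoff?
p_value < 2.2e-16                # consistent with the reported "p-value < 2.2e-16"?
```

The upper-tail probability here is so small that it may print as 0; passing `log.p = TRUE` to `pchisq` returns its logarithm instead.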

I am pretty sure the warning comes about because one of the expected values is below five. In this case I don't think I'd worry too much about that: only one cell has an expected value below five, and it has little influence on the very large test statistic you get.

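# Marginal totals: one per category (rows of d) and one per dataset (columns of d)
category_totals <- rowSums(d)
dataset_totals <- colSums(d)
# Expected counts under independence: row total * column total / grand total
expect <- outer(category_totals, dataset_totals) / sum(d)
round(expect)
# Number of cells with an expected count below five
sum(expect < 5)
# Pearson chi-squared statistic; should match the X-squared from chisq.test(d)
sum((d - expect)^2 / expect)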
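On the question of how to get rid of the warning: the answer above does not prescribe a fix. One option in base R, offered as a sketch and not as the answerer's recommendation, is to inspect the expected counts that `chisq.test` stores and then request a Monte Carlo p-value with `simulate.p.value = TRUE`, which does not rely on the large-sample approximation. The data frame is rebuilt from the `dput` output in the question; the seed and `B = 2000` replicates are arbitrary choices.

```r
# Rebuild the data frame from the dput output in the question
d <- data.frame(
  data.OldFrequency = c(617, 388, 6, 9, 1344, 857, 30, 63, 60, 22, 23, 107,
    9, 16, 9, 10, 14, 28, 9, 174, 245, 103, 4096, 121, 6, 48, 189, 33, 1426,
    64, 16, 135, 77, 26, 110, 44, 75, 1610, 1022, 38, 1578, 242, 67),
  data.NewFrequency = c(1220, 959, 307, 29, 5093, 771, 65, 125, 120, 41, 187,
    203, 11, 87, 20, 159, 45, 68, 60, 11, 644, 51, 7053, 159, 6, 162, 208, 52,
    3277, 27, 594, 79, 95, 119, 96, 84, 180, 2991, 2227, 34, 2249, 37, 29)
)

# Standard test; the warning is suppressed here only so we can inspect the result
res <- suppressWarnings(chisq.test(d))

# Expected counts computed by chisq.test, and which cells fall below five
n_small <- sum(res$expected < 5)
which(res$expected < 5, arr.ind = TRUE)

# Monte Carlo p-value: simulates tables with the same margins instead of
# relying on the chi-squared approximation, so no warning is issued
set.seed(1)  # arbitrary seed, for reproducibility only
sim <- chisq.test(d, simulate.p.value = TRUE, B = 2000)
sim$p.value
```

With `B` replicates the simulated p-value cannot go below `1 / (B + 1)`, so it will look much larger than the asymptotic one while still pointing to the same conclusion.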