I have some items described by 43 categories like this:
Dataset Item Category1 Category2... Category43
D1 Item1 1 0 0 ... 1 ...
D1 Item3 1 0 0 ... 1 ...
D2 Item4 1 0 0 ... 1 ...
..
What I did was create a frequency table like this:
Dataset Category1 Category2... Category43
D1 617 388 ... 827
D2 1234 7272 ... 1237
I am testing to see if there is a relationship between the dataset type and the category frequency counts.
I have the following data as the output of dput:
structure(list(data.OldFrequency = c(617L, 388L, 6L, 9L, 1344L,
857L, 30L, 63L, 60L, 22L, 23L, 107L, 9L, 16L, 9L, 10L, 14L, 28L,
9L, 174L, 245L, 103L, 4096L, 121L, 6L, 48L, 189L, 33L, 1426L,
64L, 16L, 135L, 77L, 26L, 110L, 44L, 75L, 1610L, 1022L, 38L,
1578L, 242L, 67L), data.NewFrequency = c(1220L, 959L, 307L, 29L,
5093L, 771L, 65L, 125L, 120L, 41L, 187L, 203L, 11L, 87L, 20L,
159L, 45L, 68L, 60L, 11L, 644L, 51L, 7053L, 159L, 6L, 162L, 208L,
52L, 3277L, 27L, 594L, 79L, 95L, 119L, 96L, 84L, 180L, 2991L,
2227L, 34L, 2249L, 37L, 29L)), .Names = c("data.OldFrequency",
"data.NewFrequency"), row.names = c(NA, -43L), class = "data.frame")
Running chisq.test on this gives me the following:
Pearson's Chi-squared test
data: d
X-squared = 2551.405, df = 42, p-value < 2.2e-16
Warning message:
In chisq.test(d) : Chi-squared approximation may be incorrect
I am confused about what null hypothesis this is testing and what the implications are. Can someone please help me understand how to interpret this? I am not a statistician and would love it if someone could explain this in simple words. Also, how would I fix the warning message?
Best Answer
The $\chi^2$ test you have done there is testing whether there is a relationship between new/old and $Category$. The null hypothesis is that there is no relationship, meaning the proportion of old items in each $Category$ is the same as the proportion of new items in each $Category$.
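To make the null hypothesis concrete, here is a minimal sketch using made-up counts (the `toy` table is hypothetical and is not the data from the question). The new column is exactly twice the old column, so the two columns have identical category proportions and the data fit the null hypothesis perfectly.

```r
# Hypothetical counts (NOT from the question): 3 categories, old vs new.
# The new column is exactly twice the old column.
toy <- cbind(old = c(10, 20, 30), new = c(20, 40, 60))

# Proportion of each column's items that fall in each category
# (margin = 2 makes every column sum to 1)
props <- prop.table(toy, margin = 2)
print(props)

# Because the proportions are identical, observed counts equal the
# expected counts and the chi-squared statistic is zero
toy_test <- chisq.test(toy)
print(toy_test$statistic)
```

The further the two columns of proportions drift apart, the larger the statistic becomes.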
The $\chi^2$ test is constructed by comparing the observed counts to the expected counts that would come about under that null hypothesis. For each cell, the expected count is (row total × column total) / grand total, and the test statistic that chisq.test reports is the sum over all cells of $(\text{observed} - \text{expected})^2 / \text{expected}$. This figure is then compared to the $\chi^2$ distribution with $(43-1)\times(2-1)=42$ degrees of freedom, which is the distribution it would have under the null hypothesis. Any suspiciously large figure (i.e. one unlikely to be that large under the null hypothesis) casts doubt on the null hypothesis of no relationship.

I am pretty sure the warning comes about because one of the expected values is below five. In this case I don't think I'd worry too much about that: only one of the cells has an expected value below five, and it has little influence on the large test statistic you get.
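The calculation described above can be reproduced by hand. The following is a sketch rather than the actual internals of chisq.test; the counts are copied from the dput output in the question, and the variable names (`old`, `new`, `d`, `expected`, `x2`, `small`) are my own choices.

```r
# Counts copied from the dput() output in the question
old <- c(617, 388, 6, 9, 1344, 857, 30, 63, 60, 22, 23, 107, 9, 16, 9,
         10, 14, 28, 9, 174, 245, 103, 4096, 121, 6, 48, 189, 33, 1426,
         64, 16, 135, 77, 26, 110, 44, 75, 1610, 1022, 38, 1578, 242, 67)
new <- c(1220, 959, 307, 29, 5093, 771, 65, 125, 120, 41, 187, 203, 11,
         87, 20, 159, 45, 68, 60, 11, 644, 51, 7053, 159, 6, 162, 208,
         52, 3277, 27, 594, 79, 95, 119, 96, 84, 180, 2991, 2227, 34,
         2249, 37, 29)
d <- cbind(old, new)  # 43 x 2 table of observed counts

# Expected counts under the null: row total * column total / grand total
expected <- outer(rowSums(d), colSums(d)) / sum(d)

# Pearson statistic: sum over all cells of (observed - expected)^2 / expected
x2 <- sum((d - expected)^2 / expected)

# Degrees of freedom and upper-tail p-value from the chi-squared distribution
df <- (nrow(d) - 1) * (ncol(d) - 1)
p_value <- pchisq(x2, df = df, lower.tail = FALSE)
print(c(X.squared = x2, df = df, p.value = p_value))

# Cells whose expected count is below five (the source of the warning)
small <- which(expected < 5, arr.ind = TRUE)
print(small)
print(expected[small])
```

The printed statistic should agree with the X-squared value that chisq.test reported in the question, and `small` lists the row and column of any cell that triggers the warning.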