Let's say I have the following table from a sample of 462 people:
Gender | Happy | Meh | Sad |
---|---|---|---|
Men | 70 | 32 | 120 |
Women | 100 | 30 | 110 |
I don't want to test it against the hypothesis of independence, but against the following fully specified distribution (expected counts, with hypothesized proportions in parentheses):
Gender | Happy | Meh | Sad |
---|---|---|---|
Men | 46 (0.1) | 139 (0.3) | 92 (0.2) |
Women | 46 (0.1) | 116 (0.25) | 23 (0.05) |
In R, according to the documentation, chisq.test
only tests independence when given a contingency table; the goodness-of-fit test is only available for "flat" tables, i.e. a single vector of counts. So I was thinking of simply flattening the two contingency tables and then applying a standard goodness-of-fit test, for example in R something like:
observed_data <- c(70, 32, 120, 100, 30, 110)
hypothesized_data <- c(46, 139, 92, 46, 116, 23)
hypothesized_prop <- hypothesized_data / sum(hypothesized_data)
res <- chisq.test(observed_data, p = hypothesized_prop)
# results in a chi-square statistic (res$statistic) of 559.6473
# now let's compute the p-value for 2 degrees of freedom:
pchisq(res$statistic, df = 2, lower.tail = FALSE)
# results in a p-value of 2.979478e-122
My questions are:
- Is this approach correct from a theoretical point of view? (happy to have comments on my code too, even if it's not the heart of my question).
- If it is correct, is it also correct to extend this approach to three-way contingency tables or more (e.g. four categorical variables: gender, mood, age group, income group)? If it's not correct for contingency tables with more than two variables, what approach would be correct for testing whether an observed distribution fits a given distribution (in cases other than independence testing)?
Regarding my second question, I've found this interesting article ("Common statistical tests are linear models"), and log-linear models may be the answer. However, the article seems to use log-linear models only to test independence, so I'm not sure how to approach this from a theoretical point of view (i.e. is it even correct to use log-linear models for this kind of question?) or from a practical point of view (i.e. is it actually possible to do it in R, Python, or other statistical tools?).
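On the log-linear point: the likelihood-ratio (G) statistic is the log-linear counterpart of Pearson's chi-squared goodness-of-fit statistic, and when the hypothesized distribution is fully specified it can be computed directly, without fitting any model. A minimal sketch in R using the counts from the question (the hand-rolled G computation is my own illustration, not taken from the article):

```r
observed <- c(70, 32, 120, 100, 30, 110)
expected <- c(46, 139, 92, 46, 116, 23)  # hypothesized counts; same total (462)

# Likelihood-ratio goodness-of-fit statistic: G = 2 * sum(O * log(O / E)).
# Like Pearson's X^2, it is asymptotically chi-squared with k - 1 df
# when the hypothesized distribution is fully specified in advance.
G <- 2 * sum(observed * log(observed / expected))
p_value <- pchisq(G, df = length(observed) - 1, lower.tail = FALSE)
```

With these data, G is of the same order as the Pearson statistic and the p-value is similarly minuscule, so the two tests agree that the data are far from the hypothesized distribution.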
Thanks,
Best Answer
In brief, you can use the chi-squared goodness-of-fit test to test whether the data were generated from a hypothesized distribution. There's no need to formulate the data as a contingency table, because you're not using the marginal probabilities. You've illustrated this with a hypothesized distribution over six categories, but the same approach works with nine, sixteen, or more categories.
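For instance, a three-way table flattens in exactly the same way; the counts and proportions below are invented purely for illustration:

```r
# Hypothetical 2 x 3 x 2 table (e.g. gender x mood x age group); counts invented.
obs_array <- array(c(10, 12, 8, 15, 9, 11,
                     14, 7, 13, 10, 6, 9),
                   dim = c(2, 3, 2))

# Flatten to a single vector of 12 cell counts.
obs_flat <- as.vector(obs_array)

# Fully specified hypothesized cell probabilities (must sum to 1);
# here simply uniform over the 12 cells, again just for illustration.
hyp_prop <- rep(1 / 12, 12)

res <- chisq.test(obs_flat, p = hyp_prop)
# res$parameter gives the degrees of freedom: 12 - 1 = 11.
```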
For the goodness-of-fit test, the degrees of freedom are k - 1, where k is the number of categories. So with six categories you have five (not two) degrees of freedom. `chisq.test` in R reports the degrees of freedom and the corresponding p-value in its output, so there's no need to call `pchisq` yourself.
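Concretely, rerunning your example and reading the results straight off the `chisq.test` output:

```r
observed <- c(70, 32, 120, 100, 30, 110)
hyp_prop <- c(46, 139, 92, 46, 116, 23) / 462

res <- chisq.test(observed, p = hyp_prop)
res$statistic  # X-squared = 559.6473, matching your computation
res$parameter  # df = 5 (six categories minus one), not 2
res$p.value    # already computed with the correct df; no pchisq() needed
```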
I don't know whether what you're doing is theoretically correct; that would probably depend on how you came up with your hypothesized distribution.