Goodness-of-Fit Test – How to Perform for Contingency Tables

chi-squared-test, contingency-tables, goodness-of-fit, hypothesis-testing, log-linear

Let's say I have the following table from a sample of 462 people:

Gender   Happy   Meh   Sad
Men         70    32   120
Women      100    30   110

I don't want to test it against the hypothesis of independence, but against the following hypothesized distribution (expected counts, with the corresponding proportions in parentheses):

Gender   Happy       Meh          Sad
Men      46 (0.1)    139 (0.3)    92 (0.2)
Women    46 (0.1)    116 (0.25)   23 (0.05)

In R, according to the documentation, chisq.test only performs a test of independence when given a contingency table; the goodness-of-fit test is only available for "flat" (one-dimensional) tables. So I was thinking of simply flattening the two tables into vectors and then applying a standard goodness-of-fit test, for example in R something like:

observed_data <- c(70, 32, 120, 100, 30, 110)
hypothesized_data <- c(46, 139, 92, 46, 116, 23)
hypothesized_prop <- hypothesized_data / sum(hypothesized_data)

res <- chisq.test(observed_data, p = hypothesized_prop)
# results in a chi-squared statistic (res$statistic) of 559.6473

# now let's compute the p-value for 2 degrees of freedom:
pchisq(res$statistic, df = 2, lower.tail = FALSE)
# results in a p-value of 2.979478e-122

My questions are:

  1. Is this approach correct from a theoretical point of view? (happy to have comments on my code too, even if it's not the heart of my question).
  2. If it is correct, is it also correct to extend this approach to three-way contingency tables or more (e.g. four categorical variables: gender, mood, age group, income group)? If it is not correct for contingency tables with more than two variables, what approach would be correct for testing whether an observed distribution fits a given distribution (in cases other than independence testing)?

Regarding my second question, I've found this interesting article ("Common statistical tests are linear models"), and log-linear models may be the answer. However, the article seems to use log-linear models only to test independence, so I'm not sure how to approach this from a theoretical point of view (i.e. is it even correct to use log-linear models for this kind of question?) or from a practical point of view (i.e. is it actually possible to do it in R, Python, or other statistical tools?).
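For concreteness, here is roughly what I was imagining: an intercept-only Poisson GLM with the hypothesized counts as an offset, where (if I understand correctly) the residual deviance plays the role of a likelihood-ratio goodness-of-fit statistic on n - 1 degrees of freedom. This is only a sketch and I'm not sure it's the right formulation:

observed <- c(70, 32, 120, 100, 30, 110)
hypothesized <- c(46, 139, 92, 46, 116, 23)   # same total (462) as the observed counts

# intercept-only Poisson model with the hypothesized counts as an offset;
# the residual deviance is then a G^2-style goodness-of-fit statistic
fit <- glm(observed ~ 1 + offset(log(hypothesized)), family = poisson)

# p-value from the residual deviance on its residual degrees of freedom (here 5)
pchisq(fit$deviance, df = fit$df.residual, lower.tail = FALSE)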

Thanks,

Best Answer

In brief, you can use the chi-squared goodness-of-fit test to test whether the data were generated from a hypothesized distribution. There's no need to formulate the data as a contingency table, because you're not using the marginal probabilities. You've illustrated this with a hypothesized distribution over six categories, but the same approach works with nine categories, sixteen categories, or more.

For the goodness-of-fit test, the degrees of freedom are n - 1, where n is the number of categories. So with six categories you have five (not two) degrees of freedom. chisq.test in R reports these degrees of freedom in its output, though, so there is no need to call pchisq yourself.

observed_data <- c(70, 32, 120, 100, 30, 110)
hypothesized_data <- c(46, 139, 92, 46, 116, 23)
hypothesized_prop <- hypothesized_data / sum(hypothesized_data)

chisq.test(observed_data, p = hypothesized_prop)
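To see the point about the degrees of freedom concretely, you can save the result and recompute the p-value with df = 5; it matches the p-value that chisq.test already reports (just a small check, reusing your variable names from above):

res <- chisq.test(observed_data, p = hypothesized_prop)
res$parameter   # degrees of freedom used by the test: 5
# recomputing the p-value with n - 1 degrees of freedom reproduces res$p.value
pchisq(res$statistic, df = length(observed_data) - 1, lower.tail = FALSE)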

I don't know if what you're doing is theoretically correct; that would probably depend on how you've come up with your hypothesized distribution.
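As for extending this to three or more categorical variables (your second question), the same logic applies: the test only sees the individual cell counts, so you can flatten a multi-way table into a vector and supply a hypothesized probability for each cell. A minimal sketch with made-up counts for a hypothetical 2 x 3 x 2 table (gender x mood x age group), using a uniform hypothesized distribution purely for illustration:

observed_3way <- array(
  c(20, 30, 10, 12, 40, 60, 50, 40, 22, 18, 80, 50),   # made-up counts
  dim = c(2, 3, 2),
  dimnames = list(gender = c("Men", "Women"),
                  mood = c("Happy", "Meh", "Sad"),
                  age = c("Young", "Old"))
)

# any fully specified hypothesized distribution over the 12 cells
hypothesized_prop_3way <- rep(1 / 12, 12)

# flatten to a vector and run the same goodness-of-fit test; df = 12 - 1 = 11
chisq.test(as.vector(observed_3way), p = hypothesized_prop_3way)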