Goodness-of-Fit Test – How to Perform for Contingency Tables

chi-squared-test, contingency-tables, goodness-of-fit, hypothesis-testing, log-linear

Let's say I have the following table from a sample of 462 people:

Gender   Happy   Meh   Sad
Men         70    32   120
Women      100    30   110

I don't want to test it against the hypothesis of independence, but against the following hypothesized distribution (expected counts, with the corresponding proportions in parentheses):

Gender   Happy       Meh          Sad
Men      46 (0.1)    139 (0.3)    92 (0.2)
Women    46 (0.1)    116 (0.25)   23 (0.05)

In R, according to the documentation, chisq.test only performs a test of independence when given a contingency table; the goodness-of-fit test is only available for "flat" (one-dimensional) tables. So I was thinking of simply flattening the two tables into vectors and then applying a standard goodness-of-fit test, for example in R something like:

observed_data <- c(70, 32, 120, 100, 30, 110)
hypothesized_data <- c(46, 139, 92, 46, 116, 23)
hypothesized_prop <- hypothesized_data / sum(hypothesized_data)

res <- chisq.test(observed_data, p = hypothesized_prop)
# results in a chi-squared statistic (res$statistic) of 559.6473

# now let's compute the p-value for 2 degrees of freedom:
pchisq(res$statistic, df = 2, lower.tail = FALSE)
# results in a p-value of 2.979478e-122

My questions are:

  1. Is this approach correct from a theoretical point of view? (happy to have comments on my code too, even if it's not the heart of my question).
  2. If it is correct, is it also correct to extend this approach to three-way contingency tables or more (e.g. four categorical variables: gender, mood, age group, income group)? If it is not correct for contingency tables with more than two variables, what approach would be correct for testing whether an observed distribution fits a given distribution (in cases other than independence testing)?

Regarding my second question, I've found this interesting article ("Common statistical tests are linear models"), and log-linear models may be the answer. However, the article seems to use log-linear models only to test independence, so I'm not sure how to approach this from a theoretical point of view (i.e. is it even correct to use log-linear models for this kind of question?) or from a practical point of view (i.e. is it actually possible to do it in R, Python, or other statistical tools?).
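For concreteness, here is roughly what I was imagining: an intercept-only Poisson GLM with the hypothesized counts as an offset, where (if I understand correctly) the residual deviance plays the role of a likelihood-ratio goodness-of-fit statistic on n - 1 degrees of freedom. This is only a sketch and I'm not sure it's the right formulation:

observed <- c(70, 32, 120, 100, 30, 110)
hypothesized <- c(46, 139, 92, 46, 116, 23)   # same total (462) as the observed counts

# intercept-only Poisson model with the hypothesized counts as an offset;
# the residual deviance is then a G^2-style goodness-of-fit statistic
fit <- glm(observed ~ 1 + offset(log(hypothesized)), family = poisson)

# p-value from the residual deviance on its residual degrees of freedom (here 5)
pchisq(fit$deviance, df = fit$df.residual, lower.tail = FALSE)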

Thanks,

Best Answer

In brief, you can use the chi-squared goodness-of-fit test to test whether the data were generated from a hypothesized distribution. There's no need to formulate the data as a contingency table, because you're not using the marginal probabilities. You've illustrated this with a hypothesized distribution over six categories, but the same approach works with nine categories, sixteen categories, or more.

For the goodness-of-fit test, the degrees of freedom are n - 1, where n is the number of categories. So with six categories you have five (not two) degrees of freedom. chisq.test in R reports these degrees of freedom in its output, though, so there is no need to call pchisq yourself.

observed_data <- c(70, 32, 120, 100, 30, 110)
hypothesized_data <- c(46, 139, 92, 46, 116, 23)
hypothesized_prop <- hypothesized_data / sum(hypothesized_data)

chisq.test(observed_data, p = hypothesized_prop)
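To see the point about the degrees of freedom concretely, you can save the result and recompute the p-value with df = 5; it matches the p-value that chisq.test already reports (just a small check, reusing your variable names from above):

res <- chisq.test(observed_data, p = hypothesized_prop)
res$parameter   # degrees of freedom used by the test: 5
# recomputing the p-value with n - 1 degrees of freedom reproduces res$p.value
pchisq(res$statistic, df = length(observed_data) - 1, lower.tail = FALSE)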

I don't know if what you're doing is theoretically correct; that would probably depend on how you've come up with your hypothesized distribution.
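As for extending this to three or more categorical variables (your second question), the same logic applies: the test only sees the individual cell counts, so you can flatten a multi-way table into a vector and supply a hypothesized probability for each cell. A minimal sketch with made-up counts for a hypothetical 2 x 3 x 2 table (gender x mood x age group), using a uniform hypothesized distribution purely for illustration:

observed_3way <- array(
  c(20, 30, 10, 12, 40, 60, 50, 40, 22, 18, 80, 50),   # made-up counts
  dim = c(2, 3, 2),
  dimnames = list(gender = c("Men", "Women"),
                  mood = c("Happy", "Meh", "Sad"),
                  age = c("Young", "Old"))
)

# any fully specified hypothesized distribution over the 12 cells
hypothesized_prop_3way <- rep(1 / 12, 12)

# flatten to a vector and run the same goodness-of-fit test; df = 12 - 1 = 11
chisq.test(as.vector(observed_3way), p = hypothesized_prop_3way)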