Solved – Difference between p-value of chi-squared calculation, and p-value in “critical value” lookup table

chi-squared-test, scipy

I am learning about the chi-squared test. As taught, it has two steps: a) the chi-squared calculation and b) the critical value lookup.

When I calculate chi-squared, the output includes a) chi-squared statistic (220.5) and b) a p-value (1.315…e-48):

from scipy import stats

chisquare_result = stats.chisquare(df['observed'], df['expected'])
chisquare_result
# Output : Power_divergenceResult(statistic=220.5, pvalue=1.3153258948574585e-48)

When I look up the critical value, my course materials describe the inputs as a) degrees of freedom and b) a concept described as a p-value (where we used the typical threshold of 0.05).

# Desired p-value is 0.05, and (1-0.05) = 0.95
p_value = 0.05
critical_value = stats.chi2.ppf(q=(1-p_value), df=2)
# Output: 5.991464547107979

My questions:

1. Are the two "p-values" I mention above both truly p-values?

   a) If they are both p-values, what's the difference between them, and why don't they have the same value?

   b) If they are not both p-values, what's the difference between them?

2. If I can get a p-value as output when calculating chi-squared, why do I need to calculate a critical value?

   a) In other words, why isn't the p-value from the chi-squared calculation sufficient to reject or fail to reject the null hypothesis?

Best Answer

5.99 is the critical value for 2 degrees of freedom and p = 0.05. So any value of $\chi^2$ greater than 5.99 is significant at the 0.05 level.

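220.5 is the critical value for 2 degrees of freedom and p = $1.315 \times 10^{-48}$, so any value greater than 220.5 is significant at that level.

The difference is that in one case you specify the level first (0.05) and ask what value of $\chi^2$ is significant beyond that level; in the other case you start from your obtained value of $\chi^2$ and ask for which level it would have been the critical value. The first number is a significance level chosen in advance, and the second is the p-value computed from the data. The two decision rules are equivalent: $\chi^2$ exceeds the critical value exactly when the p-value is below the chosen level, so the p-value alone is sufficient to reject or fail to reject the null hypothesis.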
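The two directions (level to critical value, and statistic to p-value) can be checked numerically. The following is a minimal sketch, assuming SciPy is available; the variable names `alpha`, `observed_stat` and `dof` are illustrative and do not come from the original post. It uses `chi2.sf` (the upper-tail probability) and `chi2.isf` (its inverse).

```python
from scipy.stats import chi2

alpha = 0.05           # significance level chosen in advance
observed_stat = 220.5  # chi-squared statistic reported by stats.chisquare
dof = 2                # degrees of freedom used in the question

# Level -> critical value (equivalent to chi2.ppf(1 - alpha, dof))
critical_value = chi2.isf(alpha, dof)

# Observed statistic -> p-value (upper-tail probability)
p_value = chi2.sf(observed_stat, dof)

# Feeding the p-value back in as a "level" recovers the observed statistic
implied_critical = chi2.isf(p_value, dof)

# Both decision rules lead to the same conclusion
reject_by_critical_value = observed_stat > critical_value
reject_by_p_value = p_value < alpha

print(critical_value, p_value, implied_critical)
print(reject_by_critical_value, reject_by_p_value)
```

`chi2.isf` is used for the round trip because `1 - p_value` rounds to exactly 1.0 in double precision for a p-value this small, so `chi2.ppf(1 - p_value, dof)` would return infinity.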

Test differences

There are two differences between the two tests used:

1. The use of likelihood ratio tests versus Wald tests
2. The use of sequential tests versus tests for the effect of one variable given the other variables

Since your example data set is huge (1822 complete observations, with 897 events) the first difference doesn’t matter much, so let’s first look at the second difference.

Sequential tests versus tests of one variable given the others

Note that the output from running anova() on the coxph model says Terms added sequentially (first to last). This means that for the first variable, age, we simply test if age is a statistically significant predictor without looking at any other variables. Basically, we test if the model including age fits the data better than the null model with no explanatory variables (a Cox model has no intercept term), using a likelihood ratio test (which we can do, since the models are nested). This should give the same result as

anova(coxph(Surv(time, status) ~ age, data=d))

(The actual results differ slightly, because of missing data in the other explanatory variables. If you remove the observations with missing data, you will get the exact same answer.)

For the second variable, sex, we test if sex is statistically significant given age; we compare a model containing only age with one containing both age and sex.

For the third variable, nodes, we test if nodes is statistically significant given both age and sex; we compare a model containing both age and sex with one containing age, sex and nodes. This is the only test where we can compare the result to the one from anova(m1).

Getting tests of one variable given the others for `coxph` models

To get test results from the coxph models that are comparable to the ones from the cph models in general, we have several options. One simple method is to use drop1() to compare the full model (three predictors) with models containing all predictors except one, using a likelihood ratio test. First, to avoid problems with the number of observations differing depending on which variables we include, we refit the models on the complete data:

d.comp = na.omit(d[c("time","status","age","sex","nodes")])
m2.comp = update(m2, data=d.comp)

Now we drop each predictor in turn:

drop1(m2.comp, test="Chisq")

and get

       Df   AIC     LRT Pr(>Chi)    
<none>    12720                     
age     1 12718   0.031   0.8611    
sex     1 12719   0.929   0.3351    
nodes   1 12851 132.868   <2e-16 ***

As you see, the results are very similar to the ones from the Wald tests from cph.
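The Pr(>Chi) column is the upper tail of a chi-squared distribution with Df degrees of freedom, evaluated at the LRT statistic. As a rough check, here is a minimal sketch in Python, assuming SciPy is available. It uses the rounded LRT values copied from the table above, so the recomputed p-values agree with the table only to about two or three decimals.

```python
from scipy.stats import chi2

# (LRT statistic, Df) copied from the drop1() table above (rounded values)
lrt = {"age": (0.031, 1), "sex": (0.929, 1), "nodes": (132.868, 1)}

# p-value = P(chi-squared with Df degrees of freedom > LRT)
p_values = {name: chi2.sf(stat, df) for name, (stat, df) in lrt.items()}

for name, p in p_values.items():
    print(f"{name:5s} LRT={lrt[name][0]:8.3f}  p={p:.4g}")
```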

Wald tests?

So what are the Wald tests? Basically, since all predictors are continuous, they’re just normal, asymptotic z-tests, but with squared test statistics. That is, each test statistic is the square of the $z$ statistic from summary(m2.comp) (and the $z$ statistic is the estimated coefficient divided by its standard error). Example:

summary(m2.comp)

            coef  exp(coef)   se(coef)      z Pr(>|z|)    
age    0.0004934  1.0004936  0.0028216  0.175    0.861    
sex   -0.0645554  0.9374842  0.0669405 -0.964    0.335    
nodes  0.0872323  1.0911501  0.0063330 13.774   <2e-16 ***

The $z$-statistic of sex is $-0.0645554/0.0669405=-0.964$, and $(-0.964)^2=0.93$, which is the chi-square statistic of the Wald test of the sex predictor from the cph model. (For factors and nonlinear variables, the calculations are slightly more complex, taking the correlation between the estimators of the (dummy/transformed) variables used to represent the factor / nonlinear effect into account.)
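The same arithmetic can be repeated for all three predictors. The following is a minimal sketch in Python, assuming SciPy is available; the coefficients and standard errors are copied from the summary(m2.comp) output above, and the dictionary names are illustrative only.

```python
from scipy.stats import chi2

# (coef, se(coef)) copied from the summary(m2.comp) output above
estimates = {
    "age": (0.0004934, 0.0028216),
    "sex": (-0.0645554, 0.0669405),
    "nodes": (0.0872323, 0.0063330),
}

wald = {}
for name, (coef, se) in estimates.items():
    z = coef / se                 # asymptotic z statistic
    chisq = z ** 2                # Wald chi-squared statistic, 1 df
    p = chi2.sf(chisq, 1)         # identical to the two-sided z-test p-value
    wald[name] = (z, chisq, p)
    print(f"{name:5s} z={z:7.3f}  chisq={chisq:7.2f}  p={p:.3g}")
```

Squaring $z$ and referring it to a chi-squared distribution with one degree of freedom gives the same p-value as the two-sided $z$-test, which is why the two outputs agree.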

Which tests to use?

Both sequential tests and tests of one variable given the others make sense, but they test different hypotheses. The former basically asks ‘if I add this new predictor, does it improve the fit?’ iteratively, for an ordered list of potential predictors. The latter asks ‘given that I include all other predictors, does adding this one improve the fit?’.

Wald tests versus likelihood ratio tests

The other difference between the two tests, i.e., difference 1 mentioned above, is the difference between asymptotic Wald tests (basically relying on the central limit theorem – that you have enough observations that test statistics are approximately normally distributed) and (partial-)likelihood ratio tests (LRTs). For small data sets, the results can differ somewhat. (And even here, the test statistic for the nodes variable is quite different.) Usually, likelihood ratio tests are preferred.
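To see how the two kinds of statistics can diverge in small samples and converge in large ones, here is a minimal sketch in Python. It is not the Cox model from this answer: as a stand-in it assumes a much simpler exponential survival model with a single rate, where both statistics have closed forms, and all the numbers (event counts, follow-up times, null rate) are hypothetical.

```python
import math

def wald_and_lr(events, total_time, rate0):
    """Wald and LR chi-squared statistics for H0: rate = rate0 in an
    exponential survival model, parametrised by the log rate."""
    rate_hat = events / total_time                 # maximum likelihood estimate
    beta_diff = math.log(rate_hat) - math.log(rate0)
    se = 1 / math.sqrt(events)                     # asymptotic se of the log rate
    wald = (beta_diff / se) ** 2
    # 2 * (loglik at the estimate - loglik under H0),
    # with loglik(rate) = events * log(rate) - rate * total_time
    lr = 2 * (events * beta_diff - (rate_hat - rate0) * total_time)
    return wald, lr

# Hypothetical data: few events with a large effect, many events with a small one
small = wald_and_lr(events=10, total_time=5.0, rate0=1.0)
large = wald_and_lr(events=1000, total_time=940.0, rate0=1.0)

for label, (wald, lr) in (("small", small), ("large", large)):
    print(f"{label}: Wald={wald:.3f}  LR={lr:.3f}")
```

In this toy setting the two statistics differ by roughly a quarter for the small sample and by only a few percent for the large one, even though both are near the usual 0.05 threshold.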

And if you want to compare the Wald and the LRT tests on the same model fitted using ‘coxph()’ (or other normal regression functions), it’s very easy to do using the car package:

library(car)
Anova(m2.comp, test.statistic="Wald") # Equal to anova(m1)
Anova(m2.comp, test.statistic="LR")   # Equal to drop1(m2.comp, test="Chisq")

which gives us:

# LR
      LR Chisq Df Pr(>Chisq)    
age        0.0  1       0.86    
sex        0.9  1       0.34    
nodes    132.9  1     <2e-16 ***

# Wald
            Df  Chisq Pr(>Chisq)    
age          1   0.03       0.86    
sex          1   0.93       0.33    
nodes        1 189.73     <2e-16 ***

Not surprisingly, the $p$-values are (for any practical use) identical.
