Solved – Ordinal Logistic Regression – Strange Results

ordered-logit, ordinal-data, regression

I've been struggling with the analysis of the ordinal categorical data that I have (against 3 binary categorical explanatory variables). The problem is that for one of my data sets, running ordinal logistic regression in SPSS gives an unusual result, which raises questions about the validity of this statistical testing approach for my data:

SPSS Parameter Estimates for Ordinal Logistic Regression

Full SPSS syntax and output can be accessed here

The question arises: why is no statistical result given for the non-redundant level of the age variable? The standard error is 0, no confidence interval exists, and the logit estimate is extremely high… Exponentiating the logit gives a ridiculously high odds ratio… It's like a perfect result with 100% certainty. Is this statistical test and its outcome valid? Can someone explain what is happening in layman's terms? Should I be using a different statistical test than ordinal logistic regression? And please suggest how I should analyse and report it for a scientific publication.

The raw data can be viewed for further insight:

[image: raw data table]

I have also made available the SPSS data file in case you wanted to run the test for yourself:

Access here: SPSS .sav file

I have also attached a visualisation of the data as stacked columns, to help you quickly familiarise yourself with the character of my data.

[image: stacked column chart of the data]

Best Answer

The problem here is what is called separation. For one level of your predictor (age) you have zero occurrences of one level of your outcome (collagen). The program is driving its estimate of the coefficient upwards to infinity and in this case stopping at an arbitrary value which, as you point out, when exponentiated gives you a very large odds ratio. What you do about it is really a substantive question to do with the underlying science rather than a statistical one.
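To see what separation does to the estimates, here is a toy illustration in R (simulated data, not yours) using an ordinary binary logistic regression; the ordinal case behaves analogously:

# Toy example: the outcome is completely determined by the predictor,
# so maximum likelihood pushes the coefficient towards infinity.
x <- rep(c("old", "young"), each = 10)
y <- c(rep(1, 10), rep(0, 10))      # y is 1 exactly when x == "old"
fit.sep <- glm(y ~ x, family = binomial)
summary(fit.sep)                    # enormous coefficient and standard error,
                                    # plus a "fitted probabilities numerically
                                    # 0 or 1 occurred" warning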

Added as edit

You ask what separation means. In logistic regression it means that for some value of the predictor (or some linear combination of the predictors) only one value of the outcome occurs. You have a variant of this because you are using an ordinal model rather than binary logistic regression. What you do about it is a matter of your underlying science. Finding a perfect predictor may be really desirable for you, or you may feel that this is just a chance finding caused by your sample size. There are methods for logistic regression that cope with separation (exact logistic regression, Firth's bias-corrected method), but I do not know whether they are available for ordinal models.
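For the binary case, Firth's penalised likelihood is available in R through the logistf package; I am not aware of a drop-in ordinal equivalent, so treat this only as a sketch of the binary situation using made-up, perfectly separated data:

# Firth's bias-reduced logistic regression (binary outcome only) gives a
# finite estimate and a usable profile-likelihood CI even under separation.
# install.packages("logistf")
library(logistf)
toy <- data.frame(x = rep(c("old", "young"), each = 10),
                  y = c(rep(1, 10), rep(0, 10)))   # perfectly separated
fit.firth <- logistf(y ~ x, data = toy)
summary(fit.firth)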

Further addition

Investigating this further with two R packages (I do not use SPSS) reveals that they agree on there being a problem but give slightly different output. The function polr from MASS basically agrees with the output in the original question and inspecting the variance-covariance matrix of the coefficients shows that the estimated correlation between age and one of the intercepts is close to 1 in absolute value.

> fit.polr <- polr(coll ~ age + treat + gene, data = res, Hess = TRUE)
> summary(fit.polr)
Call:
polr(formula = coll ~ age + treat + gene, data = res, Hess = TRUE)

Coefficients:
            Value Std. Error  t value
ageyoung -18.1584     0.2874 -63.1855
treatNMN  -0.4546     0.7457  -0.6096
geneig     0.4464     0.5823   0.7667

Intercepts:
    Value    Std. Error t value 
0|1 -17.4775   0.2871   -60.8811
1|2   0.7467   0.5709     1.3080

Residual Deviance: 68.77266 
AIC: 78.77266 
> cov2cor(vcov(fit.polr))
            ageyoung    treatNMN     geneig         0|1        1|2
ageyoung  1.00000000 -0.06065359 -0.5308460 -0.99998399 -0.3224864
treatNMN -0.06065359  1.00000000  0.1169754  0.06384212  0.5512079
geneig   -0.53084598  0.11697536  1.0000000  0.53346172  0.6110366
0|1      -0.99998399  0.06384212  0.5334617  1.00000000  0.3278369
1|2      -0.32248636  0.55120792  0.6110366  0.32783686  1.0000000

The function clm from the ordinal package rejects the model on the grounds that the Hessian is singular. Using @FrankHarrell's suggestion to investigate the LRT, we find that age is indeed significant.

> fit.clm <- clm(coll ~ age + treat + gene, data = res)
> summary(fit.clm)
formula: coll ~ age + treat + gene
data:    res

 link  threshold nobs logLik AIC   niter max.grad cond.H 
 logit flexible  54   -34.39 78.77 21(0) 5.63e-09 4.5e+09

Coefficients:
         Estimate Std. Error z value Pr(>|z|)
ageyoung -23.3992         NA      NA       NA
treatNMN  -0.4546         NA      NA       NA
geneig     0.4465         NA      NA       NA

Threshold coefficients:
    Estimate Std. Error z value
0|1 -22.7184         NA      NA
1|2   0.7467         NA      NA
> drop1(fit.clm, test = "Chisq")
Single term deletions

Model:
coll ~ age + treat + gene
       Df     AIC    LRT  Pr(>Chi)    
<none>      78.773                     
age     1 109.773 33.000 9.215e-09 ***
treat   1  77.151  0.378    0.5387    
gene    1  77.365  0.592    0.4417    
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Note that a LRT for age is possible here.
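The same LRT can be obtained by fitting the model with and without age and comparing the two fits directly; a sketch, assuming the same res data frame as above:

# Likelihood-ratio test for age via two nested clm fits; anova() on
# clm objects reports the LRT, matching the drop1() output above.
library(ordinal)
fit.full  <- clm(coll ~ age + treat + gene, data = res)
fit.noage <- clm(coll ~ treat + gene, data = res)
anova(fit.noage, fit.full)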

My feeling is that there is insufficient information in the data set to fit the model with any confidence. If we had a single observation for age == old & coll == 0 then both functions would fit the model (and give identical results), so the problem here is the zero cells for both (age == old & coll == 0) and (age == young & coll == 2).
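A quick way to see those empty cells in the data themselves (assuming the variable names used in the fits above) is a simple cross-tabulation:

# Cross-tabulate predictor against outcome; the zero cells are the
# combinations that cause the separation.
with(res, table(age, coll))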
