Solved – Skewed Distributions for Logistic Regression

logistic r regression splines

I have been developing a logistic regression model based on retrospective data from a national trauma database of head injury in the UK. The key outcome is 30-day mortality (the Outcome30 measure). Other measures across the whole database, with published evidence from previous studies of a significant effect on outcome, include:

Yeardecimal - Date of procedure = 1994.0-2013.99
inctoCran - Time from head injury to craniotomy in minutes = 0-2880 (After 2880 minutes is defined as a separate diagnosis)
ISS - Injury Severity Score = 1-75
Age - Age of patient = 16.0-101.5
GCS - Glasgow Coma Scale = 3-15
Sex - Gender of patient = Male or Female
rcteyemi - Pupil reactivity (1 = neither, 2 = one, 3 = both)
neuroFirst2 - Location of admission (Neurosurgical unit or not)
Other - other trauma (0 - No, 1 - Yes)
othopYN - Other operation required
LOS - Length of stay in days
LOSCC - Length of stay in critical care in days 

For the univariate analysis, I fitted a logistic regression for each continuous variable. However, I am unable to model Yeardecimal as a simple linear term; the fit fails with the following result:

> rcs.ASDH<-lrm(formula = Survive ~ Yeardecimal, data = ASDH_Paper1.1)
singular information matrix in lrm.fit (rank= 1 ).  Offending variable(s):
Yeardecimal 
Error in lrm(formula = Survive ~ Yeardecimal, data = ASDH_Paper1.1) : 
  Unable to fit model using “lrm.fit”

However, the restricted cubic spline works:

> rcs.ASDH<-lrm(formula = Survive ~ rcs(Yeardecimal), data = ASDH_Paper1.1)
> 
> rcs.ASDH

Logistic Regression Model

lrm(formula = Survive ~ rcs(Yeardecimal), data = ASDH_Paper1.1)

                      Model Likelihood     Discrimination    Rank Discrim.    
                         Ratio Test            Indexes          Indexes       
Obs          5998    LR chi2     106.61    R2       0.027    C       0.578    
 0           1281    d.f.             4    g        0.319    Dxy     0.155    
 1           4717    Pr(> chi2) <0.0001    gr       1.376    gamma   0.160    
max |deriv| 2e-08                          gp       0.057    tau-a   0.052    
                                           Brier    0.165                     

               Coef     S.E.    Wald Z Pr(>|Z|)
Intercept      -68.3035 45.8473 -1.49  0.1363  
Yeardecimal      0.0345  0.0229  1.51  0.1321  
Yeardecimal'     0.1071  0.0482  2.22  0.0262  
Yeardecimal''   -2.0008  0.6340 -3.16  0.0016  
Yeardecimal'''  11.3582  4.0002  2.84  0.0045  

Could anyone explain why this is? I am nervous about using a more complicated model if I am unable to fit the simpler one.

I am currently using restricted cubic splines to model Age, ISS and Yeardecimal. Would anyone recommend any alternative approach?

Best Answer

The date may be failing as a predictor because it is highly collinear with the constant. If you enter it as a year, its relative variability is about 10/2000 = 0.005 (in fact less, because most of your data are in the more recent years; say around 0.002), and when squared it becomes about 4e-6. When inverting a matrix with eigenvalues of roughly 1 and 4e-6, the package that you use may decide that the smaller one is zero in finite-precision arithmetic, and throw this error message. The solution is simple: center your data, at least approximately, by subtracting 2000 from the year.
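
To make the conditioning argument concrete, here is a minimal sketch in base R. It uses simulated years over the same 1994.0-2013.99 range rather than the ASDH data, and the names year, X_raw, X_centered, kappa_raw and kappa_centered are made up for illustration. It only compares the condition number of X'X with and without subtracting 2000; it does not reproduce the exact singularity check inside lrm.fit.

```r
# Hypothetical illustration with simulated years, not the ASDH data.
set.seed(1)
year <- runif(1000, min = 1994, max = 2013.99)

# Design matrices: an intercept column plus the year, raw vs. centered.
X_raw      <- cbind(1, year)
X_centered <- cbind(1, year - 2000)

# Condition number of the cross-product matrix X'X. The information
# matrix of a logistic regression is a weighted version of this matrix,
# so a badly conditioned X'X is a warning sign for the fit.
kappa_raw      <- kappa(crossprod(X_raw), exact = TRUE)
kappa_centered <- kappa(crossprod(X_centered), exact = TRUE)

# The raw-year matrix should be many orders of magnitude worse conditioned.
print(c(raw = kappa_raw, centered = kappa_centered))
```

On the real data, the same idea would presumably amount to adding a centered column, for example ASDH_Paper1.1$YearC <- ASDH_Paper1.1$Yeardecimal - 2000, and then fitting lrm(Survive ~ YearC, data = ASDH_Paper1.1). YearC is a hypothetical name, and this has not been run against the database.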