Solved – Investigating robustness of logistic regression against violation of linearity of logit

Tags: assumptions, logistic, references, regression, robust

I am conducting a logistic regression with a binary outcome (start vs. not start). My predictors are a mix of continuous and dichotomous variables.

A Box-Tidwell test suggests that one of my continuous predictors may violate the assumption of linearity of the logit. There is no indication from goodness-of-fit statistics that model fit is problematic.
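For context, the Box-Tidwell check amounts to adding an $x \log(x)$ term to the model and testing its coefficient; a significant term suggests the logit is nonlinear in $x$. A minimal sketch of that check, where the data frame d, the outcome start, the predictor x, and the covariates z1 and z2 are placeholder names:

# Box-Tidwell check: add x*log(x) to the model and test its coefficient;
# a significant term suggests the logit is nonlinear in x.
# d, start, x, z1, z2 are placeholder names; x must be strictly positive.
fit <- glm(start ~ x + I(x * log(x)) + z1 + z2,
           family = binomial, data = d)
summary(fit)   # inspect the p-value on the I(x * log(x)) term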

I subsequently re-ran the regression model, replacing the original continuous variable with, first, a square-root transformation and, second, a dichotomized version of the variable.
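A sketch of those two refits, using the same placeholder names (the median split is just one possible cutpoint):

# Refit with a square-root transformation of x
fit_sqrt <- glm(start ~ sqrt(x) + z1 + z2, family = binomial, data = d)

# Refit with a dichotomized version of x (median split as an example)
fit_dich <- glm(start ~ I(x > median(x)) + z1 + z2,
                family = binomial, data = d)

AIC(fit_sqrt, fit_dich)   # crude comparison of the competing fits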

On inspection of the output, goodness of fit improves marginally but the residuals become problematic. Parameter estimates, standard errors, and $\exp(\beta)$ remain similar, and the interpretation of the data with respect to my hypothesis does not change across the three models.

Therefore, in terms of the usefulness of my results and the interpretability of the data, it seems appropriate to report the regression model using the original continuous variable.

I am wondering:

  1. When is logistic regression robust against the potential violation
    of the linearity of logit assumption?
  2. Given my above example, does it seem acceptable to include the
    original continuous variable in the model?
  3. Are there any references or guides out there for recommending when
    it is satisfactory to accept that the model is robust against the
    potential violation of linearity of the logit?

Best Answer

The linearity assumption is so commonly violated in regression that it should be called a surprise rather than an assumption. Like other regression models, the logistic model is not robust to nonlinearity when you falsely assume linearity. Rather than detecting nonlinearity using residuals or omnibus goodness-of-fit tests, it is better to use direct tests: for example, expand continuous predictors using regression splines and do a composite test of all the nonlinear terms. Better still, don't test the terms at all and just expect nonlinearity. This approach is much better than trying different single-slope transformations such as square root, log, etc., because statistical inference carried out after such trial-and-error analyses will be incorrect: it does not use large enough numerator degrees of freedom to account for the transformations that were tried.

Here's an example in R.

require(rms)
# Assumes y, age, blood.pressure, sex, and height exist in the workspace
f <- lrm(y ~ rcs(age,4) + rcs(blood.pressure,5) + sex + rcs(height,4))
# Fits restricted cubic splines in 3 variables with default knot locations;
# 4, 5, 4 knots = 2, 3, 2 nonlinear terms
Function(f)   # display the algebraic form of the fit
anova(f)      # obtain individual + combined tests of linearity
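
The fit above needs the variables to exist in the workspace (or in a data frame passed via lrm's data argument). A minimal simulated setup that makes the example self-contained; the variable names match the fit, but all distributions and coefficients here are arbitrary:

# Simulated data purely to make the example runnable; the true logit
# is made deliberately nonlinear in age
set.seed(1)
n <- 500
age            <- rnorm(n, 50, 10)
blood.pressure <- rnorm(n, 120, 15)
sex            <- factor(sample(c("female", "male"), n, replace = TRUE))
height         <- rnorm(n, 170, 10)
logit <- 0.002 * (age - 50)^2 - 0.01 * (blood.pressure - 120)
y     <- rbinom(n, 1, plogis(logit))

f <- lrm(y ~ rcs(age,4) + rcs(blood.pressure,5) + sex + rcs(height,4))
anova(f)   # the nonlinear terms for age should tend to be flagged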