R – Addressing Predictor Order Change and Logistic Model Estimation Issues in GLM

logistic, multicollinearity, predictor, r, regression

I am fitting a binomial logistic regression in R using glm. By chance, I have found that if I change the order of my predictor variables, glm fails to estimate the model. The message I get is "unexpected result from lpSolveAPI for primal test".

I am using the safeBinaryRegression package, so I am confident there are no separation issues between my outcome and predictor variables. However, I am not so confident that there are no quasi-separation issues among the predictor variables themselves. Am I correct that, if this is the case, I might be running into multicollinearity, and that this is the source of glm being unable to fit the model?

If so, my question is for advice on how to approach the issue. Should I look for highly correlated predictor variables and omit one of them? Is there any convenient way of doing so for 11 categorical predictors?

What I see right now:

lModel <- glm(mob_change ~ education + gender + start_age + income + dist_change + lu_change + dou_change + marriage + student2work + wh_change,
              data = regression_data, 
              family = binomial())
# Fine, and I can inspect the model. No predictor has std. error > 1.05

# Now if I move the last variable (or any of the last three, as far as I've
# tested) to be the first predictor...
lModel.3 <- glm(mob_change ~ wh_change + gender + education + start_age + income + dist_change + lu_change + dou_change + marriage + student2work,
                data = regression_data, 
                family = binomial())

Error in separator(X, Y, purpose = "find") : 
  unexpected result from lpSolveAPI for primal test

Best Answer

The order in which the predictors are entered into the model is of course irrelevant to the question of whether there's separation in the data. The safeBinaryRegression package masks the usual glm function from the stats package (the one that fits generalized linear models), so that, for logistic regression, glm uses a linear programming algorithm to check for both complete & quasi-complete separation before trying to fit anything. If it finds separation it reports

Separation exists among the sample points.

or

The following terms are causing separation among the sample points:

depending on whether you've asked it just to test for separation or to find the predictors causing it. Otherwise it reports nothing.
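If I remember the package's interface correctly, the choice between the two is made through a separation argument to its glm; a minimal sketch, using the question's own data with a shortened formula for brevity:

library(safeBinaryRegression)  # masks stats::glm on attach

# Only test whether separation exists...
glm(mob_change ~ wh_change + gender + education,
    data = regression_data, family = binomial(),
    separation = "test")

# ...or also identify the terms causing it
glm(mob_change ~ wh_change + gender + education,
    data = regression_data, family = binomial(),
    separation = "find")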

Unexpected result from lpSolveAPI for primal test

however, is a software error message, not a statistical one. You could perhaps try on a machine with more memory, but it's probably safe to trust the results from the fits where you didn't get an error. Using stats::glm (i.e. calling the glm function from stats directly even when safeBinaryRegression is loaded) should give the same results regardless of the order of the predictors; in cases of separation it will typically report non-convergence or fitted probabilities of nought or one.
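A quick way to run that check, as a sketch with the question's own formula and data:

# Fit with the standard glm from stats, even while
# safeBinaryRegression is attached
fit <- stats::glm(mob_change ~ wh_change + gender + education + start_age +
                    income + dist_change + lu_change + dou_change +
                    marriage + student2work,
                  data = regression_data, family = binomial())

fit$converged       # FALSE is one symptom of separation
range(fitted(fit))  # fitted probabilities of (almost) exactly 0 or 1 are another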

Multicollinearity among the predictors is another issue entirely. Generalized variance inflation factors (see the vif function from the car package) are useful for assessing its extent when predictors take more than one degree of freedom, as categorical predictors with more than two levels do.
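As a sketch with the question's model: when any term takes more than one degree of freedom, vif from car returns generalized VIFs rather than ordinary ones.

library(car)

fit <- stats::glm(mob_change ~ education + gender + start_age + income +
                    dist_change + lu_change + dou_change + marriage +
                    student2work + wh_change,
                  data = regression_data, family = binomial())

# With factor predictors, vif() reports GVIF, Df, and GVIF^(1/(2*Df));
# compare the last column across terms, since it is on a common scale
# whatever the degrees of freedom
vif(fit)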

Konis, K. (2007), "Linear programming algorithms for detecting separated data in binary logistic regression models", DPhil thesis, University of Oxford.

Fox, J. & Monette, G. (1992), "Generalized collinearity diagnostics", Journal of the American Statistical Association, 87, pp. 178–183.
