Solved – Correlated features produce strange weights in Logistic Regression

Tags: logistic, machine-learning

I have a data set with highly positively correlated features that I'm classifying with logistic regression (LR). AFAIK, correlated features are not a problem for LR in the same way they are for Naive Bayes – the overcounting of evidence that occurs in Naive Bayes does not occur with LR.

The strange thing I'm seeing is that some of the highly correlated features assume opposite weights: feature A might be highly positive and feature B highly negative, though not as strongly. Is this a symptom of something going wrong with the optimization, or is it expected? (A priori, I expect both A and B to be positive class indicators.)

Best Answer

It is possible you are up against collinearity here (I'm assuming that by "correlated" you mean positively correlated; otherwise the positive/negative difference might make sense on its own). In any case, caution is warranted when confronting collinearity in logistic regression: parameter estimates can be difficult to obtain and unreliable. How severe this is depends on how highly correlated your predictors are. To rule out collinearity, you might want to check something like the Variance Inflation Factor (VIF).
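As a minimal sketch of that check, assuming your predictors are in a pandas DataFrame (the toy data below is purely illustrative), you could compute VIFs with statsmodels:

```python
import numpy as np
import pandas as pd
from statsmodels.stats.outliers_influence import variance_inflation_factor
from statsmodels.tools.tools import add_constant

# Toy data (hypothetical): feature B is feature A plus a little noise,
# so the two are highly positively correlated.
rng = np.random.default_rng(0)
a = rng.normal(size=200)
X = pd.DataFrame({"A": a, "B": a + 0.1 * rng.normal(size=200)})

# Add an intercept column so each VIF is computed against a full model.
Xc = add_constant(X)
vifs = {col: variance_inflation_factor(Xc.values, i)
        for i, col in enumerate(Xc.columns)}
print(vifs)  # VIFs well above ~5-10 for A and B are a common red flag
```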

If your variables have a high correlation coefficient but are not truly collinear, then the opposite-sign behavior you observe still isn't terribly surprising (I say this without knowing more details of your problem); it depends on what other variables are in your model. Remember that fitting an LR model fits all variables to the outcome simultaneously, so you have to interpret the weights jointly: two features can be positively correlated with each other, and each positively associated with the outcome on its own, yet have opposite partial effects once they are in the model together.
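For intuition, here is a small simulated example (hypothetical numbers, not your data) where A and B are highly positively correlated and B marginally tracks the positive class, yet the jointly fitted model gives B a negative weight because its true partial effect, conditional on A, is negative:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(1)
n = 5000
a = rng.normal(size=n)
b = 0.9 * a + 0.45 * rng.normal(size=n)   # corr(A, B) ~ 0.9

# True log-odds: positive partial effect for A, negative for B.
logits = 2.0 * a - 1.0 * b
y = rng.binomial(1, 1.0 / (1.0 + np.exp(-logits)))

print(np.corrcoef(a, b)[0, 1])   # A and B highly positively correlated
print(np.corrcoef(b, y)[0, 1])   # B still positively tracks y marginally
model = LogisticRegression(C=1e6).fit(np.column_stack([a, b]), y)
print(model.coef_)               # weight on A positive, weight on B negative
```

Here `C=1e6` makes the fit nearly unregularized, so the opposite signs come from the joint fit itself rather than from a penalty.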
