Solved – The effect of skewed continuous predictors in a binary logistic regression model

distributions, logistic, odds-ratio, skewness, splines

I am analyzing data with a binary outcome and a variety of continuous and categorical (including dichotomous) predictor variables. My approach is to fit a binary logistic regression and to treat any predictor with more than 20 unique values as continuous. The arguments against categorization, especially those well documented on Frank Harrell's site, are a good reason not to categorize.

However, at a recent meeting where I discussed my analysis approach, a faculty member suggested that I would get more accurate risk estimates if I categorized the variables that have skewed distributions and outliers. Their logic was that the tail of a skewed distribution, and the outliers in that tail, will have a detrimental effect on the risk estimates generated by the logistic regression, and that categorization will address this by erasing the effect of the tail and the skew.

I have several predictor variables that definitely have skewed distributions and some outliers. Is it true that a variable with a skewed distribution (and outliers) is more likely to produce inaccurate risk estimates than the categorized version of the same variable? How do skew and outliers in the tail affect logistic regression estimates?

Best Answer

I would get more accurate risk estimates for the data if I categorized the variables which have a skewed distribution and outliers.

That suggestion is not generally true. It may help in some cases and be detrimental in others. Categorizing a predictor means dividing it into groups at quartiles or user-defined thresholds. Categorized predictors have the advantage of fitting more flexible trends to the data. The disadvantage is that they predict the same risk for everyone within a given category and borrow no information across adjacent categories. Categorized predictors have the additional disadvantage of increasing the number of parameters and hence the risk of overfitting.
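A minimal simulation can make the "same risk within a category" point concrete. This sketch (hypothetical data, not from the question) fits a logistic regression once on the continuous predictor and once on quartile dummies, then compares predicted risks for two very different values that happen to fall in the same quartile:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
n = 2000
x = rng.lognormal(mean=0.0, sigma=0.75, size=n)   # skewed predictor
p = 1 / (1 + np.exp(-(-2.0 + 0.8 * x)))           # risk rises smoothly with x
y = rng.binomial(1, p)

# continuous fit: one coefficient, smoothly varying predicted risk
cont = LogisticRegression().fit(x.reshape(-1, 1), y)

# categorized fit: quartile dummies -> constant risk within each bin
q = np.quantile(x, [0.25, 0.5, 0.75])
bins = np.digitize(x, q)                          # bin index 0..3
dummies = np.eye(4)[bins][:, 1:]                  # drop reference category
cat = LogisticRegression().fit(dummies, y)

# two different x values, both inside the top quartile
x_lo, x_hi = q[2] * 1.01, x.max()
d = np.eye(4)[np.digitize([x_lo, x_hi], q)][:, 1:]
p_cat = cat.predict_proba(d)[:, 1]
p_cont = cont.predict_proba(np.array([[x_lo], [x_hi]]))[:, 1]
print(p_cat)    # identical: the categorized model cannot distinguish them
print(p_cont)   # differ: the continuous model predicts higher risk at x_hi
```

The categorized model assigns the value just above the 75th percentile and the most extreme outlier exactly the same predicted risk, while the continuous model separates them.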

Categorizing a risk predictor also introduces sensitivity to how the thresholds are defined. It can be difficult to prespecify clinically relevant thresholds, and thresholds defined by sample quartiles tend not to generalize well to other validation samples.
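The sample-dependence of quartile thresholds is easy to demonstrate: two samples drawn from the same skewed distribution yield different cutpoints, so the "categories" are not the same model across datasets (a hypothetical illustration, not the question's data):

```python
import numpy as np

rng = np.random.default_rng(1)
# two samples from the *same* skewed population
a = rng.lognormal(0.0, 1.0, size=300)
b = rng.lognormal(0.0, 1.0, size=300)

# quartile cutpoints are estimated from each sample separately
qa = np.quantile(a, [0.25, 0.5, 0.75])
qb = np.quantile(b, [0.25, 0.5, 0.75])
print(qa)
print(qb)   # different cutpoints, so the same subject can change category
```

A subject near a cutpoint can land in different risk categories depending on which sample defined the thresholds, which is one reason quartile-based categories validate poorly.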

Biologically, however, we would be concerned that extreme values of predictors signify biologic trends or interactions that are not captured in the risk model. For instance, blood pressures or BMI several standard deviations above the mean are no longer consistent with the additive effect on risk in intermediate ranges, but reflect exponentially growing risks for diabetes, hypertension, chronic kidney disease, and MI or stroke.

For this reason, we can use rigorous testing or inspection to assess linearity and add supplemental terms when they dramatically improve the model fit. Rather than categorizing predictors, a hybrid alternative is to include additional terms, such as both the linear and log-transformed values of a skewed predictor, or to use piecewise linear, quadratic, or cubic splines to fit trends that are curvilinear but that borrow information across the range of the predictor and predict non-constant risk for participants with different values.
