Solved – Transform continuous variables for logistic regression

data transformation, logistic, regression, skewness

I have a large survey dataset with a binary outcome variable and many explanatory variables, both binary and continuous. I am building model sets (experimenting with both GLM and mixed GLM) and using information-theoretic approaches to select the top model. I carefully examined the explanatory variables (both continuous and categorical) for correlations, and I only put two of them in the same model if their Pearson or phi correlation coefficient is less than 0.3.
I would like to give all of my continuous variables a fair chance in competing for the top model. In my experience, transforming those that need it based on skew improves the model they participate in (lower AIC).

My first question is: does this improvement happen because the transformation makes the relationship with the logit more linear? Or does correcting skew somehow improve the balance of the explanatory variables by making the data more symmetric? I wish I understood the mathematical reasons behind this, but for now, if someone could explain it in simple terms, that would be great. If you have any references I could use, I would really appreciate it.

Many internet sites say that because normality is not an assumption in binary logistic regression, you should not transform the variables. But I feel that by not transforming my variables I leave some at a disadvantage compared to others, which might affect which model comes out on top and change the inference (it usually does not, but in some datasets it does). Some of my variables perform better when log transformed, some when squared (different direction of skew), and some untransformed.

Could someone give me guidelines on what to be careful about when transforming explanatory variables for logistic regression, and, if I should not do it, explain why not?

Best Answer

You should be wary of deciding whether or not to transform the variables on statistical grounds alone. You must also look at interpretation. Is it reasonable that the logit of your response is linear in $x$? Or is it more likely to be linear in $\log(x)$? To discuss that, we would need to know your variables. Just as an example: independent of model fit, I wouldn't believe mortality to be a linear function of age!
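To make the "linear in $x$ or linear in $\log(x)$" question concrete, here is a minimal sketch in plain Python. Everything in it is an assumption made for illustration: the data are simulated so that the true logit is linear in $\log(x)$, and the names `fit_logistic`, `aic_raw` and `aic_log` are hypothetical helpers, not part of any package. It fits a one-predictor logistic regression by Newton's method on both scales and compares AIC. In this setup the gain from the log transform comes from matching the functional form of the logit, not from making $x$ look more symmetric.

```python
import math
import random


def log_lik(b0, b1, xs, ys):
    """Bernoulli log-likelihood for the model logit(p) = b0 + b1 * x."""
    total = 0.0
    for x, y in zip(xs, ys):
        z = b0 + b1 * x
        # log(1 + exp(z)), computed in a numerically stable way
        softplus = max(z, 0.0) + math.log1p(math.exp(-abs(z)))
        total += y * z - softplus
    return total


def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)


def fit_logistic(xs, ys, iters=50, tol=1e-10):
    """Fit intercept and slope by Newton's method with step halving."""
    b0 = b1 = 0.0
    ll = log_lik(b0, b1, xs, ys)
    for _ in range(iters):
        g0 = g1 = h00 = h01 = h11 = 0.0
        for x, y in zip(xs, ys):
            p = sigmoid(b0 + b1 * x)
            w = p * (1.0 - p)
            g0 += y - p              # gradient, intercept
            g1 += (y - p) * x        # gradient, slope
            h00 += w                 # entries of X'WX
            h01 += w * x
            h11 += w * x * x
        det = h00 * h11 - h01 * h01
        d0 = (h11 * g0 - h01 * g1) / det
        d1 = (h00 * g1 - h01 * g0) / det
        step = 1.0
        new_ll = log_lik(b0 + d0, b1 + d1, xs, ys)
        # Halve the step until the log-likelihood no longer decreases
        while new_ll < ll and step > 1e-8:
            step /= 2.0
            new_ll = log_lik(b0 + step * d0, b1 + step * d1, xs, ys)
        b0 += step * d0
        b1 += step * d1
        converged = abs(new_ll - ll) < tol
        ll = new_ll
        if converged:
            break
    return b0, b1, ll


# Simulated data (assumption): right-skewed x, true logit linear in log(x)
random.seed(1)
n = 2000
x_raw = [random.lognormvariate(0.0, 1.0) for _ in range(n)]
x_log = [math.log(x) for x in x_raw]
y = [1 if random.random() < sigmoid(-0.5 + 1.5 * lx) else 0 for lx in x_log]

_, _, ll_raw = fit_logistic(x_raw, y)
_, slope_log, ll_log = fit_logistic(x_log, y)

# AIC = 2k - 2 * log-likelihood, with k = 2 parameters in both models
aic_raw = 2 * 2 - 2 * ll_raw
aic_log = 2 * 2 - 2 * ll_log
print("AIC with x:     ", round(aic_raw, 1))
print("AIC with log(x):", round(aic_log, 1))
```

A lower AIC on the log scale here only says that the logit is closer to linear in $\log(x)$ for these simulated data. With real variables, the same comparison should be read together with what is plausible in the subject area.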

Since you say you have "large data", you could look into splines, to let the data speak about transformations; see, for instance, the package mgcv in R. But even when using such technology (or other methods that search for transformations automatically), the ultimate test is to ask yourself what makes scientific sense. What do other people in your field do with similar data?
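As a much cruder stand-in for a spline fit (this is not mgcv, and not a substitute for it), one way to let the data speak is to bin the predictor by quantiles and look at the empirical logit in each bin. The sketch below is plain Python on simulated data; the data-generating model, the hypothetical helper `empirical_logits`, and the choice of ten bins are all assumptions made for illustration.

```python
import math
import random


def empirical_logits(xs, ys, n_bins=10):
    """Return (mean x, empirical logit, bin size) for equal-count bins of x."""
    pairs = sorted(zip(xs, ys))
    size = len(pairs) // n_bins
    rows = []
    for b in range(n_bins):
        lo = b * size
        # The last bin absorbs any remainder
        hi = (b + 1) * size if b < n_bins - 1 else len(pairs)
        chunk = pairs[lo:hi]
        m = len(chunk)
        events = sum(yy for _, yy in chunk)
        mean_x = sum(xx for xx, _ in chunk) / m
        # Adding 0.5 to both counts avoids log(0) in bins with all 0s or all 1s
        logit = math.log((events + 0.5) / (m - events + 0.5))
        rows.append((mean_x, logit, m))
    return rows


# Simulated data (assumption): right-skewed x, true logit linear in log(x)
random.seed(2)
n = 5000
xs = [random.lognormvariate(0.0, 1.0) for _ in range(n)]
ys = []
for x in xs:
    p = 1.0 / (1.0 + math.exp(-(-0.5 + 1.5 * math.log(x))))
    ys.append(1 if random.random() < p else 0)

rows = empirical_logits(xs, ys)
for mean_x, logit, m in rows:
    print(f"mean x = {mean_x:8.3f}   log(mean x) = {math.log(mean_x):7.3f}   "
          f"empirical logit = {logit:6.2f}   n = {m}")
```

Plotting the empirical logit against the bin mean of $x$ and against its logarithm shows on which scale the relationship is closer to a straight line. A penalized spline does the same job more smoothly and without arbitrary bin edges, which is why it is the better tool when the data are large enough.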