Solved – Solution in case of violation of the linearity assumption in the logistic regression model? (possibly in R)

Tags: linearity, logistic, r, regression, splines

I have a problem with a logistic regression that I set up, and I hope someone can help me.
(I am working in R.)

My data is based on hourly values. The dependent variable is a dichotomous variable (1 or 0). The model includes 30 metric independent variables (9 of them have both positive and negative observations).

Now my problem:
One assumption of logistic regression is that there is a linear relationship between the logit of the outcome and each independent metric variable. This assumption is violated in all my models.
(All other assumptions of logistic regression are not violated).
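
Written out, the assumption in question looks like this. The notation is introduced here only for illustration: $p_i$ is the probability that the dependent variable equals 1 in hour $i$, $x_{ij}$ is the value of the $j$-th metric independent variable, and $\beta_j$ is its coefficient.

```latex
\operatorname{logit}(p_i) = \ln\frac{p_i}{1 - p_i} = \beta_0 + \sum_{j=1}^{30} \beta_j \, x_{ij}
```

In other words, each metric variable is assumed to shift the log-odds by the same amount per unit across its whole range.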

To check this, I applied the Box-Tidwell test in two ways.
First, I fitted a logistic regression with all variables, regressing the original dependent variable on each independent variable and on the product of each independent variable with its natural logarithm:

glm(y ~ x1 + I(x1 * log(x1)) + x2 + I(x2 * log(x2)) + ..., family = binomial("logit"))
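
For concreteness, here is a minimal, self-contained sketch of this kind of check in base R. Everything in it is an assumption made for illustration and not taken from the actual data: the simulated data set `dat`, the two predictors `x1` and `x2`, the sample size, and the coefficients. By construction, `x1` is linear in the logit and `x2` is not.

```r
# Simulated data, purely illustrative (not the hourly data from the question).
set.seed(1)
n  <- 2000
x1 <- runif(n, 0.5, 5)   # strictly positive, so log(x1) is defined
x2 <- runif(n, 0.5, 5)

# True logit: linear in x1, nonlinear (logarithmic) in x2.
eta <- -1 + 0.6 * x1 + 1.5 * log(x2)
y   <- rbinom(n, size = 1, prob = plogis(eta))
dat <- data.frame(y, x1, x2)

# Box-Tidwell-style augmented model: add x * log(x) for each predictor.
# I() makes R treat * as multiplication instead of a formula operator;
# log() is the natural logarithm in R.
fit_bt <- glm(y ~ x1 + I(x1 * log(x1)) + x2 + I(x2 * log(x2)),
              family = binomial("logit"), data = dat)

# p-values of the added terms; a small value suggests that the logit
# is not linear in the corresponding predictor.
bt_p <- coef(summary(fit_bt))[c("I(x1 * log(x1))", "I(x2 * log(x2))"),
                              "Pr(>|z|)"]
print(round(bt_p, 4))
```

Note that `x * log(x)` is undefined for zero or negative values, so a variable with negative observations cannot enter this check unchanged.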

Second, I tested the linearity assumption with the R function boxTidwell(model$linear.predictors ~ independent variable) from the car package, for each variable separately.
For almost all variables, the test was significant, which indicates a violation of the model assumption. Several transformations of the independent variables did not help either.
Additionally, my models failed the Hosmer-Lemeshow test.

I know that I can get around the assumption by converting the metric independent variables into categorical variables. However, I would like to avoid this.
I have also read that splines can address the problem. Unfortunately, I could not find any literature explaining how, especially not for logistic regression.

Now I would like to know if someone can kindly help me here.

Does a violation of the assumption mean that I am not allowed to use this model and thus the results could be wrong?
(I don't want to use the model as a predictive or forecasting model, but only to explain/describe within the time period of the data.)

How do I apply the methodology of splines to solve my problem? How do I interpret the results?
(It would help me immensely if the explanations were supported by R code.)

Best Answer

Several points: