Solved – Logistic Regression – Dummy and Numeric variables together

categorical data, logistic, regression

I am trying to build a logistic regression model. I have some categorical variables for which I have created dummy variables (e.g. Department). I also have some numeric variables, such as Age and Tenure.

My question is which of these approaches I should use:

  1. Should I use a combination of dummy variables and numeric variables as inputs to my logistic model?
  2. Or should I split the numeric variables into categories based on response rate, and then create dummy variables from those categories for the numeric variables as well?

With the first approach, I am afraid that the numeric variables will become highly significant and cause overfitting. They may also reduce the real "significance" of the dummy variables.
With the second approach, I am afraid that I will lose a lot of information.

Best Answer

Adding a numeric variable to a logistic regression is unlikely to lead to overfitting, because it imposes quite a strong constraint: every unit increase in your numeric variable multiplies the odds of success by a factor of $\exp(\beta)$. By default this factor is constant, i.e. the same at every value of the variable, which is why you can describe the effect with just one number. You can relax that assumption by adding polynomials or splines, or by breaking your numeric variable up into categories. Overfitting starts to become an issue if you use a polynomial of too high an order, too many knots, or too many categories.
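To make the constant-factor constraint concrete, here is a minimal sketch in plain Python. The intercept and the coefficient for Age are made-up values chosen only for illustration, not estimates from any data. It checks that moving one unit up in Age multiplies the odds by the same factor $\exp(\beta)$ regardless of the starting age.

```python
import math

def predicted_probability(intercept, beta, x):
    """Logistic model: P(y = 1 | x) = 1 / (1 + exp(-(intercept + beta * x)))."""
    return 1.0 / (1.0 + math.exp(-(intercept + beta * x)))

def odds(p):
    """Convert a probability into odds."""
    return p / (1.0 - p)

# Hypothetical coefficients, assumed purely for illustration.
INTERCEPT = -3.0
BETA_AGE = 0.05

odds_ratios = []
for age in (25, 40, 60):
    odds_now = odds(predicted_probability(INTERCEPT, BETA_AGE, age))
    odds_next = odds(predicted_probability(INTERCEPT, BETA_AGE, age + 1))
    odds_ratios.append(odds_next / odds_now)
    print(f"age {age} -> {age + 1}: odds ratio = {odds_ratios[-1]:.4f}")

# The ratio is the same at every starting age and equals exp(beta).
print(f"exp(beta) = {math.exp(BETA_AGE):.4f}")
```

The predicted probabilities themselves change by different amounts at different ages; it is only the odds ratio that stays fixed, which is why a single coefficient is enough to describe the effect.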

So, if anything, it is your strategy 2 that carries the risk: you lose too much information if you choose too few categories, and you overfit if you choose too many.
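The trade-off in strategy 2 can be sketched in plain Python. The ages and cut points below are hypothetical, and fixed cut points are used for simplicity rather than the response-rate-based categories described in the question. With one cut point, very different ages become indistinguishable to the model; with many cut points, the model has to estimate one coefficient per dummy instead of the single coefficient a numeric term needs.

```python
def bin_index(value, cut_points):
    """Index of the category a value falls into (0 = below the first cut point)."""
    return sum(value >= c for c in cut_points)

def dummy_code(value, cut_points):
    """Dummy variables for a binned value; the lowest category is the reference level."""
    idx = bin_index(value, cut_points)
    return [1 if idx == k else 0 for k in range(1, len(cut_points) + 1)]

# Hypothetical ages and cut points, assumed purely for illustration.
ages = [22, 29, 31, 38, 45, 59]
coarse_cuts = [40]                  # 2 categories -> 1 dummy coefficient
fine_cuts = [25, 30, 35, 40, 50]    # 6 categories -> 5 dummy coefficients

coarse = [dummy_code(a, coarse_cuts) for a in ages]
fine = [dummy_code(a, fine_cuts) for a in ages]

for age, c, f in zip(ages, coarse, fine):
    print(f"age {age}: coarse dummies = {c}, fine dummies = {f}")

# Coarse binning: ages 22 and 38 receive identical inputs (information lost).
print("22 and 38 identical under coarse binning:", coarse[0] == coarse[3])
# Fine binning: every age gets its own category, at the cost of 5 coefficients.
print("coefficients needed under fine binning:", len(fine[0]))
```

With only six observations, the fine binning gives each observation its own category, which is the extreme case of the overfitting risk described above.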
