Regression – Should Continuous Variables Be Cut to Intervals?

categorical-data, classification, regression

I'm trying to fit a two-class logistic model using many features. While inspecting one of the features, I binned it so I could examine its behavior: in each bin I count the number of 'good class' occurrences and divide by the total number of occurrences. I see that in the upper bins there is a higher probability of the 'good class'.

See the attached image of the binned proportions.
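Roughly, the binned proportions were computed like this (a minimal Python/pandas sketch on stand-in data; the column names, the number of bins, and the simulated relationship are placeholders, not my actual data):

```python
import numpy as np
import pandas as pd

# stand-in data: continuous feature 'x' and 0/1 outcome 'y' (1 = 'good class')
rng = np.random.default_rng(0)
x = rng.uniform(0, 10, 1000)
y = rng.binomial(1, 1 / (1 + np.exp(-(0.6 * x - 3))))
df = pd.DataFrame({"x": x, "y": y})

# cut the feature into bins and compute the share of 'good class' per bin
df["x_bin"] = pd.cut(df["x"], bins=10)
binned_rate = df.groupby("x_bin", observed=True)["y"].mean()
print(binned_rate)
```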

For this kind of functional form to enter the model, I would have to add higher-order terms [for instance, by using natural splines]; and this, I'm afraid, would cause my model to overfit.

So I thought I could 'help' the regression by explicitly dividing the variable into a few intervals, with the knots chosen by 'eye-balling'. In each interval the variable could then have a different linear coefficient [or I could even make the effect a constant by turning the variable into a categorical one]; a sketch of the categorical variant is below.
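Concretely, the categorical variant I have in mind would look something like this (again a hedged sketch with statsmodels on stand-in data; the bin edges stand for the eye-balled knots):

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# same kind of stand-in data as in the earlier sketch
rng = np.random.default_rng(0)
x = rng.uniform(0, 10, 1000)
y = rng.binomial(1, 1 / (1 + np.exp(-(0.6 * x - 3))))
df = pd.DataFrame({"x": x, "y": y})

# eye-balled knots -> interval membership as a categorical feature (edges are placeholders)
df["x_bin"] = pd.cut(df["x"], bins=[-np.inf, 2.5, 5.0, 7.5, np.inf])

# each interval gets its own constant effect via dummy coding
binned_fit = smf.logit("y ~ C(x_bin)", data=df).fit()
print(binned_fit.summary())
```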

I hope I've made myself clear, and sorry for the non-mathematical explanation – I'm just trying to make my point as clear as possible. Thanks for the help!

Best Answer

If you use $n$ bins, you spend $n-1$ degrees of freedom. For the same cost in degrees of freedom, you can fit a restricted cubic spline with $n$ knots, which also uses $n-1$ degrees of freedom for that variable. Thus, there should definitely not be more overfitting if you use splines.
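For illustration, here is one way to fit such a spline (a sketch in Python with statsmodels/patsy, not necessarily the tools you are using; the data are simulated stand-ins). Patsy's `cr()` term builds a natural (restricted) cubic spline basis, so `df=4` spends the same four degrees of freedom that five bins would:

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# simulated stand-in data: continuous feature 'x', binary outcome 'y'
rng = np.random.default_rng(0)
x = rng.uniform(0, 10, 1000)
p = 1 / (1 + np.exp(-(0.6 * x - 3)))   # true probability of the 'good class'
df = pd.DataFrame({"x": x, "y": rng.binomial(1, p)})

# natural (restricted) cubic spline with 4 degrees of freedom,
# i.e. the same cost as splitting x into 5 bins
spline_fit = smf.logit("y ~ cr(x, df=4)", data=df).fit()
print(spline_fit.summary())
```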

More generally, discretizing continuous variables is almost always a bad idea; see, e.g., here or here or here. In your case, defining the bins by "eye-balling" will tempt you to change them "just a little bit" to get a better fit. I am not saying you will give in to this temptation, but you would need to account for this extra flexibility in your model-building process somehow.

Bottom line: don't discretize; use restricted cubic splines.
