Machine Learning – Justifying Unsupervised Discretization of Continuous Variables

binning · categorical data · generalized linear model · machine learning

A number of sources suggest that discretizing (categorizing) continuous variables prior to statistical analysis has many negative consequences (see, for example, references [1]-[4] below).

Conversely, [5] suggests that some machine learning techniques are known to produce better results when continuous variables are discretized (also noting that supervised discretization methods tend to perform better than unsupervised ones).

Are there any widely accepted benefits of, or justifications for, this practice from a statistical perspective?

In particular, would there be any justification for discretizing continuous variables within a GLM analysis?


[1] Royston P, Altman DG, Sauerbrei W. Dichotomizing continuous predictors in multiple regression: a bad idea. Statistics in Medicine 2006; 25:127–141.


[2] Brunner J, Austin PC. Inflation of Type I error rate in multiple regression when independent variables are measured with error. The Canadian Journal of Statistics 2009; 37(1):33–46.


[3] Irwin JR, McClelland GH. Negative consequences of dichotomizing continuous predictor variables. Journal of Marketing Research 2003; 40:366–371.


[4] Harrell Jr FE. Problems caused by categorizing continuous variables. http://biostat.mc.vanderbilt.edu/twiki/bin/view/Main/CatContinuous, 2004. Accessed 6.9.2004.

[5] Kotsiantis S, Kanellopoulos D. Discretization techniques: a recent survey. GESTS International Transactions on Computer Science and Engineering 32(1):47–58.

Best Answer

The purpose of statistical models is to model (approximate) an unknown, underlying reality. When you discretize something that is naturally continuous, you are asserting that the response is exactly the same across an entire range of the predictor, and then jumps suddenly at the next interval. Do you really believe that the natural world works by having a large difference in the response between x-values of 9.999 and 10.001 while having no difference between 9.001 and 9.999 (assuming one of the intervals is 9–10)? I cannot think of any natural process that plausibly works that way.

Now, there are many natural processes that act in a non-linear manner: a change from 8 to 9 in the predictor may produce a very different change in the response than a change from 10 to 11. A discretized predictor may therefore fit better than a linear relationship, but only because it is allowed more degrees of freedom. There are other ways to allow additional degrees of freedom, such as polynomials or splines, and these options let us penalize toward a chosen level of smoothness, yielding something that is a better approximation of the underlying natural process.
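As a minimal sketch of this point (the smooth quadratic "truth", the unit-width bins, and all variable names here are illustrative assumptions, not anything from the references above), the following compares a step-function model built from a discretized predictor with a smooth least-squares fit. Note how the step model predicts identical responses at 9.001 and 9.999 but jumps at 10.001, while the smooth fit changes gradually:

```python
import numpy as np

rng = np.random.default_rng(42)
x = rng.uniform(0, 20, 1000)
y = 0.05 * x**2 + rng.normal(0, 0.2, 1000)  # smooth quadratic truth + noise

# Discretized model: bin x into unit-width intervals, predict each bin's mean.
edges = np.arange(0, 21)                 # bins [0,1), [1,2), ..., [19,20)
bin_ids = np.digitize(x, edges) - 1
bin_means = np.array([y[bin_ids == b].mean() for b in range(20)])

def step_predict(x_new):
    # Same prediction everywhere inside a bin; discontinuous jump at each edge.
    return bin_means[np.clip(np.digitize(x_new, edges) - 1, 0, 19)]

# Smooth alternative: ordinary least squares on a quadratic basis (3 df).
X = np.column_stack([np.ones_like(x), x, x**2])
beta, *_ = np.linalg.lstsq(X, y, rcond=None)

def smooth_predict(x_new):
    return beta[0] + beta[1] * x_new + beta[2] * x_new**2

for x0 in (9.001, 9.999, 10.001):
    print(f"x={x0}: step={step_predict(np.array([x0]))[0]:.3f}, "
          f"smooth={smooth_predict(x0):.3f}")
```

The same contrast holds for penalized splines: they also spend extra degrees of freedom on non-linearity, but spend them on a smooth curve rather than on artificial discontinuities.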
