Solved – When can a continuous variable be treated as categorical?

categorical-data, classification, continuous-data, regression, skewness

I have a continuous variable, which can take any value between 0 and some large, though not infinite, number. Let us assume that the maximum possible value is 1000.

The values are nowhere near uniformly distributed in 2 ways. First, the distribution of values skews right (most values are on the smaller end of the 0-1000 scale, with few larger than 100).

Second, the values cluster around multiples of 5 (mostly from 5 to 50 given the right skew). This is not due to missing data, but rather to the way in which these data are naturally generated.

My question for CrossValidated is the following. If I want to build a predictive model to predict values for new observations, then should I (1) use regression because the data are continuous or (2) rely on classification because the data clump tightly around a few values?

In other words, at what point does a continuous variable become, effectively, categorical? I do not give specifics about the data because I'm curious to learn if a general intuition exists about how such sparse, clustered continuous data can be treated.
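For concreteness, data with the two features described (right skew, and clumping at multiples of 5) might look like the following. This is a hypothetical simulation, not the asker's data; the exponential distribution and its mean are invented for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)

# Right-skewed continuous values on (0, 1000): exponential with an
# (assumed) mean of 25, so most values fall well below 100
raw = rng.exponential(scale=25.0, size=10_000)
raw = raw[raw <= 1000]

# The generating process clusters values at multiples of 5
values = 5 * np.round(raw / 5)

# A continuous 0-1000 scale, yet only a few dozen distinct levels
# actually occur, mostly at the low end
print(np.median(values))
print(len(np.unique(values)))
```

The tension in the question is visible here: the variable is continuous in principle, but its observed support is a short, sparse list of levels.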

Best Answer

When can you do this ... is whenever you like. Some researchers are fond of trying to model complicated relationships between responses and continuous predictors by splitting the predictor ranges into bins or intervals, thus converting them into categorical predictors. Discussion on this site alone is usually insistent on the difficulties and even dangers of this approach. See e.g. What is the benefit of breaking up a continuous predictor variable? -- where the title of the question does not hint at everything in the answers.
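The binning being warned against looks like this in practice. The predictor, bin edges and labels here are all arbitrary illustrations, not recommendations:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)
x = rng.uniform(0.0, 100.0, size=1_000)  # a continuous predictor

# Splitting the predictor's range into intervals converts it into a
# categorical predictor; edges [0, 25, 50, 75, 100] are invented
x_binned = pd.cut(
    x,
    bins=[0, 25, 50, 75, 100],
    labels=["low", "mid", "high", "top"],
    include_lowest=True,
)

# Every value inside a bin is now treated identically: all the
# within-bin variation in x is discarded by this representation
```

The cost is exactly the information loss discussed above: 1,000 distinct measurements collapse to four labels.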

When should you do this ... is the larger question, and one answer is thus reluctantly and as rarely as possible.

Rounding (in your case to multiples of 5 when the range is from 0 to 1000) as such is usually only a little worrying. In many fields some rounding when reporting continuous variables is conventional, at least historically, and often grows out of general scientific awareness that high resolution of reporting (more decimal places, or more significant figures) is spurious or only trivially informative. Thus in many fields adult age rounded in years, or people's heights rounded in cm or inches, or temperatures rounded to $0.1^\circ$ C are standard even when more precision is possible, and such rounding has not stood in the way of much good statistically-based science. The important detail is whether several distinct levels are discernible in the data. Thus it is natural scientifically and statistically to expect age to be measured in months when looking at growth of children, but in days or even hours when monitoring small babies.

Strong bimodality, to the extent that you have almost what Edgeworth and Yule called U-shaped distributions, is worrying, but the question remains what you do when you have it. Reducing a U-shaped distribution to extreme categories would, however, be throwing away information on the grounds that you don't have enough, rather like a poor person giving away all their money on the grounds that they have so little anyway (the moral or religious grounds for the latter being outside the scope of this forum).

So the key is that just because a continuous variable is well represented only in terms of a few levels doesn't oblige you to treat it as categorical.

This question is often mixed up with a different one: whether a continuous predictor should be entered into a model as it is measured, or via a transformation, or via a representation in simple polynomials, splines, orthogonal polynomials or fractional polynomials. What can bite is that with just two extremes well represented, it may be quite unclear how that variable's contribution is best measured. As usual, a modeller falls back on linearity in the absence of other information, but sometimes there may be compelling physical (biological, economic, whatever) grounds for something different.
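Those alternatives can be compared directly without ever categorising the predictor. A minimal sketch, on invented data whose true relationship is logarithmic, fits the same response under three representations of one continuous predictor and compares residual sums of squares:

```python
import numpy as np

rng = np.random.default_rng(2)

# Hypothetical data: the true relationship is logarithmic
x = rng.uniform(1.0, 100.0, size=200)
y = 3.0 * np.log(x) + rng.normal(scale=0.5, size=200)

# Three candidate representations of the same continuous predictor
designs = {
    "linear": np.column_stack([np.ones_like(x), x]),
    "quadratic": np.column_stack([np.ones_like(x), x, x ** 2]),
    "log": np.column_stack([np.ones_like(x), np.log(x)]),
}

# Ordinary least squares under each representation; the residual sum
# of squares shows how well each one captures the curvature
rss = {}
for name, X in designs.items():
    beta, resid, rank, sv = np.linalg.lstsq(X, y, rcond=None)
    rss[name] = float(resid[0])
```

Here the log representation fits best by construction; in real work the choice would rest on the subject-matter grounds the answer mentions, not only on fit.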

At the extreme this might drive some researchers to a categorical representation, but it seems rare that there is no information on what should happen in the middle of the range. Thus I have encountered British hydrological data in which drainage basins (catchments, watersheds) of two very different orders of magnitude of area (in square kilometres) were represented: very small basins instrumented intensively by university researchers, and much larger basins instrumented by the national organisations that manage flooding and water generally. Here the absence of intermediate areas is just a side-effect of how the data were assembled, and there is prior scientific knowledge from many studies that area should almost always be used via its logarithm (that is, it makes no sense to categorise areas, say as small, medium or large).
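The hydrological example can be sketched with invented areas. On the raw scale the basins form two lumps separated by a void; on the log scale they sit on one common, scientifically interpretable axis, so the gap invites interpolation rather than categorisation:

```python
import numpy as np

# Invented basin areas in km^2: small research basins and large
# nationally managed basins, with nothing in between
areas = np.array([0.2, 0.5, 1.2, 2.0, 450.0, 800.0, 1500.0])

# log10(area) places both groups on one continuous, interpretable scale
log_areas = np.log10(areas)
print(log_areas)
```
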

Skewness is key in this example and complicates matters here and elsewhere, but I think its effects are extra rather than in contradiction to the above.