Solved – Categorize continuous data effectively (taking into account a response variable)

Tags: categorical-data, modeling, multinomial-distribution, r

I wonder whether there are better ways to categorize continuous data (e.g. age) than splitting at quantiles with the cut function (in R). I have heard of using trees to choose the splits, so that the division takes into account how well it differentiates a response variable, but I cannot find a quick, reasonable explanation of that. I want to categorize my data in order to use them in a multinomial logit model.

Is there any other approach? (Slightly off-topic: I use R, so I would be grateful for package references or similar pointers.)

Best Answer

In general, the better approach is not to categorize a continuous variable at all without really good reasons. You are discarding information, as seen by the fact that the categorization cannot be reversed to recover the original values. Usually the resulting categorical variable is also harder to handle than a single continuous variable: a variable with k categories enters a model as k - 1 indicator variables rather than as one term.
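As a minimal sketch of the information loss, the following Python snippet bins a made-up age variable into quartiles and compares how well the raw and the binned versions track a response. The data, the exactly linear response, and the helper names (`pearson`, `quartile_bin`) are illustrative assumptions, not anything from the question.

```python
import statistics

def pearson(a, b):
    """Pearson correlation between two equal-length sequences."""
    ma, mb = statistics.fmean(a), statistics.fmean(b)
    cov = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    va = sum((x - ma) ** 2 for x in a)
    vb = sum((y - mb) ** 2 for y in b)
    return cov / (va * vb) ** 0.5

# Hypothetical data: one observation per age from 20 to 79,
# with a response that is an exact linear function of age.
ages = list(range(20, 80))
response = [0.5 * a + 3.0 for a in ages]

# Quartile cut points (three of them), then a bin code 0..3 per age.
cuts = statistics.quantiles(ages, n=4)

def quartile_bin(age):
    """Number of cut points lying below this age, i.e. the bin index."""
    return sum(age > c for c in cuts)

binned = [quartile_bin(a) for a in ages]

distinct_ages = len(set(ages))    # distinct values before binning
distinct_bins = len(set(binned))  # distinct values after binning

r_continuous = pearson(ages, response)
r_binned = pearson(binned, response)

print(distinct_ages, "distinct ages collapse into", distinct_bins, "bins")
print("correlation with response, continuous age:", round(r_continuous, 4))
print("correlation with response, binned age:    ", round(r_binned, 4))
```

Even in this best case, where the response depends on age alone, the binned variable correlates less strongly with the response than the original does, and nothing computed from the bin codes can recover the ages within a bin.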

One argument sometimes used for categorization is that measurements may be unreliable, but throwing away information just degrades a variable further.

Specifically, here the motive is stated to be to use a multinomial logit model. You can use age as a continuous predictor in a multinomial logit model, so you presumably want a categorized age to be the response in such a model. The substantive logic is not obvious there either; it is the passage of time, not predictors, that makes people (or organisms or organisations) one age rather than another. I can think of examples where age makes sense as a response, e.g. age of prey in ecology, but I'd be surprised if age were a defensible choice of response in most problems. You gave age as an example, but the question applies more broadly: is your chosen response a suitable choice scientifically?
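To illustrate the point that age can enter a multinomial logit model directly as a continuous predictor, here is a rough Python sketch that fits such a model by plain gradient descent. Everything in it is an assumption made for illustration: the toy data, the three outcome classes, the learning rate, and the iteration count. For simplicity every class gets its own intercept and slope, whereas a standard multinomial logit fixes one reference category; in practice one would use an established fitting routine rather than this loop.

```python
import math

# Hypothetical toy data: ages 20..79 and a three-class outcome
# whose most likely class shifts with age.
ages = list(range(20, 80))
labels = [0 if a < 40 else 1 if a < 60 else 2 for a in ages]
K = 3  # number of outcome classes

# Standardize age so that gradient descent is well conditioned.
mean_age = sum(ages) / len(ages)
sd_age = math.sqrt(sum((a - mean_age) ** 2 for a in ages) / len(ages))
z = [(a - mean_age) / sd_age for a in ages]

def softmax(scores):
    """Turn a list of linear scores into probabilities summing to 1."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def probabilities(weights, zi):
    """Class probabilities at standardized age zi."""
    return softmax([b0 + b1 * zi for b0, b1 in weights])

def mean_log_loss(weights):
    """Average negative log-likelihood over the toy data."""
    total = sum(math.log(probabilities(weights, zi)[y])
                for zi, y in zip(z, labels))
    return -total / len(z)

# One [intercept, slope] pair per class, all starting at zero.
weights = [[0.0, 0.0] for _ in range(K)]
initial_loss = mean_log_loss(weights)

learning_rate = 0.5  # assumed value, not tuned
for _ in range(500):
    grads = [[0.0, 0.0] for _ in range(K)]
    for zi, y in zip(z, labels):
        p = probabilities(weights, zi)
        for k in range(K):
            err = p[k] - (1.0 if k == y else 0.0)
            grads[k][0] += err
            grads[k][1] += err * zi
    for k in range(K):
        weights[k][0] -= learning_rate * grads[k][0] / len(z)
        weights[k][1] -= learning_rate * grads[k][1] / len(z)

final_loss = mean_log_loss(weights)

def predict_proba(age):
    """Fitted class probabilities for a raw (unstandardized) age."""
    return probabilities(weights, (age - mean_age) / sd_age)

print("log-loss before and after fitting:", initial_loss, final_loss)
print("fitted probabilities at age 25:", predict_proba(25))
print("fitted probabilities at age 75:", predict_proba(75))
```

The fitted probabilities vary smoothly with age, so no cut points have to be chosen before fitting.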

Note that how to do what you ask in R is off-topic here.
