Solved – Categorize continuous data effectively (taking into account a response variable)

categorical data, modeling, multinomial-distribution, r

What are better approaches to categorizing continuous data (e.g. age) than splitting at quantiles with the cut function (in R)? I have heard about using trees to divide the data in a way that takes into account how well a division differentiates a response variable, but I cannot find a quick, reasonable explanation of that. I want to categorize my data in order to use them in a multinomial logit model.

Is there any other approach? (Slightly off-topic: I use R, so I would be grateful for package references or similar.)
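
For reference, the two approaches mentioned in the question can be sketched roughly as follows. This is only an illustration on simulated data, not a recommendation: the variables age and outcome are hypothetical, and reading split points from the splits component of an rpart fit is one possible way to turn a tree into bins.

```r
library(rpart)  # recommended package bundled with standard R installations

set.seed(1)
# Hypothetical data: the response category changes at ages 30 and 55
age <- round(runif(500, 18, 80))
outcome <- factor(ifelse(age < 30, "a", ifelse(age < 55, "b", "c")))

# Quantile bins: boundaries depend only on the distribution of age
age_q <- cut(age, breaks = quantile(age, probs = seq(0, 1, 0.25)),
             include.lowest = TRUE)

# Tree bins: a classification tree picks boundaries that separate the response
fit <- rpart(outcome ~ age, method = "class",
             control = rpart.control(maxcompete = 0, maxsurrogate = 0))
cuts <- sort(unique(fit$splits[, "index"]))
age_tree <- cut(age, breaks = c(-Inf, cuts, Inf))

print(table(age_q, outcome))     # quantile bins versus response categories
print(table(age_tree, outcome))  # tree bins versus response categories
```

With real, noisy data the tree may return many more (or no) split points, so the resulting bins should be checked rather than trusted blindly.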

Best Answer

In general, the better approach is not to categorize a continuous variable at all without really good reasons. You are discarding information, as seen by the fact that the categorization cannot be reversed to recover the original. Usually the resulting categorical variable(s) are more difficult to handle in any case than a single continuous variable.

One argument sometimes used for categorization is that measurements may be unreliable, but throwing away information just degrades a variable further.

Specifically, the stated motive here is to use a multinomial logit model. You can use age as a continuous predictor in a multinomial logit model, so you presumably want a categorized age to be the response in such a model. The substantive logic is not obvious there either: it is the passage of time, not predictors, that makes people (or organisms or organisations) one age rather than another. I can think of examples where age makes sense as a response, e.g. age of prey in ecology, but I'd be surprised at age being a defensible choice of response in most problems. You gave age as an example, but the question applies more broadly: is your chosen response a suitable choice scientifically?
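
The point that age can enter a multinomial logit model as a continuous predictor can be illustrated with a minimal sketch. It assumes simulated data (the variables age and choice and the coefficients used to generate them are hypothetical) and uses multinom from the nnet package as one possible fitting function; the quartile-binned fit is included only for comparison.

```r
library(nnet)  # recommended package bundled with standard R installations

set.seed(42)
n <- 600
age <- runif(n, 18, 80)

# Hypothetical three-category response whose log-odds change linearly with age
z <- (age - 50) / 10
eta <- cbind(a = 0, b = 0.5 * z, c = 1.0 * z)
p <- exp(eta) / rowSums(exp(eta))
choice <- factor(apply(p, 1, function(pr) sample(colnames(p), 1, prob = pr)))

# Age entered as a single continuous predictor
fit_cont <- multinom(choice ~ age, trace = FALSE)

# Age cut into quartile bins, for comparison
age_q <- cut(age, breaks = quantile(age, probs = seq(0, 1, 0.25)),
             include.lowest = TRUE)
fit_cat <- multinom(choice ~ age_q, trace = FALSE)

# Compare the two fits; the binned model spends more parameters on age
print(c(continuous = AIC(fit_cont), categorized = AIC(fit_cat)))
```

The continuous fit uses one slope per non-reference category, whereas the binned fit needs three dummy coefficients per category and still cannot distinguish ages within a bin.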

Note that how to do what you ask in R is off-topic here.

Related Solutions

Solved – Categorical response variable prediction

You could use any classifier: linear discriminants, multinomial logit (as Bill pointed out), support vector machines, neural nets, CART, random forest, C5 trees. There is a world of different models that can help you predict $v.a$ using $v.b$ and $v.c$. Here is an example using the R implementation of random forest:

# packages
library(randomForest)

# variables
v.a <- c('cat','dog','dog','goat','cat','goat','dog','dog')
v.b <- c(1,2,1,2,1,2,1,2)
v.c <- c('blue', 'red', 'blue', 'red', 'red', 'blue', 'yellow', 'yellow')

# model fit
# note that you must turn the categorical variables into factors or R won't
# use them properly; a data.frame (unlike cbind) keeps v.c as a factor
model <- randomForest(y = as.factor(v.a), x = data.frame(v.b, v.c = as.factor(v.c)), ntree = 10)

#plot of model accuracy by class
plot(model)

# model confusion matrix
model$confusion

Clearly these variables don't show a strong relation.

Solved – Test for Complete Spatial Randomness taking into account background distribution

One approach I have used in the past with the spatstat package is to create a set of simulations by sampling with replacement from the universe point pattern (in my work a point can occur repeatedly at the same location, e.g. crimes at an address). You can then use these samples as a reference distribution for whatever test you are interested in.

Here is a function to create those sub-samples (simply change the sub_universe line to sample without replacement if that is how you want the simulations to be drawn). (It appears I wrote this 3 years ago, and I'm sure its computation time can be improved.)

# My awful function to generate simulation envelopes of a spatstat object given the universe
library(spatstat)

ppp_lists <- function(universe_x, universe_y, sub_ppp, nlist) {
  myppp_list <- vector("list", nlist)  # preallocate the list of simulated patterns
  universe_xy <- data.frame(x = universe_x, y = universe_y)  # universe locations to sample from
  sampsize <- sub_ppp$n  # each simulation has as many points as the observed pattern
  for (i in seq_len(nlist)) {
    # sample rows with replacement from the universe
    sub_universe <- universe_xy[sample(nrow(universe_xy), size = sampsize, replace = TRUE), ]
    # build a ppp object, taking the window from the observed pattern
    myppp_list[[i]] <- ppp(sub_universe$x, sub_universe$y, window = sub_ppp$window)
  }
  myppp_list
}

That function generates a list that can be supplied to the envelope function to build the simulation bands. Here is an example of passing the list using the simulate argument to mad.test:

# Now making simulation envelopes based on universe (warnings are for duplicate points)
SimEvel <- ppp_lists(universe_x = bg$x, universe_y = bg$y, sub_ppp = subset, nlist = 99)

#Now can use the user supplied envelopes for the mad.test
mytest <- mad.test(subset, simulate=SimEvel)
mytest

Any test that uses calculations for the density will be off by some constant here (from a finite set of points it is not 100% clear to me how you should define area). But the simulation envelopes should be fine for hypothesis testing.
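
The logic of using resamples from the universe as a reference distribution can be sketched in base R without spatstat. This is a simplified Monte Carlo test, not the mad.test procedure: the simulated universe, the clustered observed pattern, the helper mean_nn, and the choice of mean nearest-neighbour distance as the test statistic are all assumptions made for illustration.

```r
set.seed(7)
# Hypothetical background universe of valid locations (e.g. all addresses)
universe <- data.frame(x = runif(2000), y = runif(2000))

# Hypothetical observed pattern: the 50 universe locations closest to the centre
d0 <- sqrt((universe$x - 0.5)^2 + (universe$y - 0.5)^2)
observed <- universe[order(d0)[1:50], ]

# Test statistic: mean distance from each point to its nearest neighbour
mean_nn <- function(pts) {
  d <- as.matrix(dist(pts[, c("x", "y")]))
  diag(d) <- Inf  # ignore each point's zero distance to itself
  mean(apply(d, 1, min))
}

# Reference distribution: resample the universe with replacement
nsim <- 99
sim_stats <- replicate(nsim, {
  idx <- sample(nrow(universe), size = nrow(observed), replace = TRUE)
  mean_nn(universe[idx, ])
})

# One-sided Monte Carlo p-value: small distances indicate clustering
obs_stat <- mean_nn(observed)
p_value <- (1 + sum(sim_stats <= obs_stat)) / (nsim + 1)
print(p_value)
```

Because the null patterns are drawn from the same set of valid locations as the data, any clustering in the background distribution itself is already built into the reference distribution.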

There are other functions in the spatstat package for binary marked events like disease infection, but those aren't directly applicable here.

Another approach might be to turn the window into a raster image where the only valid locations are the very small pixels in which points can occur. Then all the usual functions that take a window will work (I am not very familiar with mad.test, so I can't say if it will be applicable for this test). The various tests in the package will become slower as the number of points grows, but generating the simulations shouldn't be too expensive.
