Solved – Which regression tree to use for large data

cart, large data, r, regression

I have a data frame with 2 million rows and approximately 200 columns/features, and roughly 30-40% of the entries are blank. I am trying to find the important features for a binary response variable. The predictors may be categorical or continuous.

I started by applying logistic regression, but with so many missing entries I feel this is not a good approach, since glm discards every record that has any blank entry. So I am now looking at tree-based algorithms (rpart or gbm), which can handle missing data more gracefully.
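As a quick check on how severe that listwise deletion is (a sketch; tmpData is the data frame name from my rpart command below):

## Fraction of rows that glm's default na.action = na.omit keeps.
## With ~200 columns and 30-40% of cells blank, this is typically near zero.
mean(complete.cases(tmpData))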

Since my data is too big for rpart or gbm, I decided to randomly sample 10,000 records from the original data, apply rpart to them, and keep building a pool of important variables. However, even these 10,000 records seem to be too much for rpart.

What can I do in this situation? Is there a switch I can use to make it faster, or is it simply not feasible to apply rpart to my data?

I am using the following rpart command:

library(rpart)
varimp <- rpart(fmla, data = tmpData, method = "class")$variable.importance
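For reference, here is a sketch of the subsample-and-pool loop I described, using two genuine rpart.control switches that cut run time; the iteration count and cp value are illustrative, and note that maxsurrogate = 0 also disables rpart's surrogate-split handling of missing values:

## xval = 0 skips cross-validation, maxsurrogate = 0 skips the
## (expensive) surrogate-split search; cp = 0.001 is illustrative
ctrl <- rpart.control(xval = 0, maxsurrogate = 0, cp = 0.001)

pool <- list()
for (i in 1:50) {
  idx <- sample(nrow(tmpData), 10000)          # random subsample of rows
  fit <- rpart(fmla, data = tmpData[idx, ], method = "class", control = ctrl)
  pool[[i]] <- fit$variable.importance         # named numeric vector
}

## Sum each variable's importance across subsamples and rank
imp <- unlist(pool)
sort(tapply(imp, names(imp), sum), decreasing = TRUE)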

Best Answer

Simple answer:

Why not simply drop the features that have a large amount of missing data from your analysis? They carry little information and are unlikely to be useful predictors. Dropping the rows with missing data, on the other hand, can introduce bias into your results. This points to the more fundamental problem of how to deal with missing data.
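A minimal sketch of that column filter, assuming the blanks are coded as NA, the data frame name tmpData from the question, and an illustrative 50% cutoff:

## Fraction of missing values in each column
miss_frac <- colMeans(is.na(tmpData))

## Keep only the features with less than 50% missing entries
tmpData <- tmpData[, miss_frac < 0.5]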

More involved answer:

Missing data is a well-established problem in statistics, and there are techniques that deal with it. One way to address it is to impute, i.e. guess, the missing values; once you have done that, you can run a logistic regression. To account for the uncertainty you introduce by guessing, you can generate multiple imputed data sets and combine the logistic regressions with the R package mitools. I have used this approach on my own data. How you do the actual imputation, however, depends on the properties of your data set: are certain features correlated, for example?
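A sketch of that pipeline, using mice for the imputation step (my choice here; above I only commit to mitools for the pooling) and placeholder variable names y, x1, x2:

library(mice)     # one option for generating the imputations
library(mitools)  # for pooling the per-imputation fits

## Generate m = 5 completed copies of the data
imp  <- mice(tmpData, m = 5, printFlag = FALSE)
imps <- imputationList(lapply(1:5, function(i) complete(imp, i)))

## Fit the same logistic regression on each completed data set
fits <- with(imps, glm(y ~ x1 + x2, family = binomial))

## Combine the estimates across imputations (Rubin's rules)
summary(MIcombine(fits))

Bear in mind that imputing 2 million rows is expensive, so prototyping this on a subsample first is reasonable.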

Regarding your question about the performance of the classification and regression tree (CART) method: I have not used this technique extensively, but I imagine it will struggle with a large number of features, since, as I understand it, it could build a classification tree of at least 200 nodes. I also have no idea how it deals with missing data; it is a bit alarming if it doesn't complain!

I think logistic regression is your best bet, but you need to figure out how to deal with the missing data first.

Take-home message: be wary of missing data; don't run methods without knowing how they handle missing values and what assumptions they make.
