Rpart Prediction – How to Use Rpart’s Result for Classification

Tags: classification, rpart

This may be a simple question, but I am stuck on it. I am using the Recursive Partitioning (rpart) package in R to build a classification tree. To test rpart, I generated a tree from sample data, fitting it with rpart's formula interface:

   fit <- rpart(formula, data=, method=, control=)

This gave me the classification tree. I can view the summary and plot the tree and the results. My question is: how can I use the result for prediction? I want to supply new input data to the tree and have it return the predicted class for each input. The tree seems of little use to me unless I can predict with it. I may be interpreting the result incorrectly, so please clarify this for me.

Best Answer

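fit <- rpart(formula, data =, method =, control =)   # same template call as in the question
# Names of the predictor columns the fitted model expects
fitVariablesUsed <- all.vars(delete.response(fit$terms))
# Note: the argument is newdata, not data
preds <- predict(fit, newdata = newdata[, fitVariablesUsed, drop = FALSE], type = "prob")

This returns a probability matrix with one row per observation and one column per class, i.e. the probability that the observation is in class 1, class 2, and so on. Use type = "class" instead if you want the predicted class labels directly.

The argument must be named newdata. If you write data = instead, predict() silently ignores it and returns the predictions for the training data.

Make sure the columns line up between the data frame you built the model with and the data frame you are predicting on.

fitVariablesUsed holds the names of the predictor variables from the fitted model (20 of them, say). They can be used to select the matching columns of the new data frame, so long as the columns are named the same thing.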
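As a rough end-to-end illustration of the steps above, here is a minimal sketch using the built-in iris data as a stand-in for the asker's data. The 30-row hold-out split and the object names (train, newdata, pred_class, pred_prob) are assumptions made for the example, not part of the original question.

```r
library(rpart)

set.seed(42)
# Hold out 30 rows of iris to play the role of unseen input data (assumed split)
test_idx <- sample(nrow(iris), 30)
train    <- iris[-test_idx, ]
newdata  <- iris[test_idx, ]

# Fit a classification tree on the training rows
fit <- rpart(Species ~ ., data = train, method = "class")

# Predicted class label for each new observation
pred_class <- predict(fit, newdata = newdata, type = "class")

# Class-membership probabilities: one row per observation, one column per class
pred_prob <- predict(fit, newdata = newdata, type = "prob")

# Cross-tabulate predictions against the known labels of the held-out rows
print(table(predicted = pred_class, actual = newdata$Species))
```

With type = "class" the result is a factor of labels, which is usually what "give me the classification" means; type = "prob" is the better choice if you want to apply your own threshold or rank observations.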

Related Solutions

Solved – How to measure/rank “variable importance” when using CART? (specifically using {rpart} from R)

Variable importance might generally be computed based on the corresponding reduction of predictive accuracy when the predictor of interest is removed (with a permutation technique, like in Random Forest) or some measure of decrease of node impurity, but see (1) for an overview of available methods. An obvious alternative to CART is RF of course (randomForest, but see also party). With RF, the Gini importance index is defined as the averaged Gini decrease in node impurities over all trees in the forest (it follows from the fact that the Gini impurity index for a given parent node is larger than the value of that measure for its two daughter nodes, see e.g. (2)).

I know that Carolin Strobl and colleagues have contributed many simulation and experimental studies on (conditional) variable importance in RFs and CARTs (e.g., (3-4), among many others, as well as her thesis, Statistical Issues in Machine Learning – Towards Reliable Split Selection and Variable Importance Measures).

To my knowledge, the caret package (5) only considers a loss function for the regression case (i.e., mean squared error). Maybe it will be added in the near future (anyway, an example with a classification case by k-NN is available in the on-line help for dotPlot).

However, Noel M O'Boyle seems to have some R code for Variable importance in CART.
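For a quick look without extra code, recent versions of rpart also store an importance score on the fitted object. The sketch below assumes such a version (the variable.importance component is only present when the tree has at least one split) and uses iris purely as example data. As I understand the rpart documentation, the score sums the split improvement for each variable, including credit for surrogate splits, so it is not the permutation-based measure discussed above.

```r
library(rpart)

# Example tree on built-in data (iris is an assumption for illustration)
fit <- rpart(Species ~ ., data = iris, method = "class")

# Named numeric vector of importance scores stored by rpart
imp <- fit$variable.importance

# Rescale to percentages so the variables are easier to compare
imp_pct <- round(100 * imp / sum(imp), 1)
print(imp_pct)
```

A variable can receive a non-zero score here even if it never appears as a primary split in the printed tree, because surrogate splits contribute as well.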

References

(1) Sandri and Zuccolotto. A bias correction algorithm for the Gini variable importance measure in classification trees. 2008
(2) Izenman. Modern Multivariate Statistical Techniques. Springer 2008
(3) Strobl, Hothorn, and Zeileis. Party on!. R Journal 2009 1/2
(4) Strobl, Boulesteix, Kneib, Augustin, and Zeileis. Conditional variable importance for random forests. BMC Bioinformatics 2008, 9:307
(5) Kuhn. Building Predictive Models in R Using the caret Package. JSS 2008 28(5)

R – How to Use Recursive Partitioning with rpart() Method in R

Perhaps you misunderstood the message? It is saying that, having built the tree using the control parameters specified, only the variables mpa_a and tc_b have been involved in splits. All the variables were considered, but just these two were needed.
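To check this on your own tree, you can list the variables that actually appear in splits. The sketch below uses mtcars as stand-in data, since data81 is not available here; the object name used is an assumption for the example.

```r
library(rpart)

# Regression tree on built-in data, with all other columns offered as predictors
fit <- rpart(mpg ~ ., data = mtcars, method = "anova")

# fit$frame has one row per node; leaf rows are marked "<leaf>" in the var column
used <- setdiff(unique(as.character(fit$frame$var)), "<leaf>")
print(used)

# printcp() reports the same list under "Variables actually used in tree construction"
printcp(fit)
```

Every column in the formula was a candidate at each node; the list only shows which ones won a split.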

That tree seems quite small; do you have only a small sample of observations? If you want to grow a bigger tree for subsequent pruning back, then you need to alter the minsplit and minbucket control parameters. See ?rpart.control, e.g.:

rm <- rpart(uloss ~ tc_b + ublkb + mpa_a + mpa_b + 
            sys_a + sys_b + usr_a, data = data81, method = "anova",
            control = rpart.control(minsplit = 2, minbucket = 1))

would try to fit a full tree, but it will be hopelessly over-fitted to the data and you must prune it back using prune(). However, that might assure you that rpart() used all the data.
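One common way to do the pruning is to pick the complexity parameter with the lowest cross-validated error from the fitted tree's cptable. The sketch below is one such approach, not the only one. It uses mtcars in place of data81, and it also sets cp = 0, which is an addition to the call above: with the default cp = 0.01, rpart still skips splits that improve the fit by less than that amount.

```r
library(rpart)

set.seed(1)  # cross-validation folds are random, so fix the seed for repeatability

# Grow a deliberately over-fitted tree (cp = 0 is an assumption added for this sketch)
full <- rpart(mpg ~ ., data = mtcars, method = "anova",
              control = rpart.control(minsplit = 2, minbucket = 1, cp = 0))

# cptable holds one row per candidate subtree, with cross-validated error in "xerror"
cp_tab  <- full$cptable
best_cp <- cp_tab[which.min(cp_tab[, "xerror"]), "CP"]

# Cut the tree back to the subtree associated with that cp value
pruned <- prune(full, cp = best_cp)

# Compare tree sizes (number of nodes) before and after pruning
print(c(full = nrow(full$frame), pruned = nrow(pruned$frame)))
```

Some people prefer the one-standard-error rule (the smallest tree whose xerror is within one xstd of the minimum), which tends to give a simpler tree.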
