Solved – How to use estimated probabilities of a class from rpart to identify the top N classes

cart, classification, r, rpart

Using the rpart library, I'm trying to predict which class each observation belongs to. Here is a reproducible example explaining the steps I am taking:

library(rpart)

# training set
df_train <- data.frame(
  tag = c('123', '123', '124', '124', '125'),
  p1 = c('home', 'work', 'work', 'work', 'home'),
  p2 = c(1, 1, 1, 0, 1)
)

# testing set
df_test <- data.frame(
  tag = c('123', '124', '125'),
  p1 = c('home', 'work', 'home'),
  p2 = c(1, 1, 0)
)

# train model
model.rpart = rpart(tag~p1+p2, data=df_train, method="class")

# predict probabilities of class
pred.rpart = predict(model.rpart, data=df_test, method="prob")

# list out results
pred.rpart

My problem is that I don't fully understand the output of the table pred.rpart:

> pred.rpart
  123 124 125
1 0.4 0.4 0.2
2 0.4 0.4 0.2
3 0.4 0.4 0.2
4 0.4 0.4 0.2
5 0.4 0.4 0.2

I thought it was giving me a list of probabilities for each class in my test dataset, but I don't understand why there are five rows, when I am just trying to look at the predictions of the test data set.

Why does pred.rpart contain five rows of data?

My overall objective is to find the top N predictions for a class. So for the first observation in my df_test dataframe, I would like to be able to say:

Top 2 predictions for the first observation:
  #1: '123': 40%
  #2: '124': 40% 

Once I understand the output of pred.rpart, I want to summarize it using the following command, which gives me the class predictions for each observation, ordered by probability:

n_classes <- 2
apply(pred.rpart,1,function(xx)head(names(sort(xx, decreasing=T)), n_classes))

Best Answer

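What you have is almost correct. You are getting five rows because you are getting the predictions for df_train: you wrote data=df_test, but predict wants the argument newdata=df_test. predict.rpart has no data argument, so data=df_test is silently absorbed by ... and the fitted values for the training data are returned. That is why you get the same answer if you simply omit data=df_test. Likewise, the kind of output is selected with type="prob", not method="prob"; for a classification tree, class probabilities are the default when type is omitted, which is why you still got a probability table. Take a look at the help page ?predict.rpart or even just args(rpart:::predict.rpart).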
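As a minimal sketch of the corrected call, using the same toy data as the question: the helper top_n_classes is a hypothetical name introduced only for illustration, and stringsAsFactors = TRUE is an assumption added so the example behaves the same across R versions.

```r
library(rpart)

# Same toy data as in the question; strings are stored as factors explicitly
# (an assumption made here so the example does not depend on the R version).
df_train <- data.frame(
  tag = c('123', '123', '124', '124', '125'),
  p1  = c('home', 'work', 'work', 'work', 'home'),
  p2  = c(1, 1, 1, 0, 1),
  stringsAsFactors = TRUE
)
df_test <- data.frame(
  tag = c('123', '124', '125'),
  p1  = c('home', 'work', 'home'),
  p2  = c(1, 1, 0),
  stringsAsFactors = TRUE
)

# train model
model.rpart <- rpart(tag ~ p1 + p2, data = df_train, method = "class")

# newdata (not data) selects the rows to predict;
# type (not method) selects the kind of output
pred.rpart <- predict(model.rpart, newdata = df_test, type = "prob")
nrow(pred.rpart)  # one row per row of df_test

# Hypothetical helper: the n most probable classes for one observation,
# returned as a named vector of probabilities
top_n_classes <- function(probs, n) {
  head(sort(probs, decreasing = TRUE), n)
}

n_classes <- 2
top <- lapply(seq_len(nrow(pred.rpart)),
              function(i) top_n_classes(pred.rpart[i, ], n_classes))
top[[1]]  # top classes (names) and their probabilities for the first test row
```

With only five training rows the tree has no splits under rpart's default minsplit of 20, so every test observation receives the training class proportions (0.4, 0.4, 0.2). The ordering between the tied classes '123' and '124' therefore carries no information.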
