Solved – Decision tree with imbalanced data not affected by pruning

Tags: cart, machine-learning, r, rpart

I am looking to use a Decision Tree to classify whether or not a car will sell based on attributes of that car. The attributes that I have include price, year, mileage, condition (new, pre-owned, or used), number of cylinders (4, 6, 8), transmission_type (manual, auto, other). I have complete data for nearly 1,500 cars, of which 116 had sold.

I have followed many tutorials, and the process I am following is as follows:

  • Randomly partition the data into 70% train / 30% test
  • Upsample the training set to eliminate the imbalance in selling status
  • Grow out complete decision tree with training data
  • Determine where to cut the decision tree based on minimum CP
  • Prune decision tree based on minimum CP
  • Apply the pruned tree to the test data
  • Evaluate fit based on confusion matrix

The problem I'm experiencing is that the pruned tree doesn't look much different from the complete tree. Also, the model does poorly at correctly classifying observations in the minority class.

My question is whether my attempt at pruning is actually doing anything. What am I missing in this process? I know there are other ways to address the imbalanced training data, but I'm not sure whether that's the problem or something else is causing the issue.

If you are interested in looking at the data, I have made it available at the following URL: http://pastebin.com/qJkCmR6x

In addition, I have included my code below for your review. Please let me know if you have any thoughts on how I could improve the minority classification in this situation.

library(rpart)
library(caret)

# read CSV data into df
df <- read.csv("data.csv")

# set variables type accordingly
df$price <- as.numeric(df$price)
df$year <- as.ordered(factor(df$year))
df$condition <- factor(df$condition, levels=c("Used", "Certified pre-owned", "New"), ordered=TRUE)
df$numberofcylinders <- as.ordered(factor(df$numberofcylinders))
df$transmission_type <- as.factor(df$transmission_type)
df$status <- as.factor(df$status)


## Create training and test data
# figure out 70% sample size
smp_size <- floor(0.70 * nrow(df))

# partition data into train and test
set.seed(123)
train_ind <- sample(seq_len(nrow(df)), size = smp_size)
train <- df[train_ind, ]
test <- df[-train_ind, ]
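
With only 116 sold cars, a plain random split like the one above can leave the test set with very few positives. A stratified split is an alternative; a minimal sketch using caret's createDataPartition (not what I used above):

# stratified 70/30 split preserving the 0/1 proportions of status
set.seed(123)
train_ind <- createDataPartition(df$status, p = 0.70, list = FALSE)
train <- df[train_ind, ]
test  <- df[-train_ind, ]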

# upsample training data for equal proportions of 1 and 0
up_train <- upSample(x = train[, -ncol(train)],
                     y = train$status)
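
Since upSample names its outcome column Class by default (which is why the formula below uses Class rather than status), a quick sanity check that the classes are now balanced:

# class counts before and after upsampling
table(train$status)   # imbalanced
table(up_train$Class) # should show equal counts of 0 and 1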

## Fit Decision Tree
# grow tree out completely
fit <- rpart(Class ~ price + year + mileage + condition + numberofcylinders + transmission_type,
            data = up_train,                   
            method = "class",                     
            parms = list(split = 'information'),
            maxsurrogate = 0,                     
            cp = 0,                              
            minsplit = 5,                                                             
            minbucket = 2,
            xval = 10)

# plot tree
plot(fit, uniform=TRUE, main="Decision Tree to Predict If Car Sold")
text(fit, use.n=TRUE, all=TRUE, cex=.8)

[Figure: full decision tree]

# display the results
printcp(fit)

# detailed summary of splits
summary(fit)

# visualize cross validation results
plotcp(fit)

[Figure: plotcp output (cross-validated error vs. size of tree)]

# determine where to cut the tree
fit$cptable[which.min(fit$cptable[,"xerror"]),"CP"]

# prune the tree to prevent overfitting
pfit<- prune(fit, cp = fit$cptable[which.min(fit$cptable[,"xerror"]),"CP"])
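
For comparison, the 1-SE rule prunes more aggressively: it picks the largest cp whose cross-validated error is within one standard error of the minimum, rather than the minimum itself. A sketch using the same cptable:

# 1-SE rule: smallest tree whose xerror is within one SE of the minimum
min_row   <- which.min(fit$cptable[, "xerror"])
threshold <- fit$cptable[min_row, "xerror"] + fit$cptable[min_row, "xstd"]
cp_1se    <- fit$cptable[min(which(fit$cptable[, "xerror"] <= threshold)), "CP"]
pfit_1se  <- prune(fit, cp = cp_1se)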


# show results of pruned tree
summary(pfit)

# plot pruned results
plot(pfit, uniform=TRUE, main="Pruned Decision Tree to Predict If Car Sold")
text(pfit, use.n=TRUE, all=TRUE, cex=.8)

[Figure: pruned decision tree]

# fit pruned tree to test data
pred <- predict(pfit, test, type = "class")

# the breakdown of the actual status in the test data
table(test$status)

  0   1 
413  36 

# review predicted vs actual
table(pred, test$status)

pred   0   1
   0 392  30
   1  21   6

# calculate accuracy
sum(test$status==pred)/length(pred)

0.88641425389755
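
Since the test set has 413 unsold vs. 36 sold cars, always predicting 0 would already score 413/449 ≈ 0.92, so raw accuracy says little here. caret's confusionMatrix reports sensitivity and specificity directly, treating "1" (sold) as the positive class:

# class-aware metrics for the minority class
confusionMatrix(pred, test$status, positive = "1")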

Best Answer

Old thread, but I think the "problem" is that your tree is already very good, so pruning has no effect. Pruning eliminates leaves that do not significantly improve accuracy, i.e. it prevents overfitting. Your plotcp call plots the model's out-of-sample error, computed by rpart through cross-validation: pruning simply removes leaves until you reach the minimum of that curve. Since the minimum is at cp = 0, there is nothing to prune.
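
If the goal is better minority-class recall, pruning isn't the lever; a cost-sensitive tree is one option. rpart accepts a loss matrix via parms, making a missed sale more expensive than a false positive. A minimal sketch (the cost of 5 is illustrative and assumes the factor levels are ordered c("0", "1")):

# loss[i, j] = cost of predicting class j when the true class is i;
# here a missed sale (true 1, predicted 0) costs 5x a false positive
loss <- matrix(c(0, 1,
                 5, 0), nrow = 2, byrow = TRUE)
fit_cost <- rpart(Class ~ price + year + mileage + condition +
                    numberofcylinders + transmission_type,
                  data = up_train, method = "class",
                  parms = list(split = "information", loss = loss))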
