Solved – Understanding RPART model results

Tags: classification, r, rpart

I have operational fault data and maintenance data. The operational fault data was used to determine whether maintenance improved the fault indicator (true/false), and the maintenance data was used to identify which maintenance actions were performed. RPART was used to generate a model, with the maintenance actions as independent variables and operational fault reduction as the categorical output (true/false). I subtracted 0.5 from the operational fault data, so the values were -0.5 and 0.5 instead of 0 and 1.

I don't understand how to interpret the plot of the rpart model. How do I determine, or indicate, which of the bottom nodes correspond to true or false? Also, what do the colors indicate?

R commands

library(rpart)
library(rattle)  # provides fancyRpartPlot()

# maintenanceActions / faultImproved stand in for the real columns;
# note the -0.5 shift applied to the 0/1 fault-improved indicator
subdata <- data.frame(x = maintenanceActions,
                      y = faultImproved - 0.5)
rtreeFit <- rpart(y ~ ., data = subdata)
fancyRpartPlot(rtreeFit, main = "RPART:", sub = cName)  # cName is defined elsewhere
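To see which class each terminal node predicts, you can read it off the fitted object rather than the plot. Here is a minimal sketch using the `kyphosis` data that ships with rpart (a stand-in, since the maintenance data isn't available), with the outcome supplied as a factor so rpart builds a classification tree:

```r
library(rpart)

# Classification fit on the example data bundled with rpart
fit <- rpart(Kyphosis ~ Age + Number + Start, data = kyphosis,
             method = "class")

# The printed tree labels every node with its predicted class
print(fit)

# Leaves are the rows of fit$frame where var == "<leaf>"
leaves <- fit$frame[fit$frame$var == "<leaf>", ]

# yval indexes the predicted factor level; ylevels holds the labels
predicted_class <- attr(fit, "ylevels")[leaves$yval]
predicted_class
```

In `fancyRpartPlot`, the leaf label and shading follow this predicted class, so mapping `yval` through `ylevels` tells you which bottom nodes are true and which are false.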

Is it possible to draw a histogram for each leaf showing the distribution of classifications?
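One way to get the per-leaf class distribution is through `fit$where`, which records the row of `fit$frame` (i.e., the leaf) that each training observation falls into. A sketch, again using the bundled `kyphosis` data as a stand-in:

```r
library(rpart)

fit <- rpart(Kyphosis ~ Age + Number + Start, data = kyphosis,
             method = "class")

# fit$where maps each training row to the frame row of its leaf
leaf_of_row <- fit$where

# Cross-tabulate true class against leaf membership:
# one column per leaf, one row per class
tab <- table(kyphosis$Kyphosis, leaf_of_row)

# A stacked barplot of this table is the per-leaf class histogram
barplot(tab, legend.text = TRUE,
        xlab = "leaf (row of fit$frame)", ylab = "count")
```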

[Plot of the fitted rpart tree]

Here's the updated code:

# rowIndx selects the training subset used for this fit
y_subdata <- factor(y_training[rowIndx])  # factor outcome -> classification tree
x_subdata <- x_training[rowIndx, ]
subdata <- data.frame(x = x_subdata, y = y_subdata)
fit <- rpart(y ~ ., method = "class", data = subdata,
             control = rpart.control(minsplit = 3, cp = 0.0001))

The numbers in the plot are hard to read, but what do they mean?
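The per-node numbers that `fancyRpartPlot` prints (predicted class, class proportions, and the share of training rows reaching the node) all come from the fitted object's `frame`. A sketch of where to find them, using the `kyphosis` data bundled with rpart as a stand-in:

```r
library(rpart)

fit <- rpart(Kyphosis ~ Age + Number + Start, data = kyphosis,
             method = "class")

# n: observations reaching each node; yval: predicted class index
fit$frame[, c("var", "n", "yval")]

# For a classification fit, yval2 packs per node: the predicted
# class index, per-class counts, per-class proportions, and the
# node's share of the training data
fit$frame$yval2
```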

[Plot of the updated rpart tree]

Best Answer

One thing that concerns me is that at least two of the variables show up at multiple nodes. I have run a lot of recursive partitioning with RPART and have come to recognize multiple nodes splitting on the same variable as a sign that the tree may be unreliable (e.g., nodes 1, 3, and 11 all split on "x.sum_manhours").

I am also not sure why you subtracted 0.5 from your operational fault outcome variable. It looks like an attempt to center the data, but your outcome is a categorical (factor) variable, so centering means nothing. Worse, by subtracting 0.5 you may have made your program treat the outcome as continuous, in which case your RPART procedure created a regression tree (continuous outcome) instead of a classification tree (categorical outcome).

Finally, there are bootstrapping techniques for checking the stability of your tree that you might consider.
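The regression-vs-classification point is easy to check: rpart chooses its method from the type of the outcome, and records the choice in `fit$method`. A minimal sketch on synthetic data (the variable names here are illustrative, not from the original analysis):

```r
library(rpart)

set.seed(1)
x <- data.frame(a = rnorm(40), b = rnorm(40))
y_logical <- x$a + x$b > 0

# Numeric outcome (e.g. a 0/1 indicator shifted by -0.5):
# rpart defaults to a regression tree
fit_num <- rpart(y ~ ., data = cbind(x, y = as.numeric(y_logical) - 0.5))
fit_num$method  # "anova" -- a regression tree

# Factor outcome: a classification tree
fit_fac <- rpart(y ~ ., method = "class",
                 data = cbind(x, y = factor(y_logical)))
fit_fac$method  # "class"
```

If `fit$method` on the original fit says `"anova"`, the -0.5 shift did indeed turn the problem into regression; converting the outcome to a factor (as in the updated code) fixes it.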