As your figure illustrates, a single decision tree will underperform an SVM on most problems. But an ensemble of trees, such as a random forest, is certainly a useful tool to have. Gradient boosting is another great tree-derived tool.
SVM and random forest (RF) are of course not alike as algorithms, but both are useful for the same kinds of problems and produce models of comparable flexibility. The explicitly stated model structures are quite different: a forest of trees is a hierarchy of ordered binary step functions, while SVM regression operates in a Hilbert space. But if you focus on the actual learned mapping from feature space to prediction space, the two algorithms often produce models with similar predictions. When extrapolating outside the feature-space region covered by the training examples, however, the "personality" of each model takes over, and SVM and RF predictions will strongly disagree. See the example below: both SVM and RF do a terrible job of extrapolating, each in its own way.
$y = \sin(\pi x_1) - 0.5 x_2^2$
So no, SVM is not trying to figure out the underlying true equation, and certainly no more than RF is. I disagree with the Platonist viewpoint that expects real-life problems to be governed by the algebra/calculus we humans happen to teach each other in high school and university. Yes, in some cases such simply stated equations are fair approximations of the underlying system; we see that in classical physics and accounting. But that does not mean the equation is the true hidden reality. "All models are wrong, but some are useful" would be one statement in a conversation going on from here...
It does not matter whether you use SVM, RF, or any other appropriate estimator. You can always inspect the model structure and perhaps realize that the problem can be described by some simple equation, or even develop a theory explaining the observations. It becomes a little tricky in high-dimensional spaces, but it is possible.
In general, consider RF over SVM when:
- You have more than 1 million samples
- Your features are categorical with many levels (though not more than 10)
- You would like to distribute the training on several computers
- Simply when a cross-validated test suggests RF works better than SVM for the given problem.
library(randomForest) # randomForest
library(e1071)        # svm
library(forestFloor)  # vec.plot and fcol
library(rgl)          # plot3d

# generate some data
set.seed(1) # for reproducibility
X = data.frame(replicate(2, (runif(2000) - .5) * 6))
y = sin(X[,1] * pi) - .5 * X[,2]^2
plot3d(data.frame(X, y), col = fcol(X))

# train an RF model (default params are nearly always quite OK)
rf = randomForest(X, y)
vec.plot(rf, X, 1:2, zoom = 3, col = fcol(X))

# train an SVM model (with some reasonable params)
sv = svm(X, y, gamma = 1, cost = 50)
vec.plot(sv, X, 1:2, zoom = 3, col = fcol(X))
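The last bullet in the list above — preferring whichever model cross-validates better — can be sketched as follows. This is a self-contained illustration that regenerates the same kind of simulated data; the fold count and SVM hyperparameters are illustrative, not tuned:

```r
library(randomForest)
library(e1071)

# regenerate data of the same form as above
set.seed(42)
X = data.frame(replicate(2, (runif(2000) - .5) * 6))
y = sin(X[,1] * pi) - .5 * X[,2]^2

# 5-fold cross-validated mean squared error for both models
folds = sample(rep(1:5, length.out = nrow(X)))
mse = function(pred, obs) mean((pred - obs)^2)
rf.mse = sv.mse = numeric(5)
for (k in 1:5) {
  train = folds != k
  rf = randomForest(X[train, ], y[train])
  sv = svm(X[train, ], y[train], gamma = 1, cost = 50)
  rf.mse[k] = mse(predict(rf, X[!train, ]), y[!train])
  sv.mse[k] = mse(predict(sv, X[!train, ]), y[!train])
}
mean(rf.mse); mean(sv.mse) # pick whichever is lower
```

Which model wins depends on the problem and on the hyperparameters, so treat this as the selection procedure, not as a verdict.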
Best Answer
This depends on your classification problem and whether it is easily solvable by a few splits in a few variables or not. There are real classification problems with relatively simple patterns where a couple of splits already yield very good results - but, of course, there are also problems where you need many observations from more variables to get meaningful results. The classic textbook example for the former is the iris dataset. In R, a tree for it can be fitted with the `partykit` package in a few lines.
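A minimal sketch of that fit (the original code and its printed output did not survive formatting; this assumes the standard `ctree` interface from `partykit`):

```r
library(partykit) # conditional inference trees

# fit a conditional inference tree to the iris data:
# Species as response, all four measurements as predictors
ct <- ctree(Species ~ ., data = iris)

print(ct) # text representation of the fitted splits
plot(ct)  # tree diagram with per-node class distributions
```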
So here we supply 150 observations of four predictor variables, only two of which are used by the algorithm. The classification performs very well, with only the small node [6] having a hard time separating the versicolor and virginica species. Using `rpart` instead of `partykit` yields similar results. Whether or not tree-based methods perform well on your particular data is hard to predict, but you can easily try this out using standard software (e.g.,
`partykit` or `rpart` in R, or via the `caret` interface) and compare the results with other techniques.
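For completeness, a sketch of the `rpart` variant mentioned above (the original snippet was lost in formatting; `as.party` from `partykit` is one way to get the nicer print and plot methods for the same fitted tree):

```r
library(rpart)    # CART implementation
library(partykit) # for as.party()

# the same exercise with rpart's CART trees
rt <- rpart(Species ~ ., data = iris)
print(rt) # text representation of the splits

# convert to a party object for partykit-style output
plot(as.party(rt))
```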