Number of variables for decision trees

cart, machine learning

I have a data set with just 5 independent variables and a response, and I am dealing with a classification problem.
Will decision trees perform well, or does the number of variables have to be higher to get something really useful out of tree-based methods?
What is an appropriate number of variables that would encourage one to use decision trees?

Best Answer

This depends on your classification problem and on whether it can be solved by a few splits in a few variables. There are real classification problems with relatively simple patterns where a couple of splits already yield very good results. But, of course, there are also problems where you need many observations of more variables to get meaningful results. The classic textbook example of the former is the iris dataset. In R, using the partykit package, you can do:

library("partykit")
ct <- ctree(Species ~ Sepal.Length + Sepal.Width + Petal.Length + Petal.Width,
  data = iris)
ct

which yields

Model formula:
Species ~ Sepal.Length + Sepal.Width + Petal.Length + Petal.Width

Fitted party:
[1] root
|   [2] Petal.Length <= 1.9: setosa (n = 50, err = 0.0%)
|   [3] Petal.Length > 1.9
|   |   [4] Petal.Width <= 1.7
|   |   |   [5] Petal.Length <= 4.8: versicolor (n = 46, err = 2.2%)
|   |   |   [6] Petal.Length > 4.8: versicolor (n = 8, err = 50.0%)
|   |   [7] Petal.Width > 1.7: virginica (n = 46, err = 2.2%)

Number of inner nodes:    3
Number of terminal nodes: 4

So here we supply 150 observations of four predictor variables, only two of which (Petal.Length and Petal.Width) are used by the algorithm. The classification performs very well, with only the small node [6] having a hard time separating the versicolor and virginica species. Using rpart instead of partykit yields similar results. Use:

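library("rpart")
rp <- rpart(Species ~ Sepal.Length + Sepal.Width + Petal.Length + Petal.Width,
  data = iris)
# as.party() comes from partykit (loaded above) and prints the rpart tree
# in the same format as the ctree output
as.party(rp)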
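
To put a number on "performs very well" and "similar results", one option is to tabulate the fitted classes against the observed species for both trees. This is only a sketch: the formula `Species ~ .` is shorthand for the four predictors used above, and the object names `ct_pred` and `rp_pred` are made up for illustration.

```r
library("partykit")
library("rpart")

# Same models as above; "Species ~ ." uses all four remaining columns of iris
ct <- ctree(Species ~ ., data = iris)
rp <- rpart(Species ~ ., data = iris)

# Predicted class for every observation in the training data
ct_pred <- predict(ct, newdata = iris, type = "response")
rp_pred <- predict(rp, newdata = iris, type = "class")

# Confusion matrices (rows: predicted, columns: observed)
print(table(predicted = ct_pred, observed = iris$Species))
print(table(predicted = rp_pred, observed = iris$Species))

# In-sample accuracy; optimistic, because the same data were used for fitting
ct_acc <- mean(ct_pred == iris$Species)
rp_acc <- mean(rp_pred == iris$Species)
print(c(ctree = ct_acc, rpart = rp_acc))
```

Because these figures are computed on the data the trees were grown on, they overstate how well the trees would do on new observations.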

Whether or not tree-based methods perform well on your particular data is hard to predict. But you can easily try this out using standard software (e.g., partykit or rpart in R or using the caret interface) and compare the results with other techniques.
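
As a rough sketch of such a comparison, the following runs a hand-rolled 10-fold cross-validation on iris and compares an rpart tree with linear discriminant analysis. The choice of `lda()` from MASS as the "other technique", the number of folds, and the seed are all arbitrary assumptions made for illustration; on your own data you would swap in your data frame, response, and candidate methods.

```r
library("rpart")
library("MASS")  # lda(), used here only as an example comparison method

set.seed(1)  # arbitrary seed so that the fold assignment is reproducible
k <- 10
# Randomly assign each observation to one of k folds of (nearly) equal size
folds <- sample(rep(seq_len(k), length.out = nrow(iris)))

# For each fold: fit on the other folds, measure accuracy on the held-out fold
cv_acc <- sapply(seq_len(k), function(i) {
  train <- iris[folds != i, ]
  test  <- iris[folds == i, ]
  tree_fit <- rpart(Species ~ ., data = train)
  lda_fit  <- lda(Species ~ ., data = train)
  c(tree = mean(predict(tree_fit, newdata = test, type = "class") == test$Species),
    lda  = mean(predict(lda_fit, newdata = test)$class == test$Species))
})

# Average out-of-fold accuracy per method
cv_mean <- rowMeans(cv_acc)
print(cv_mean)
```

If the tree's cross-validated accuracy is close to that of the alternatives, the small number of variables is evidently not a handicap for that particular problem.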
