Dear All,
I am trying to build a classification tree for a response variable Y from a set of explanatory variables X. I want to use CART, so I am using the function ClassificationTree.fit(X,Y) in MATLAB.
My variable Y has two categories, 's' and 'n', and 'n' is rare: only about 5% of the observations belong to it, so the large majority of the Ys are of class 's'.
The grown tree has about 8-10 levels, and its terminal nodes contain very few observations. Call the grown tree tree. If I then run: [~,~,~,bestLevel]=cvLoss(tree,'subtrees','all');
I get that bestLevel is the root (!), meaning every future prediction would be of just one class. Could it be that my predictors in X are bad, or am I doing something very wrong here?
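For concreteness, the workflow described above can be sketched roughly as follows. This is only an illustration: the data are made-up stand-ins (random predictors and about 5% 'n' labels), the variable names are placeholders, and it assumes the function meant is ClassificationTree.fit from the Statistics Toolbox (newer releases offer fitctree for the same purpose).

```matlab
% Hypothetical stand-in data, NOT the real data set:
% X is an n-by-p numeric matrix, Y a cell array of class labels.
rng(1);                               % make the example reproducible
n = 1000;
X = randn(n, 3);
Y = repmat({'s'}, n, 1);
Y(rand(n, 1) < 0.05) = {'n'};         % roughly 5% of rows get the rare class

% Grow the tree (predictors first, then the response).
tree = ClassificationTree.fit(X, Y);

% Cross-validated loss for every pruning level of the grown tree.
[err, se, nLeaf, bestLevel] = cvLoss(tree, 'subtrees', 'all');

% The largest pruning level corresponds to the root-only tree,
% so bestLevel == rootLevel means "prune everything".
rootLevel = max(tree.PruneList);
fprintf('bestLevel = %d, root level = %d\n', bestLevel, rootLevel);

% Prune to the suggested level and print the resulting tree as text.
prunedTree = prune(tree, 'level', bestLevel);
view(prunedTree, 'mode', 'text');
```

Comparing bestLevel with max(tree.PruneList) is just a quick way to confirm that the suggested level really is the root rather than a small but non-trivial subtree.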
I was also wondering: when constructing the initial tree, does ClassificationTree.fit() automatically prune the tree to an "optimal size" before returning it, or does it grow as big a tree as possible and leave the pruning to the user?
Best Answer