In fact there are two types of factors -- ordered (like Tiny < Small < Medium < Big < Huge) and unordered (Cucumber, Carrot, Fennel, Aubergine).
The first class is handled the same way as continuous attributes -- it is actually easier, since there are only finitely many pivots to check, and there is also no problem with extending the list of levels.
For the second class, you have to build a subset of levels that will be directed into one branch, leaving the rest to the other -- and when a previously unseen level shows up, you can either:
- throw an error,
- assume that the unseen level goes into your favorite branch,
- treat it as NA and select a branch in a more-or-less random way.
Now, directly handling unordered factors is tricky, so many algorithms "cheat" and treat unordered factors as if they were ordered, so they never touch this problem*. The rest usually use an integer mask, i.e. they optimize an integer from $1$ to $2^{\text{#categories}-1}-1$ and treat the $i$-th bit as the branch selection for factor level $i$. (Ever wondered why there is frequently a limit of 32 levels?) In this setting it is quite natural that unseen levels silently go into the "0" branch. Yet this does not seem quite right -- why, really, should we do this? And what about the bias that the number of levels introduces into entropy-based attribute selection?
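To make the integer-mask idea concrete, here is a minimal R sketch (illustrative names only, not the internals of any particular library):

```r
# Minimal sketch of an integer-mask split over unordered factor levels.
# Level j goes to the "left" branch iff bit j of the mask is set.
levels_ <- c("Cucumber", "Carrot", "Fennel", "Aubergine")

goes_left <- function(mask, level) {
  j <- match(level, levels_) - 1L     # 0-based bit index of the level
  if (is.na(j)) return(FALSE)         # unseen level: no bit set, "0" branch
  bitwAnd(mask, bitwShiftL(1L, j)) != 0L
}

# For k levels there are 2^(k-1) - 1 distinct non-trivial splits to try.
mask <- 5L                   # binary 0101: Cucumber and Fennel go left
goes_left(mask, "Cucumber")  # TRUE
goes_left(mask, "Carrot")    # FALSE
goes_left(mask, "Parsnip")   # FALSE -- unseen level falls into the "0" branch
```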
I would say that the most sensible idea is to make the user define the full set of factor levels (for instance R does this organically, preserving levels through subset operations) and use option 1. for undeclared levels and option 2. for declared ones. Option 3. may make sense if you already have some NA-handling infrastructure.
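For reference, this is what "preserving levels through subset operations" looks like in base R:

```r
# Declared levels survive subsetting, even levels that never occur.
f <- factor(c("Carrot", "Fennel"),
            levels = c("Cucumber", "Carrot", "Fennel", "Aubergine"))
sub <- f[f == "Carrot"]
levels(sub)
# All four declared levels are still there, even though only "Carrot"
# actually occurs -- so a model can still see the full level set.
```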
*) There is also a side strategy of doing some non-trivial re-coding of levels into numbers, like for instance the Breiman encoding -- yet this generates even more problems.
In practice I have seen both approaches taken, and generally I would not expect your results to differ much either way.
That being said, Hastie et al. recommend the "one-standard-error" rule in The Elements of Statistical Learning, and I tend to trust their judgment (Section 7.10, p. 244 in my version). The relevant quote is:
> Often a "one-standard error" rule is used with cross-validation, in which we choose the most parsimonious model whose error is no more than one standard error above the error of the best model.
Your intuition for why one would follow the one-standard error rule is right - you would do that to avoid selecting a model that overfits the data.
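A hedged sketch of applying the one-standard-error rule by hand to an `rpart` fit, using the column names of `rpart`'s `cptable`; the formula and data are just placeholders taken from the package's built-in `kyphosis` example:

```r
# One-standard-error rule on rpart's cross-validation table.
library(rpart)
set.seed(1)
fit <- rpart(Kyphosis ~ Age + Number + Start, data = kyphosis)

tab <- fit$cptable
best <- which.min(tab[, "xerror"])                  # lowest CV error
threshold <- tab[best, "xerror"] + tab[best, "xstd"]
# Most parsimonious model (first row, i.e. fewest splits) whose CV
# error is within one standard error of the best model:
chosen <- min(which(tab[, "xerror"] <= threshold))
pruned <- prune(fit, cp = tab[chosen, "CP"])
```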
Best Answer
If you re-run just `plotcp`, you should get the exact same plot. But if you re-run `rpart`, you will get different fits, because randomization is involved (the cross-validation folds are drawn at random). You can avoid this by setting a seed before each run of `rpart`. Two calls to `rpart` without a seed will yield two different trees and thus two different fits, and the plot will reflect that; but two calls each preceded by the same `set.seed` should be identical, as should the resulting plots.
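A short sketch of the point above, using `rpart`'s built-in `kyphosis` data as a stand-in for whatever formula and data you are actually fitting:

```r
library(rpart)

# Without a seed, the cross-validation folds differ between runs,
# so the cptable (and hence the plotcp plot) can differ:
fit1 <- rpart(Kyphosis ~ Age + Number + Start, data = kyphosis)
fit2 <- rpart(Kyphosis ~ Age + Number + Start, data = kyphosis)

# Setting the same seed before each call makes the runs identical:
set.seed(42)
fit3 <- rpart(Kyphosis ~ Age + Number + Start, data = kyphosis)
set.seed(42)
fit4 <- rpart(Kyphosis ~ Age + Number + Start, data = kyphosis)
identical(fit3$cptable, fit4$cptable)  # TRUE

plotcp(fit3)  # re-running this line alone always gives the same plot
```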