In fact there are two types of factors -- ordered (like Tiny < Small < Medium < Big < Huge) and unordered (Cucumber, Carrot, Fennel, Aubergine).
First class is the same as continuous ones -- there is only easier to check all pivots, there is also no problem with extending levels list.
For the second class, you have to make a set of elements that will be directed in one branch, leaving the rest into the other -- in this case you can either:
- throw error
- assume that unseen class goes into your favorite branch
- treat this as NA and select branch in more-less random way.
Now, direct treating of unordered factors is tricky so many algorithms "cheat" and claim unordered factors as ordered ones, so they don't even touch this problem*. The rest usually use int mask, i.e. optimize an integer number from $1$ to $2^{\text{#categories}-1}-1$ and treat $i$-th bit as a branch selection for a factor level $i$. (Ever wondered why there is frequent limit to 32 levels?) In this setting, it is quite natural that unseen levels go silently into "0" branch. Yet this does not seem too "right", because why really should we do this? And what about the number of levels required to entropy leverage in attribute selection?
I would say the the most sensible idea is to make the user define the full set of factors (for instance R does this organically, preserving levels through subset operations) and use option 1. for not-declared levels and option 2. for declared ones. Option 3. may make sense if you already have some NA handling infrastructure.
*) There is also side strategy to do some non-trivial re-coding of levels into numbers, like for instance Breiman encoding -- yet this generates even more problems.
Best Answer
Have you checked the package
party
? I believe the functionctree
handles censored data.