It looks to me like classregtree is just building a tree, not using any of these methods, all of which are supplementary to tree building. That is, classregtree is implementing the methods described in Breiman et al., per the reference given in the documentation. It builds a tree and then (by default) prunes it.
I have implemented my solution to this. I wrote two functions:
prox_matrix(df, target, features, cluster_dimension, trees=10)
Parameters
- df: Input dataframe
- target: Dependent variable you are trying to predict with the random forest
- features: List of independent variables
- cluster_dimension: Dimension you would like to cluster/pool and add to your list of features
- trees: The number of trees to use in your random forest
Returns
- D: DataFrame of the proximity-based distance matrix (1 - average proximity) between the levels of cluster_dimension
Code Below
def prox_matrix(df, target, features, cluster_dimension, trees=10):
    # Proximity measure: https://www.stat.berkeley.edu/~breiman/RandomForests/cc_home.htm#prox
    from sklearn.ensemble import RandomForestRegressor
    import numpy as np
    import pandas as pd
    # Initialize dataframe for independent variables
    independent = pd.DataFrame()
    # Handle categoricals: this should really be added to RandomForestRegressor
    for column, data_type in df[features].dtypes.items():
        try:
            independent[column] = pd.to_numeric(df[column], downcast='integer')
        except ValueError:
            contains_nulls = df[column].isnull().values.any()
            dummies = pd.get_dummies(df[column], prefix=column, dummy_na=contains_nulls, drop_first=True)
            independent[dummies.columns] = dummies
    if len(independent.index) != len(df.index):
        raise Exception('independent variables not stored properly')
    # Train model
    clf = RandomForestRegressor(n_estimators=trees, n_jobs=-1)
    clf.fit(independent, df[target])
    # Terminal leaf of each record in each tree: shape (n_records, n_trees)
    leaves = clf.apply(independent)
    # Value in the cluster dimension for each record
    labels = df[cluster_dimension].values
    # Count leaf co-occurrences for every pair of records, pooled by label pair
    numerator_matrix = {}
    for i, value_i in enumerate(labels):
        for j, value_j in enumerate(labels):
            if i >= j:
                numerator_matrix[(value_i, value_j)] = numerator_matrix.get((value_i, value_j), 0) + np.count_nonzero(leaves[i] == leaves[j])
                numerator_matrix[(value_j, value_i)] = numerator_matrix[(value_i, value_j)]
    # Normalize by the total number of possible matching leaves and
    # convert proximity to distance (1 - proximity)
    prox_matrix = {key: 1.0 - float(x) / (trees * np.count_nonzero(labels == key[0]) * np.count_nonzero(labels == key[1]))
                   for key, x in numerator_matrix.items()}
    # Make a sorted dataframe
    levels = np.unique(labels)
    D = pd.DataFrame(data=[[prox_matrix[(i, j)] for i in levels] for j in levels],
                     index=levels, columns=levels)
    return D
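For reference, a minimal usage sketch. The dataframe and all column names here are made up purely for illustration:

    import pandas as pd

    df = pd.DataFrame({
        'store':  ['A', 'A', 'B', 'B', 'C', 'C'],     # dimension to pool/cluster
        'region': ['N', 'N', 'S', 'S', 'S', 'N'],     # categorical feature
        'price':  [1.0, 1.2, 0.9, 1.1, 1.3, 0.8],     # numeric feature
        'sales':  [10.0, 12.0, 9.0, 11.0, 13.0, 8.0]  # target
    })

    D = prox_matrix(df, 'sales', ['region', 'price'], 'store', trees=50)
    print(D)  # 3x3 symmetric DataFrame of 1 - proximity, indexed by store level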
kMedoids(D, k, tmax=100)
Parameters
- D: Proximity/distance matrix
- k: Number of clusters
- tmax: Maximum number of iterations to check for convergence of clustering
Returns
- M: List of medoids
- C: Dictionary mapping the clustered levels to each medoid
- S: Silhouette score of each cluster, for evaluating performance
Code Below
def kMedoids(D, k, tmax=100):
    # k-medoids recipe: https://www.researchgate.net/publication/272351873_NumPy_SciPy_Recipes_for_Data_Science_k-Medoids_Clustering
    import numpy as np
    import pandas as pd
    # Determine dimensions of distance matrix D
    m, n = D.shape
    if m != n:
        raise Exception('matrix is not square')
    if (D.columns.values != D.index.values).any():
        raise Exception('rows and columns do not match')
    if k > n:
        raise Exception('too many medoids')
    # Some distance matrices will not have a 0 diagonal; force it so
    # every point is closest to itself and the updates can converge
    Dtemp = D.copy()
    np.fill_diagonal(Dtemp.values, 0)
    # Randomly initialize an array of k medoid indices
    M = list(Dtemp.sample(k).index.values)
    # Initialize a dictionary to represent clusters
    Cnew = {}
    for t in range(tmax):
        # Determine mapping to clusters: nearest medoid for each level
        J = Dtemp.loc[M].idxmin(axis='index')
        # Fill dictionary with cluster members
        C = {kappa: J[J == kappa].index.values for kappa in J.unique()}
        # Update cluster medoids: the member with minimal mean distance to its cluster
        Cnew = {Dtemp.loc[C[kappa], C[kappa]].mean().idxmin(): C[kappa] for kappa in C.keys()}
        # Update medoid list
        M = list(Cnew.keys())
        # Check for convergence (i.e., same clusters as the previous iteration)
        if set(C.keys()) == set(Cnew.keys()):
            if not sum(set(C[kappa]) != set(Cnew[kappa]) for kappa in C.keys()):
                break
    else:
        print('did not converge')
    # Calculate a per-cluster silhouette, S = (b - a) / max(a, b), where a is the
    # mean intra-cluster distance and b the mean distance to the nearest other cluster
    S = {}
    for kappa_same in Cnew.keys():
        a = Dtemp.loc[Cnew[kappa_same], Cnew[kappa_same]].mean().mean()
        b = np.min([Dtemp.loc[Cnew[kappa_other], Cnew[kappa_same]].mean().mean()
                    for kappa_other in Cnew.keys() if kappa_other != kappa_same])
        S[kappa_same] = (b - a) / max(a, b)
    # Return results
    return M, Cnew, S
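Again a hypothetical usage sketch, continuing from the made-up prox_matrix() example above:

    M, C, S = kMedoids(D, k=2, tmax=100)
    print(M)  # the k medoid levels, e.g. ['A', 'C']
    print(C)  # dict mapping each medoid to the array of levels in its cluster
    print(S)  # dict mapping each medoid to its cluster's silhouette score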
Notes:
- There are links to theory documentation in the code
- I used all records, not strictly the OOB records. Follow up here
- The prox_matrix() method is very slow. I have done a few things to speed it up, but most of the cost comes from the double loop. Updates welcome; see the vectorized sketch after these notes.
- The diagonal of the returned matrix need not be zeros. I force this in the kMedoids() method so that I get convergence.
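On the slowness point, here is a sketch of a vectorized replacement for the double loop, using only NumPy; the helper name pooled_leaf_distance() is mine, not part of the functions above. One caveat: the counts @ counts.T product counts every ordered record pair, so same-label pairs contribute in both directions and the diagonal lands closer to zero than in the loop version, which counts each unordered same-label pair once.

    import numpy as np

    def pooled_leaf_distance(leaves, labels, trees):
        # leaves: (n_records, n_trees) array from clf.apply()
        # labels: (n_records,) array of cluster_dimension values
        levels, inverse = np.unique(labels, return_inverse=True)
        numerator = np.zeros((len(levels), len(levels)))
        for t in range(leaves.shape[1]):
            # Re-index this tree's leaf ids to 0..n_leaves-1
            _, leaf_idx = np.unique(leaves[:, t], return_inverse=True)
            # counts[l, v] = number of records with label l landing in leaf v
            counts = np.zeros((len(levels), leaf_idx.max() + 1))
            np.add.at(counts, (inverse, leaf_idx), 1)
            # Matching ordered pairs between labels a and b = sum over
            # leaves of counts[a, v] * counts[b, v]
            numerator += counts @ counts.T
        # Normalize to proximity, then convert to distance
        n_per_level = np.bincount(inverse).astype(float)
        return levels, 1.0 - numerator / (trees * np.outer(n_per_level, n_per_level))

This replaces everything in prox_matrix() from the double loop down to the construction of D; a pd.DataFrame(dist, index=levels, columns=levels) call then rebuilds the sorted frame.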
Best Answer
It seems like you understand that you're able to have n levels, as opposed to n-1, because unlike in linear regression you don't need to worry about perfect collinearity. (I'm coming at this from an R perspective, but I assume it's the same in Python.) That depends on a couple of things, such as 1) which package you're using and 2) how many factor levels you have.

1) If you are using R's randomForest package, then if you have <33 factor levels you can go ahead and leave them in one feature if you want. That's because in R's random forest implementation, it will check to see which factor levels should be on one side of the split and which on the other (e.g., 5 of your levels might be grouped together on the left side, and 7 might be grouped together on the right). If you split the categorical feature out into n dummies, then the algorithm would not have this option at its disposal. Obviously, if the particular package you're using can't handle categorical features, then you'd just need to create n dummy variables.

2) As I alluded to above, R's random forest implementation can only handle 32 factor levels; if you have more than that, then you either need to split your factors into smaller subsets or create a dummy variable for each level.
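To make the n versus n-1 point concrete in Python (hypothetical data; pandas' get_dummies() is one way to build the dummies):

    import pandas as pd

    s = pd.Series(['red', 'green', 'blue', 'green'], name='color')

    # n dummies, one column per level: fine for trees, which do not
    # suffer from perfect collinearity the way linear regression does
    print(pd.get_dummies(s))                   # 3 columns: blue, green, red

    # n-1 dummies, dropping one level: the usual choice for linear models
    print(pd.get_dummies(s, drop_first=True))  # 2 columns: green, red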