Regression Trees – Pooling Levels of Categorical Variables for Effective Regression Tree Construction

Tags: cart, categorical data, hierarchical clustering, many-categories, regression

I have a data set I would like to run a regression analysis on. There are many features, of both categorical and continuous types. One of the categorical features has many (>75) levels, which is an issue. I have reason to believe that some of the levels are essentially the same. I intend to use decision trees with an ensemble method (i.e., bagging or boosting).

I would like to pool/cluster the levels of the problematic feature to improve performance. I realize that, in theory, this is not necessary if the ensemble and the number of leaves are large enough, but I am already running into computational issues.

Is there a standard method to combine levels which perform the same?

----- Edit -----

I think I found a method that should work. The idea is to use the proximity matrix: essentially the N_obs by N_obs matrix giving, for each pair of observations, the fraction of out-of-bag trees in which they end up in the same terminal node. We can then aggregate this into a level-by-level matrix whose elements are the averages of the corresponding entries in the proximity matrix. We would then pool together all levels whose proximity passes a threshold and see if this improves RMSE. It is likely best to take a stepwise, iterative approach to find the optimal pooling, but I might just take the threshold to be the average of the diagonal. This should give a reasonable threshold because it represents how often each level is in the same terminal node as itself. Comments welcome; I will report back with results.
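
A rough sketch of the aggregation step described above, assuming an observation-level proximity matrix prox (an N_obs by N_obs NumPy array) and the per-observation level labels are already available; the function name is purely illustrative:

import pandas as pd

def pool_proximity_by_level(prox, labels):
    """Average an observation-level proximity matrix into a level-by-level matrix.

    prox   : (N_obs, N_obs) array; fraction of trees in which two observations
             share a terminal node
    labels : length-N_obs array giving the categorical level of each observation
    """
    df = pd.DataFrame(prox, index=labels, columns=labels)
    # average over all observation pairs belonging to each pair of levels
    return df.groupby(level=0).mean().T.groupby(level=0).mean()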

Best Answer

I have implemented my solution to this. I wrote two functions:

prox_matrix(df, target, features, cluster_dimension, trees=10)

Parameters

  • df: Input dataframe
  • target: Dependent variable you are trying to predict with the random forest
  • features: List of independent variables
  • cluster_dimension: Dimension you would like to cluster/pool to add to your list of features
  • trees: The number of trees to use in your Random Forest

Returns

  • D: DataFrame of the proximity matrix for the cluster_dimension

Code Below

def prox_matrix(df, target, features, cluster_dimension, trees=10):
    #https://www.stat.berkeley.edu/~breiman/RandomForests/cc_home.htm#prox

    from sklearn.ensemble import RandomForestRegressor
    import numpy as np
    import pandas as pd

    #initialize dataframe for independent variables
    independent = pd.DataFrame()

    #Handle categoricals by one-hot encoding any column that cannot be coerced to numeric.
    #This should really be handled inside RandomForestRegressor.
    for column, data_type in df[features].dtypes.items():
        try:
            independent[column] = pd.to_numeric(df[column], downcast='integer')
        except ValueError:
            contains_nulls = df[column].isnull().values.any()
            dummies = pd.get_dummies(df[column], prefix=column, dummy_na=contains_nulls, drop_first=True)
            independent[dummies.columns] = dummies

    if len(independent.index) != len(df.index):
        raise Exception('independent variables not stored properly')

    #train model
    clf = RandomForestRegressor(n_estimators=trees, n_jobs=-1)
    clf.fit(independent, df[target])

    #final leaf index for each observation in each tree
    leaves = clf.apply(independent)
    #value in cluster dimension
    labels = df[cluster_dimension].values

    numerator_matrix = {}
    for i,value_i in enumerate(labels):
        for j,value_j in enumerate(labels):
            if i >= j:       
                numerator_matrix[(value_i,value_j)] = numerator_matrix.get((value_i,value_j), 0) + np.count_nonzero(leaves[i]==leaves[j])
                numerator_matrix[(value_j,value_i)] = numerator_matrix[(value_i,value_j)] 

    #normalize by the total number of possible matching leaves;
    #note: this stores 1 - proximity (a distance), which is what kMedoids() expects
    prox_matrix = {key: 1.0 - float(x)/(trees*np.count_nonzero(labels==key[0])*np.count_nonzero(labels==key[1])) for key, x in numerator_matrix.items()}

    #assemble the sorted level-by-level DataFrame
    levels = np.unique(labels)
    D = pd.DataFrame(data=[[ prox_matrix[(i,j)] for i in levels] for j in levels],index=levels,columns=levels)

    return D
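
For reference, a hypothetical call might look like this (the column names are made up); note that the returned D actually stores 1 - proximity, i.e. a distance, so it can be passed directly to kMedoids() below:

# Hypothetical usage; 'store_id' is the >75-level feature and 'sales' the target.
features = ['price', 'promo', 'day_of_week']
D = prox_matrix(df, target='sales', features=features,
                cluster_dimension='store_id', trees=50)
print(D.shape)  # (n_levels, n_levels); entries are 1 - average proximity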

kMedoids(D, k, tmax=100)

Parameters

  • D: Proximity/distance matrix
  • k: Number of clusters
  • tmax: Maximum number of iterations to check for convergence of clustering

Returns

  • M: List of medoids
  • C: Dictionary mapping each medoid to the levels clustered with it
  • S: Silhouette of each cluster for evaluating performance

Code Below

def kMedoids(D, k, tmax=100):
    #https://www.researchgate.net/publication/272351873_NumPy_SciPy_Recipes_for_Data_Science_k-Medoids_Clustering
    import numpy as np
    import pandas as pd

    # determine dimensions of distance matrix D
    m, n = D.shape

    if m != n:
        raise Exception('matrix not square')

    if (D.columns.values != D.index.values).any():
        raise Exception('rows and columns do not match')

    if k > n:
        raise Exception('too many medoids')

    #Some distance matrices will not have a 0 diagonal
    Dtemp = D.copy()
    np.fill_diagonal(Dtemp.values,0)

    # randomly initialize an array of k medoid indices
    M = list(Dtemp.sample(k).index.values)

    # initialize a dictionary to represent clusters
    Cnew = {}

    for t in range(tmax):
        # determine mapping to clusters
        J = Dtemp.loc[M].idxmin(axis='index')
        #Fill dictionary with cluster members
        C = {kappa: J[J==kappa].index.values for kappa in J.unique()}  
        # update cluster medoids
        Cnew = {Dtemp.loc[C[kappa],C[kappa]].mean().idxmin() : C[kappa] for kappa in C.keys()}       
        #Update medoid list
        M = list(Cnew.keys())

        # check for convergence (ie same clusters)
        if set(C.keys()) == set(Cnew.keys()):
            if not sum(set(C[kappa]) != set(Cnew[kappa]) for kappa in C.keys()): break            
    else:        
        print('did not converge')

    #Calculate silhouette 
    S = {}
    for kappa_same in Cnew.keys():
        a = Dtemp.loc[Cnew[kappa_same],Cnew[kappa_same]].mean().mean()
        b = np.min([Dtemp.loc[Cnew[kappa_other],Cnew[kappa_same]].mean().mean() for kappa_other in Cnew.keys() if kappa_other!=kappa_same])
        S[kappa_same] = (b - a) / max(a, b)

    # return results
    return M, Cnew, S
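
A hypothetical usage, continuing the made-up example from above; k = 10 is an assumption you would tune, e.g. by comparing silhouettes across several values of k:

# Cluster the levels into k pooled groups using the distance matrix from prox_matrix().
M, C, S = kMedoids(D, k=10)

# Map each original level to its medoid and add the pooled feature to the data.
level_to_pool = {level: medoid for medoid, members in C.items() for level in members}
df['store_id_pooled'] = df['store_id'].map(level_to_pool)

print(S)  # per-cluster silhouettes; values near 1 suggest well-separated clusters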

Notes:

  1. There are links to theory documentation in the code
  2. I used all records, not strictly the OOB records. Follow up here
  3. The prox_matrix() method is very slow. I have done a few things to speed it up, but most of the cost comes from the double loop. Updates are welcome; a rough vectorized sketch is included after these notes.
  4. The diagonal of the proximity matrix need not be zero. I force it to zero in the kMedoids() method so that the clustering converges.
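
Regarding note 3, one possible way around the Python-level double loop is to tally, tree by tree, how many observations of each level land in each leaf and take an outer product. This is only a sketch: it counts all ordered observation pairs, so its diagonal differs slightly from the i >= j convention used in prox_matrix().

import numpy as np
import pandas as pd

def cooccurrence_counts(leaves, labels):
    """Level-by-level co-occurrence counts, summed over trees (vectorized sketch).

    leaves : (N_obs, n_trees) array from clf.apply(independent)
    labels : length-N_obs array of levels of cluster_dimension
    """
    levels = np.unique(labels)
    counts = pd.DataFrame(0.0, index=levels, columns=levels)
    for t in range(leaves.shape[1]):
        # how many observations of each level land in each leaf of tree t
        tally = pd.crosstab(leaves[:, t], labels)   # (n_leaves, n_levels)
        # per-leaf outer products, summed over leaves: level-by-level co-occurrences
        counts += tally.T.dot(tally)
    return counts

Dividing these counts by trees * n_a * n_b (where n_a and n_b are the number of observations in each level) and subtracting from 1 would reproduce the distance computed in prox_matrix(), up to the diagonal convention noted above.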