It looks to me like classregtree is just building a tree, not using any of these methods, all of which are supplementary to tree building. That is, classregtree is implementing the methods described in Breiman et al., per the reference given in the documentation. It builds a tree and then (by default) prunes it.
I have implemented my solution to this. I wrote two functions:
prox_matrix(df, target, features, cluster_dimension, trees=10)
Parameters
- df: Input dataframe
- target: Dependent variable you are trying to predict with the random forest
- features: List of independent variables
- cluster_dimension: Dimension you would like to cluster/pool and add to your list of features
- trees: The number of trees to use in your random forest
Returns
- D: DataFrame of the proximity-based distance matrix (1 - average proximity) between the levels of cluster_dimension
Code Below
def prox_matrix(df, target, features, cluster_dimension, trees=10):
    # Proximity measure: https://www.stat.berkeley.edu/~breiman/RandomForests/cc_home.htm#prox
    from sklearn.ensemble import RandomForestRegressor
    import numpy as np
    import pandas as pd
    # Initialize dataframe for independent variables
    independent = pd.DataFrame()
    # Handle categoricals: this should really be added to RandomForestRegressor
    for column, data_type in df[features].dtypes.items():
        try:
            independent[column] = pd.to_numeric(df[column], downcast='integer')
        except ValueError:
            contains_nulls = df[column].isnull().values.any()
            dummies = pd.get_dummies(df[column], prefix=column, dummy_na=contains_nulls, drop_first=True)
            independent[dummies.columns] = dummies
    if len(independent.index) != len(df.index):
        raise Exception('independent variables not stored properly')
    # Train model
    clf = RandomForestRegressor(n_estimators=trees, n_jobs=-1)
    clf.fit(independent, df[target])
    # Terminal leaf of each record in each tree: shape (n_records, n_trees)
    leaves = clf.apply(independent)
    # Value in the cluster dimension for each record
    labels = df[cluster_dimension].values
    # Count leaf co-occurrences for every pair of records, pooled by label pair
    numerator_matrix = {}
    for i, value_i in enumerate(labels):
        for j, value_j in enumerate(labels):
            if i >= j:
                numerator_matrix[(value_i, value_j)] = numerator_matrix.get((value_i, value_j), 0) + np.count_nonzero(leaves[i] == leaves[j])
                numerator_matrix[(value_j, value_i)] = numerator_matrix[(value_i, value_j)]
    # Normalize by the total number of possible matching leaves and
    # convert proximity to distance (1 - proximity)
    prox_matrix = {key: 1.0 - float(x) / (trees * np.count_nonzero(labels == key[0]) * np.count_nonzero(labels == key[1]))
                   for key, x in numerator_matrix.items()}
    # Make a sorted dataframe
    levels = np.unique(labels)
    D = pd.DataFrame(data=[[prox_matrix[(i, j)] for i in levels] for j in levels],
                     index=levels, columns=levels)
    return D
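For reference, a minimal usage sketch. The dataframe and all column names here are made up purely for illustration:

    import pandas as pd

    df = pd.DataFrame({
        'store':  ['A', 'A', 'B', 'B', 'C', 'C'],     # dimension to pool/cluster
        'region': ['N', 'N', 'S', 'S', 'S', 'N'],     # categorical feature
        'price':  [1.0, 1.2, 0.9, 1.1, 1.3, 0.8],     # numeric feature
        'sales':  [10.0, 12.0, 9.0, 11.0, 13.0, 8.0]  # target
    })

    D = prox_matrix(df, 'sales', ['region', 'price'], 'store', trees=50)
    print(D)  # 3x3 symmetric DataFrame of 1 - proximity, indexed by store level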
kMedoids(D, k, tmax=100)
Parameters
- D: Proximity/distance matrix
- k: Number of clusters
- tmax: Maximum number of iterations to check for convergence of clustering
Returns
- M: List of medoids
- C: Dictionary mapping the clustered levels to each medoid
- S: Silhouette score of each cluster, for evaluating performance
Code Below
def kMedoids(D, k, tmax=100):
    # k-medoids recipe: https://www.researchgate.net/publication/272351873_NumPy_SciPy_Recipes_for_Data_Science_k-Medoids_Clustering
    import numpy as np
    import pandas as pd
    # Determine dimensions of distance matrix D
    m, n = D.shape
    if m != n:
        raise Exception('matrix is not square')
    if (D.columns.values != D.index.values).any():
        raise Exception('rows and columns do not match')
    if k > n:
        raise Exception('too many medoids')
    # Some distance matrices will not have a 0 diagonal; force it so
    # every point is closest to itself and the updates can converge
    Dtemp = D.copy()
    np.fill_diagonal(Dtemp.values, 0)
    # Randomly initialize an array of k medoid indices
    M = list(Dtemp.sample(k).index.values)
    # Initialize a dictionary to represent clusters
    Cnew = {}
    for t in range(tmax):
        # Determine mapping to clusters: nearest medoid for each level
        J = Dtemp.loc[M].idxmin(axis='index')
        # Fill dictionary with cluster members
        C = {kappa: J[J == kappa].index.values for kappa in J.unique()}
        # Update cluster medoids: the member with minimal mean distance to its cluster
        Cnew = {Dtemp.loc[C[kappa], C[kappa]].mean().idxmin(): C[kappa] for kappa in C.keys()}
        # Update medoid list
        M = list(Cnew.keys())
        # Check for convergence (i.e., same clusters as the previous iteration)
        if set(C.keys()) == set(Cnew.keys()):
            if not sum(set(C[kappa]) != set(Cnew[kappa]) for kappa in C.keys()):
                break
    else:
        print('did not converge')
    # Calculate a per-cluster silhouette, S = (b - a) / max(a, b), where a is the
    # mean intra-cluster distance and b the mean distance to the nearest other cluster
    S = {}
    for kappa_same in Cnew.keys():
        a = Dtemp.loc[Cnew[kappa_same], Cnew[kappa_same]].mean().mean()
        b = np.min([Dtemp.loc[Cnew[kappa_other], Cnew[kappa_same]].mean().mean()
                    for kappa_other in Cnew.keys() if kappa_other != kappa_same])
        S[kappa_same] = (b - a) / max(a, b)
    # Return results
    return M, Cnew, S
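Again a hypothetical usage sketch, continuing from the made-up prox_matrix() example above:

    M, C, S = kMedoids(D, k=2, tmax=100)
    print(M)  # the k medoid levels, e.g. ['A', 'C']
    print(C)  # dict mapping each medoid to the array of levels in its cluster
    print(S)  # dict mapping each medoid to its cluster's silhouette score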
Notes:
- There are links to theory documentation in the code
- I used all records, not strictly the OOB records. Follow up here
- The prox_matrix() method is very slow. I have done a few things to speed it up, but most of the cost comes from the double loop. Updates welcome; see the vectorized sketch after these notes.
- The diagonal of the returned matrix need not be zeros. I force this in the kMedoids() method so that I get convergence.
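On the slowness point, here is a sketch of a vectorized replacement for the double loop, using only NumPy; the helper name pooled_leaf_distance() is mine, not part of the functions above. One caveat: the counts @ counts.T product counts every ordered record pair, so same-label pairs contribute in both directions and the diagonal lands closer to zero than in the loop version, which counts each unordered same-label pair once.

    import numpy as np

    def pooled_leaf_distance(leaves, labels, trees):
        # leaves: (n_records, n_trees) array from clf.apply()
        # labels: (n_records,) array of cluster_dimension values
        levels, inverse = np.unique(labels, return_inverse=True)
        numerator = np.zeros((len(levels), len(levels)))
        for t in range(leaves.shape[1]):
            # Re-index this tree's leaf ids to 0..n_leaves-1
            _, leaf_idx = np.unique(leaves[:, t], return_inverse=True)
            # counts[l, v] = number of records with label l landing in leaf v
            counts = np.zeros((len(levels), leaf_idx.max() + 1))
            np.add.at(counts, (inverse, leaf_idx), 1)
            # Matching ordered pairs between labels a and b = sum over
            # leaves of counts[a, v] * counts[b, v]
            numerator += counts @ counts.T
        # Normalize to proximity, then convert to distance
        n_per_level = np.bincount(inverse).astype(float)
        return levels, 1.0 - numerator / (trees * np.outer(n_per_level, n_per_level))

This replaces everything in prox_matrix() from the double loop down to the construction of D; a pd.DataFrame(dist, index=levels, columns=levels) call then rebuilds the sorted frame.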
Best Answer
It seems like you understand that you're able to have n levels, as opposed to n-1, because unlike in linear regression you don't need to worry about perfect collinearity. (I'm coming at this from an R perspective, but I assume it's the same in Python.) That depends on a couple of things, such as 1) which package you're using and 2) how many factor levels you have.

1) If you are using R's randomForest package, then if you have <33 factor levels you can go ahead and leave them in one feature if you want. That's because in R's random forest implementation, it will check to see which factor levels should be on one side of the split and which on the other (e.g., 5 of your levels might be grouped together on the left side, and 7 might be grouped together on the right). If you split the categorical feature out into n dummies, then the algorithm would not have this option at its disposal. Obviously, if the particular package you're using can't handle categorical features, then you'd just need to create n dummy variables.

2) As I alluded to above, R's random forest implementation can only handle 32 factor levels; if you have more than that, then you either need to split your factors into smaller subsets or create a dummy variable for each level.
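To make the n versus n-1 point concrete in Python (hypothetical data; pandas' get_dummies() is one way to build the dummies):

    import pandas as pd

    s = pd.Series(['red', 'green', 'blue', 'green'], name='color')

    # n dummies, one column per level: fine for trees, which do not
    # suffer from perfect collinearity the way linear regression does
    print(pd.get_dummies(s))                   # 3 columns: blue, green, red

    # n-1 dummies, dropping one level: the usual choice for linear models
    print(pd.get_dummies(s, drop_first=True))  # 2 columns: green, red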