I have a data set on which I would like to run a regression analysis. It has many features, both categorical and continuous. One of the categorical features has many (>75) levels, which is a problem. I have reason to believe that some of the levels are essentially the same. I intend to use decision trees with an ensemble method (i.e., bagging or boosting).
I would like to pool/cluster the levels of the problematic feature to improve performance. I realize that, in theory, this is unnecessary if the ensemble or the number of leaves is large enough, but I am already running into computational issues.
Is there a standard method for combining levels that behave the same?
Edit:
I think I found a method that would work. The idea is to use the proximity matrix: the N_Obs by N_Obs matrix whose entries are the fraction of out-of-bag trees in which two observations fall in the same terminal node. We can then aggregate this into a level-by-level matrix whose elements are the average of the corresponding entries of the proximity matrix. We would then pool levels whose average proximity passes a threshold and check whether this improves RMSE. A step-wise iterative approach is probably the best way to find the optimal pooling, but I might just take the threshold to be the average of the diagonal. This should be a reasonable threshold because the diagonal represents how often observations of each level land in the same terminal node as one another. Comments welcome; I will report back on results.
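To make the aggregation step concrete, here is a minimal pure-Python sketch of the idea described above. It is only an illustration under stated assumptions: the terminal-node ids are assumed to have been computed elsewhere (for example with a forest's `apply` method in scikit-learn), an entry of `None` marks a tree for which the observation was in-bag, and pairs of an observation with itself are left out of the diagonal. The function names, and the choice to pool transitively with a union-find, are hypothetical and not part of the original description.

```python
from collections import defaultdict


def pair_proximity(leaves_i, leaves_j):
    """Fraction of trees where both observations are out-of-bag and share a leaf."""
    shared = [(a, b) for a, b in zip(leaves_i, leaves_j)
              if a is not None and b is not None]
    if not shared:
        return None  # no tree has both observations out-of-bag
    return sum(a == b for a, b in shared) / len(shared)


def level_proximity(leaf_ids, levels):
    """Average observation-level proximity for every pair of levels.

    leaf_ids[i][t] is the terminal node of observation i in tree t (None if
    in-bag); levels[i] is the level of the categorical feature (assumed to be
    sortable, e.g. strings). Returns a dict keyed by sorted (level_a, level_b).
    """
    sums, counts = defaultdict(float), defaultdict(int)
    n = len(leaf_ids)
    for i in range(n):
        for j in range(i + 1, n):  # distinct pairs only (assumption)
            p = pair_proximity(leaf_ids[i], leaf_ids[j])
            if p is None:
                continue
            key = tuple(sorted((levels[i], levels[j])))
            sums[key] += p
            counts[key] += 1
    return {key: sums[key] / counts[key] for key in sums}


def diagonal_threshold(prox):
    """Average of the within-level (diagonal) proximities."""
    diag = [v for (a, b), v in prox.items() if a == b]
    return sum(diag) / len(diag)


def pool_levels(prox, threshold):
    """Merge levels whose between-level proximity is at least `threshold`."""
    parent = {}
    for a, b in prox:
        parent.setdefault(a, a)
        parent.setdefault(b, b)

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path compression
            x = parent[x]
        return x

    for (a, b), v in prox.items():
        if a != b and v >= threshold:
            parent[find(a)] = find(b)

    groups = defaultdict(list)
    for level in parent:
        groups[find(level)].append(level)
    return sorted(sorted(g) for g in groups.values())


# Toy data: 6 observations, 3 trees, 3 levels (all values made up).
example_leaf_ids = [
    [1, 1, 2], [1, 1, 2],   # level "a"
    [1, 1, 2], [1, 1, 3],   # level "b"
    [5, 6, 7], [5, 6, 7],   # level "c"
]
example_levels = ["a", "a", "b", "b", "c", "c"]
example_prox = level_proximity(example_leaf_ids, example_levels)
example_groups = pool_levels(example_prox, diagonal_threshold(example_prox))
```

Because the pooling is transitive, two levels can end up in the same group through a third level even if their own proximity is below the threshold; a step-wise approach that re-checks RMSE after each merge would avoid that.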
Best Answer
I have implemented my solution to this. I wrote two functions:
prox_matrix(df, target, features, cluster_dimension, trees=10)
Parameters
Returns
Code Below
kMedoids(D, k, tmax=100)
Parameters
Returns
Code Below
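For readers who want something to experiment with, a minimal pure-Python sketch of a k-medoids routine taking the same three arguments might look like the following. It is not the code from this answer: the name `k_medoids`, the extra `seed` argument, the return format, and the suggestion to build `D` as one minus the level-by-level proximity are all assumptions made for illustration.

```python
import random


def k_medoids(D, k, tmax=100, seed=0):
    """Cluster n items given an n-by-n distance matrix D (list of lists).

    Returns (medoids, clusters), where clusters maps each medoid index to the
    list of item indices assigned to it. `seed` is a hypothetical extra
    argument that makes the random initialisation reproducible.
    """
    n = len(D)
    rng = random.Random(seed)
    medoids = sorted(rng.sample(range(n), k))

    def assign(current):
        # Each medoid stays in its own cluster; every other item goes to
        # the nearest medoid, so no cluster is ever empty.
        clusters = {m: [] for m in current}
        for i in range(n):
            nearest = i if i in clusters else min(current, key=lambda m: D[i][m])
            clusters[nearest].append(i)
        return clusters

    for _ in range(tmax):
        clusters = assign(medoids)
        # New medoid of a cluster: the member with the smallest total
        # distance to the other members.
        new_medoids = sorted(
            min(members, key=lambda c: sum(D[c][j] for j in members))
            for members in clusters.values()
        )
        if new_medoids == medoids:
            break  # converged
        medoids = new_medoids

    return medoids, assign(medoids)


# Toy distance matrix for four points on a line at 0, 1, 10 and 11.
example_points = [0, 1, 10, 11]
example_D = [[abs(a - b) for b in example_points] for a in example_points]
example_medoids, example_clusters = k_medoids(example_D, k=2)
```

With a level-by-level proximity matrix in hand, one plausible choice is to pass `D[a][b] = 1 - proximity[a][b]` so that levels that often share terminal nodes are close together, and then treat each resulting cluster as one pooled level.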
Notes: