Solved – Can we use random forest for classification in combination with a distance matrix between classes

classification, feature selection, r, random forest

With a colleague, we are working on a dataset containing ~5000 continuous variables for 120 individuals belonging to 8 classes.

We want to estimate the relative importance of each variable to explain the classes.
We have used a random forest approach with some success.
Now, we would like to go deeper by considering the fact that the 8 classes we fit are unequally distant from each other.
In fact, in our case we can generate an a priori distance matrix (i.e. a cost matrix) for all possible pairs of classes.

My (very limited) understanding of random forest is that, for regression problems,
the error $E$ is computed as the mean squared difference between each out-of-bag (OOB) sample and its prediction:

$E = n^{-1}\sum\limits_{i=1}^n{{(y_i-\hat{y}_i)}^2}$

Where $y_i$ is the observed value and $\hat{y}_i$ the predicted value for out-of-bag sample $i$.
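
For the regression case this is easy to reproduce, since a randomForest fit stores its out-of-bag predictions in the predicted component; here is a minimal sketch on toy data (the data and variable names are made up):

```r
library(randomForest)

## toy regression example: rf$predicted holds the out-of-bag predictions,
## so the OOB mean squared error E can be computed directly
set.seed(1)
x <- matrix(rnorm(120 * 50), nrow = 120)   # 120 individuals, 50 made-up variables
y <- rnorm(120)

rf <- randomForest(x, y, ntree = 500)
oob_mse <- mean((y - rf$predicted)^2)      # E = n^-1 * sum_i (y_i - y_hat_i)^2
oob_mse
```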

Ultimately, the calculation of the variable importance depends on how the error is computed (right?).

In our case, I would like to use a modified loss function, for instance:

$E = n^{-1}\sum\limits_{i=1}^n{M_{y_i,\hat{y}_i}}$

Where $M$ is a predefined distance matrix, so that $M_{a,b}$ is the distance between classes $a$ and $b$.
In this way the misclassification error would be larger when $y_i$ and $\hat{y}_i$ correspond to distant classes and, ultimately, the variable importance should be more relevant.
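
A rough sketch in R of what I mean (toy data and a made-up distance matrix M; note that, for simplicity, it permutes a variable over the whole data set and reuses the fitted forest, rather than permuting within each tree's out-of-bag sample as the built-in importance does):

```r
library(randomForest)

## M is a K x K distance (cost) matrix with class names as dimnames,
## so M[a, b] is the cost of predicting class b when the true class is a
cost_weighted_error <- function(truth, pred, M) {
  mean(M[cbind(as.character(truth), as.character(pred))])
}

set.seed(1)
n <- 120; p <- 50                                        # stand-in for the real data
x <- data.frame(matrix(rnorm(n * p), nrow = n))
y <- factor(sample(paste0("class", 1:8), n, replace = TRUE))
M <- as.matrix(dist(1:8))                                # hypothetical class distances
dimnames(M) <- list(levels(y), levels(y))

rf <- randomForest(x, y, ntree = 500)
base_err <- cost_weighted_error(y, rf$predicted, M)      # OOB error weighted by M

## crude permutation importance under the cost-weighted loss:
## permute one variable at a time and record the increase in error
importance_cost <- sapply(names(x), function(v) {
  xp <- x
  xp[[v]] <- sample(xp[[v]])
  cost_weighted_error(y, predict(rf, xp), M) - base_err
})
head(sort(importance_cost, decreasing = TRUE))
```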

My questions are:

  1. Does this approach make sense to you, or am I missing something?
  2. Can you think of any study that has used something similar?
  3. We have so far used the randomForest package in R. It does not seem possible to use it in combination with an a priori distance matrix between classes. Do you know if this is already implemented somewhere?

EDIT

I believe this is a very frequent problem in my field, biology, because we deal with classes whose relations can be represented and quantified by trees (dendrograms), often because of their lineage.

After some research, it appears that my question is about using a cost-sensitive version of random forest. In this respect, it is very similar to this question.
I specifically want to use a cost matrix rather than a cost vector, though.
Is there any fundamental reason why this is not possible, or is it simply not implemented?

Best Answer

An ensemble composed not of CART trees but of gradient-boosted trees has the minimization of an exponential loss as one of its principles (Feature Selection with Ensembles, Artificial Variables, and Redundancy Elimination by Tuv, Borisov, Runger, and Torkkola (2009)). You should be able to weight your "output" vector (a matrix multiplication) and the boosted trees will account for that weight in their "training" (adaboost tree tutorial).

I have been giving some thought to this in terms of statistical design of experiments (DOE): you can account for Euclidean distance and higher-order interactions with the (very) weak learner of a CART by "manually" expanding your domain.

If you had univariate data in a single column (the trivial case) titled "$x$", then you could form "$x^2$" or "$x^3$" and construct an augmented domain matrix as $[x \;\; x^2 \;\; x^3 \;\; \ldots]$. If your domain has two variables, "$x_1$" and "$x_2$", then you could consider interactions as well as powers, composing the augmented domain matrix as $[x_1 \;\; x_1^2 \;\; x_1^3 \;\; \ldots \;\; x_2 \;\; x_2^2 \;\; x_2^3 \;\; \ldots \;\; x_1 x_2 \;\; x_1^2 x_2 \;\; \ldots \;\; x_1 x_2^2 \;\; \ldots]$. If you were motivated, you could also consider contrasts ($x_1 - x_2$), their powers, and their interactions with variables, variable powers, variable interactions, other contrasts, and their powers.
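
A minimal sketch of that augmentation in R (toy data; the column names are mine):

```r
## "augmented domain" for two variables x1 and x2: add powers, interactions,
## and a contrast as extra columns, then feed the widened matrix to the same
## random forest (or boosting) tool as before
set.seed(1)
x1 <- rnorm(120)
x2 <- rnorm(120)

x_aug <- cbind(
  x1          = x1,
  x1_sq       = x1^2,
  x1_cu       = x1^3,
  x2          = x2,
  x2_sq       = x2^2,
  x2_cu       = x2^3,
  x1_x2       = x1 * x2,       # interaction
  x1sq_x2     = x1^2 * x2,     # higher-order interaction
  x1_minus_x2 = x1 - x2        # contrast
)
dim(x_aug)   # 120 rows, 9 columns instead of 2
```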

Though this might substantially increase the number of columns, one of the great strengths of random forests is the ability to handle very high dimensional data. This is not a "no go" computationally or analytically.

This gives you a way to account for interactions and distances in the data without having to abandon your current tool.

It isn't exactly what you were looking for, but it addresses part of your initial question and gives you some directions worth exploring.

Best wishes.

EDIT:

How to make a random forest algorithm cost sensitive: (link)
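
If you stay within the randomForest package itself, the closest built-in option I am aware of is the per-class weight vector classwt, which is a cost vector rather than the pairwise cost matrix you asked about; a minimal sketch with made-up weights:

```r
library(randomForest)

## per-class weights (a vector, not a pairwise cost matrix) via classwt
set.seed(1)
n <- 120; p <- 50
x <- data.frame(matrix(rnorm(n * p), nrow = n))
y <- factor(sample(paste0("class", 1:8), n, replace = TRUE))

w <- c(1, 1, 2, 2, 1, 3, 1, 2)     # hypothetical: up-weight costlier classes
names(w) <- levels(y)

rf_weighted <- randomForest(x, y, ntree = 500, classwt = w)
rf_weighted$confusion
```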