Solved – Can we use random forest for classification in combination with a distance matrix between classes

classification, feature selection, r, random forest

With a colleague, we are working on a dataset containing ~5000 continuous variables for 120 individuals belonging to 8 classes.

We want to estimate the relative importance of each variable to explain the classes.
We have used a random forest approach with some success.
Now, we would like to go deeper by considering the fact that the 8 classes we fit are unequally distant from each other.
In fact, in our case we can generate an a priori distance matrix (i.e. a cost matrix) for all possible pairs of classes.

My (very limited) understanding of random forest is that, for regression problems,
the error $E$ is computed as the mean squared difference between each out-of-bag (OOB) sample and its prediction:

$E = n^{-1}\sum\limits_{i=1}^n{{(y_i-\hat{y}_i)}^2}$

Where $y_i$ is the observed value and $\hat{y}_i$ the predicted value for out-of-bag sample $i$.
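
For the regression case this is easy to reproduce, since a randomForest fit stores its out-of-bag predictions in the predicted component; here is a minimal sketch on toy data (the data and variable names are made up):

```r
library(randomForest)

## toy regression example: rf$predicted holds the out-of-bag predictions,
## so the OOB mean squared error E can be computed directly
set.seed(1)
x <- matrix(rnorm(120 * 50), nrow = 120)   # 120 individuals, 50 made-up variables
y <- rnorm(120)

rf <- randomForest(x, y, ntree = 500)
oob_mse <- mean((y - rf$predicted)^2)      # E = n^-1 * sum_i (y_i - y_hat_i)^2
oob_mse
```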

Ultimately, the calculation of the variable importance depends on how the error is computed (right?).

In our case, I would like to use a modified loss function, for instance:

$E = n^{-1}\sum\limits_{i=1}^n{M_{y_i,\hat{y}_i}}$

Where $M$ is a predefined distance matrix, so that $M_{a,b}$ is the distance between classes $a$ and $b$.
In this way the misclassification error would be larger when $y_i$ and $\hat{y}_i$ correspond to distant classes and, ultimately, the variable importance should be more relevant.
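
A rough sketch in R of what I mean (toy data and a made-up distance matrix M; note that, for simplicity, it permutes a variable over the whole data set and reuses the fitted forest, rather than permuting within each tree's out-of-bag sample as the built-in importance does):

```r
library(randomForest)

## M is a K x K distance (cost) matrix with class names as dimnames,
## so M[a, b] is the cost of predicting class b when the true class is a
cost_weighted_error <- function(truth, pred, M) {
  mean(M[cbind(as.character(truth), as.character(pred))])
}

set.seed(1)
n <- 120; p <- 50                                        # stand-in for the real data
x <- data.frame(matrix(rnorm(n * p), nrow = n))
y <- factor(sample(paste0("class", 1:8), n, replace = TRUE))
M <- as.matrix(dist(1:8))                                # hypothetical class distances
dimnames(M) <- list(levels(y), levels(y))

rf <- randomForest(x, y, ntree = 500)
base_err <- cost_weighted_error(y, rf$predicted, M)      # OOB error weighted by M

## crude permutation importance under the cost-weighted loss:
## permute one variable at a time and record the increase in error
importance_cost <- sapply(names(x), function(v) {
  xp <- x
  xp[[v]] <- sample(xp[[v]])
  cost_weighted_error(y, predict(rf, xp), M) - base_err
})
head(sort(importance_cost, decreasing = TRUE))
```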

My questions are:

  1. Does this approach make sense to you, or am I missing something?
  2. Can you think of any study that has used something similar?
  3. We have so far used the randomForest package in R. It does not seem possible to use it in combination with an a priori distance matrix between classes. Do you know if this is already implemented somewhere?

EDIT

I believe this is a very frequent problem in my field, biology, because we deal with classes whose relations can be represented and quantified by trees (dendrograms), often because of their lineage.

After some research, it appears that my question is about using a cost-sensitive version of random forest. In this respect, it is very similar to this question.
I specifically want to use a cost matrix rather than a cost vector, though.
Is there any fundamental reason why this is not possible, or is it simply not implemented?

Best Answer

An ensemble composed not of CART trees but of gradient-boosted trees has the minimization of an exponential loss as one of its principles (Feature Selection with Ensembles, Artificial Variables, and Redundancy Elimination by Tuv, Borisov, Runger, and Torkkola (2009)). You should be able to weight your "output" vector (a matrix multiplication) and the boosted trees will account for that weight in their "training" (adaboost tree tutorial).

I have been giving some thought to this in terms of statistical design of experiments (DOE): you can account for Euclidean distance and higher-order interactions with the (very) weak learner of a CART by "manually" expanding your domain.

If you had univariate data in a single column (the trivial case) titled "$x$", then you could form "$x^2$" or "$x^3$" and construct an augmented domain matrix as $[x \;\; x^2 \;\; x^3 \;\; \ldots]$. If your domain has two variables, "$x_1$" and "$x_2$", then you could consider interactions as well as powers, composing the augmented domain matrix as $[x_1 \;\; x_1^2 \;\; x_1^3 \;\; \ldots \;\; x_2 \;\; x_2^2 \;\; x_2^3 \;\; \ldots \;\; x_1 x_2 \;\; x_1^2 x_2 \;\; \ldots \;\; x_1 x_2^2 \;\; \ldots]$. If you were motivated, you could also consider contrasts ($x_1 - x_2$), their powers, and their interactions with variables, variable powers, variable interactions, other contrasts, and their powers.
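
A minimal sketch of that augmentation in R (toy data; the column names are mine):

```r
## "augmented domain" for two variables x1 and x2: add powers, interactions,
## and a contrast as extra columns, then feed the widened matrix to the same
## random forest (or boosting) tool as before
set.seed(1)
x1 <- rnorm(120)
x2 <- rnorm(120)

x_aug <- cbind(
  x1          = x1,
  x1_sq       = x1^2,
  x1_cu       = x1^3,
  x2          = x2,
  x2_sq       = x2^2,
  x2_cu       = x2^3,
  x1_x2       = x1 * x2,       # interaction
  x1sq_x2     = x1^2 * x2,     # higher-order interaction
  x1_minus_x2 = x1 - x2        # contrast
)
dim(x_aug)   # 120 rows, 9 columns instead of 2
```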

Though this might substantially increase the number of columns, one of the great strengths of random forests is the ability to handle very high dimensional data. This is not a "no go" computationally or analytically.

This gives you a way to account for interactions and distances in the data without having to abandon your current tool.

It isn't exactly what you were looking for, but it addresses part of your initial question and gives you some directions worth exploring.

Best wishes.

EDIT:

How to make a random forest algorithm cost sensitive: (link)
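
If you stay within the randomForest package itself, the closest built-in option I am aware of is the per-class weight vector classwt, which is a cost vector rather than the pairwise cost matrix you asked about; a minimal sketch with made-up weights:

```r
library(randomForest)

## per-class weights (a vector, not a pairwise cost matrix) via classwt
set.seed(1)
n <- 120; p <- 50
x <- data.frame(matrix(rnorm(n * p), nrow = n))
y <- factor(sample(paste0("class", 1:8), n, replace = TRUE))

w <- c(1, 1, 2, 2, 1, 3, 1, 2)     # hypothetical: up-weight costlier classes
names(w) <- levels(y)

rf_weighted <- randomForest(x, y, ntree = 500, classwt = w)
rf_weighted$confusion
```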