Solved – Random forest on grouped data

random forest, regression

I am using random forest on high-dimensional grouped data (50 numeric input variables) with a hierarchical structure. The data were collected with 6 replications at each of 30 positions on each of 70 different objects, giving 12600 data points that are not independent.

Random forest seems to be overfitting the data: the OOB error is much smaller than the error we get when we leave all data from one object out during training and then predict the outcome for that left-out object with the trained forest. Moreover, the residuals are correlated.
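The gap between the two error estimates comes down to how rows are held out. The OOB estimate holds out individual rows, so replicates of the same object can sit on both sides of the split, whereas the leave-one-object-out scheme holds out every row of an object at once. Below is a minimal sketch of the latter split in plain Python. The function name `leave_one_group_out`, the variable `object_ids`, and the row ordering are assumptions made for illustration; they are not taken from the original data or code.

```python
from collections import defaultdict


def leave_one_group_out(groups):
    """Yield (held_out_group, train_idx, test_idx), holding out one group at a time."""
    # Collect the row indices belonging to each group label.
    by_group = defaultdict(list)
    for i, g in enumerate(groups):
        by_group[g].append(i)

    labels = sorted(by_group)
    for held_out in labels:
        test_idx = by_group[held_out]
        # Training rows are all rows whose group differs from the held-out one.
        train_idx = [i for g in labels if g != held_out for i in by_group[g]]
        yield held_out, train_idx, test_idx


# Assumed layout mirroring the question: 70 objects x 30 positions x 6 replications,
# with rows sorted by object (hypothetical ordering).
N_OBJECTS, N_POSITIONS, N_REPS = 70, 30, 6
object_ids = [obj for obj in range(N_OBJECTS) for _ in range(N_POSITIONS * N_REPS)]

folds = list(leave_one_group_out(object_ids))
print(len(object_ids), len(folds), len(folds[0][2]))
```

Each fold's `train_idx` and `test_idx` can then be used to fit the forest and score it on the held-out object. If scikit-learn is in use, `sklearn.model_selection.LeaveOneGroupOut` and `GroupKFold` produce the same kind of split when given a `groups` array.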

I think the overfitting arises because random forest expects independent data. Is it possible to tell the random forest about the hierarchical structure of the data?
Or is there another powerful ensemble or shrinkage method that can handle high-dimensional grouped data with a strong interaction structure?

Any hints on how I can do better?

Best Answer

Very late to the party as well, but I think that could be related to something I did a few years ago. That work got published here:

http://journals.plos.org/plosone/article?id=10.1371/journal.pone.0093379

and is about dealing with variable correlation in ensembles of decision trees. You should have a look at its bibliography, which points to many proposals for dealing with this type of issue (which is common in the "genetic" area).

The source code is available here (but is not really maintained anymore).
