Solved – Has anyone implemented an autoencoder with random forests?

autoencoders, boosting, deep learning, random forest

I'm interested in exploring autoencoders, which can be used to learn a compressed representation of data that is useful for machine learning.

In my experience, random forests are easier to work with and more flexible than linear models, so I'd like to try using them to build an autoencoder.

One might do this by using random forests to predict multiple outcomes and then re-representing each data point as a binary sequence corresponding to the branches that it took. For example, if a forest consisted of two trees with three branches each, the code 011 101 would represent a data point that took the second and third branches of the first tree and the first and third branches of the second tree.
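Here is a minimal sketch of that idea, assuming scikit-learn (the question doesn't name a tool, and load_iris is just placeholder data). A multi-output forest is fit to reconstruct X from X, autoencoder-style, and decision_path() recovers each sample's binary code over the branches it traversed:

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestRegressor

X, _ = load_iris(return_X_y=True)

# Multi-output regression: predict every input column at once,
# autoencoder-style (reconstruct X from X).
forest = RandomForestRegressor(n_estimators=2, max_depth=2, random_state=0)
forest.fit(X, X)

# indicator: sparse (n_samples, total_nodes) 0/1 matrix, one bit per
# node a sample passed through; n_nodes_ptr marks where each tree's
# block of columns begins.
indicator, n_nodes_ptr = forest.decision_path(X)
print(indicator.toarray()[0])  # binary path code for the first data point
```

Note that scikit-learn's forests accept a 2-D target, which covers the "predict multiple outcomes" part directly.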

Is anyone familiar with work like this? I am interested in papers, implementations of multi-outcome random forests, and techniques that convert random forests into binary representations of data points.

Edit: clarifying

Best Answer

You can use a 1-hot encoding: for a single tree, each example is represented by a vector with a 1 at the leaf it lands in, and for a forest you combine these vectors (either concatenated or OR'ed). This gives you an intermediate representation.
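A minimal sketch of the concatenated 1-hot leaf encoding, assuming scikit-learn (the answer doesn't tie this to any library):

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.preprocessing import OneHotEncoder

X, y = load_iris(return_X_y=True)
forest = RandomForestClassifier(n_estimators=10, random_state=0).fit(X, y)

# apply() gives each sample's terminal-leaf index in every tree.
leaves = forest.apply(X)  # shape (n_samples, n_trees)

# One-hot encode each tree's leaf column and concatenate across trees:
# a sparse 0/1 vector per example, with exactly one 1 per tree.
encoded = OneHotEncoder().fit_transform(leaves)
print(encoded.shape)
```

For a fully unsupervised variant, scikit-learn's RandomTreesEmbedding produces the same kind of concatenated one-hot leaf encoding using trees grown on random splits.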

Another option is to use the proximity measure [1] to compute an unsupervised sparse feature representation: a matrix M where M_ij is the number of times examples i and j terminated in the same leaf (over the entire forest). This matrix is large but sparse, and you can reduce its size.
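A sketch of how one might compute that proximity matrix, again assuming scikit-learn and reusing the leaf indices from apply():

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier

X, y = load_iris(return_X_y=True)
forest = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)

leaves = forest.apply(X)  # (n_samples, n_trees) leaf indices

# M[i, j] = number of trees in which samples i and j share a leaf.
n = X.shape[0]
M = np.zeros((n, n), dtype=int)
for t in range(leaves.shape[1]):
    col = leaves[:, t]
    M += col[:, None] == col[None, :]

print(M[0, 0])  # diagonal entries equal the number of trees
```

Equivalently, if L is the sparse one-hot leaf matrix from the previous sketch, M = L @ L.T, which stays sparse and scales better than the dense loop above.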

Does either of those give a useful intermediate representation? I don't know of any attempts at deep learning with random forests.

[1] http://www.stat.berkeley.edu/~breiman/RandomForests/cc_home.htm#prox
