Solved – How does random forest work for regression

machine learning, random forest, regression

I am an absolute beginner in the field of machine learning. I started on the Titanic assignment on Kaggle and read somewhere that random forest is the best fit. I started reading about random forests and found the explanation by Edwin Chen in this question intuitive. This made me "understand" how I can solve the Titanic assignment, which predicts whether a passenger survives or not (classification). But I cannot understand how random forest works for regression, where the target is continuous.

Please don't hesitate to point out any mistakes in my assumptions or in the way I started things. Any advice would be helpful. The field looks vast and I don't even know where to begin.

Best Answer

Basically, there are two differences from the classification case:

When building the model/tree, a different criterion is used to split the data; for example, for a binary split the goal is to choose the split variable and split value so that the sum of the variances of the target/output variable in the two resulting groups of data points is minimal.
When you predict values, you use the mean value of the target/output variable over all data points in the leaf node.
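The two differences can be sketched for a single feature and a single split. This is a minimal illustration, not any library's implementation: the function names and the toy data are made up, and the split cost is the sum of squared errors of the two children (variance weighted by group size), which is one common reading of the criterion described here.

```python
from statistics import mean

def sse(values):
    """Sum of squared deviations from the mean (variance times group size)."""
    if not values:
        return 0.0
    m = mean(values)
    return sum((v - m) ** 2 for v in values)

def best_split(xs, ys):
    """Return (threshold, cost) of the split that minimises the children's summed SSE."""
    pairs = sorted(zip(xs, ys))
    best = (None, float("inf"))
    for i in range(1, len(pairs)):
        if pairs[i - 1][0] == pairs[i][0]:
            continue  # no threshold can separate equal feature values
        threshold = (pairs[i - 1][0] + pairs[i][0]) / 2
        left = [y for _, y in pairs[:i]]
        right = [y for _, y in pairs[i:]]
        cost = sse(left) + sse(right)
        if cost < best[1]:
            best = (threshold, cost)
    return best

def predict(x, threshold, xs, ys):
    """Predict with the mean target of the leaf (side of the split) that x falls into."""
    leaf = [y for xi, y in zip(xs, ys) if (xi <= threshold) == (x <= threshold)]
    return mean(leaf)

# Toy data (hypothetical): two clearly separated groups
xs = [1, 2, 3, 10, 11, 12]
ys = [5.0, 6.0, 7.0, 20.0, 21.0, 22.0]
threshold, cost = best_split(xs, ys)
low_prediction = predict(2, threshold, xs, ys)    # mean of the left leaf
high_prediction = predict(11, threshold, xs, ys)  # mean of the right leaf
```

A real tree repeats this search over all features and recursively inside each child, and a random forest averages the predictions of many such trees grown on bootstrap samples.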

Some variants I saw:

For splitting, you might want to minimize the sum of standard deviations, the weighted sum of variances, etc.
For prediction values, you can also use a trimmed mean, the median, or even another model (such as a linear model fitted on the instances in the node).
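To see how the prediction variants differ, here is a small sketch comparing the mean, the median, and a trimmed mean on the targets of one leaf. The leaf values and the `trimmed_mean` helper (with its 20% trimming proportion) are made up for illustration.

```python
from statistics import mean, median

def trimmed_mean(values, proportion=0.2):
    """Mean after dropping the given proportion of values from each end."""
    ordered = sorted(values)
    k = int(len(ordered) * proportion)
    kept = ordered[k:len(ordered) - k] if k else ordered
    return mean(kept)

# Hypothetical target values in one leaf, with a single outlier
leaf_targets = [10.0, 11.0, 12.0, 13.0, 100.0]

leaf_mean = mean(leaf_targets)             # pulled up by the outlier
leaf_median = median(leaf_targets)         # robust to the outlier
leaf_trimmed = trimmed_mean(leaf_targets)  # drops the smallest and largest value first
```

The mean is the usual choice because it minimizes squared error within the leaf; the median and trimmed mean trade some of that for robustness to outliers.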

Related Solutions

Solved – Linear Regression Intuition behind least squares

The solution $\beta=(x^Tx)^{-1}x^Ty$ can be justified by the following three arguments:

It is a method of moments estimator, which solves certain population moment conditions.
It minimizes the L2 norm of the residuals.
It is the maximum likelihood estimator when the errors follow a Gaussian distribution.

The second argument is purely about mathematical optimization and does not rely on the statistical properties of the estimator.
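The optimization argument can be written out in one step, using the same symbols as the solution above and assuming $x^Tx$ is invertible. The objective is the squared L2 norm of the residuals:

$$S(\beta) = \|y - x\beta\|_2^2 = (y - x\beta)^T (y - x\beta)$$

Setting its gradient with respect to $\beta$ to zero gives the normal equations and then the solution:

$$\frac{\partial S}{\partial \beta} = -2\,x^T (y - x\beta) = 0 \;\Longrightarrow\; x^Tx\,\beta = x^Ty \;\Longrightarrow\; \beta = (x^Tx)^{-1}x^Ty$$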

The Gauss-Markov-Aitken theorem states that among linear unbiased estimators, (generalized) least squares has minimum variance, so that it is BLUE (best linear unbiased estimator). The main requirement is that the errors are spherical, that is, uncorrelated with constant variance.

Solved – Effect of categorical interaction terms with random forest machine learning algorithm

I think your questions are very interesting. I spend some of my time looking at the effective mapping curvature of random forest (RF) model fits. RF can capture some orders of interactions, depending on the situation: x1 * x2 is a two-way interaction, and so on. You did not write how many levels your categorical predictors have, and it matters a lot. For continuous variables (many levels), often no more than multiple local two-way interactions can be captured. The problem is that the RF model itself only splits the data and does not transform it. RF is therefore stuck with local univariate splits, which are not optimal for capturing interactions, and is fairly shallow compared to deep learning.

At the complete other end of the spectrum are binary features. I did not know how deep RF can go, so I ran a grid-search simulation. RF seems to capture up to some 4-8 orders of interactions for binary features. I used 12 binary variables and 100 to 15000 observations. E.g. for the 4th-order interaction, the target vector y is:

orderOfInteraction = 4
y = factor(apply(X[,1:orderOfInteraction],1,prod))

where every element of X is either -1 or 1 and the product of the first four columns of X is the target to predict. All four variables are completely complementary: there are no main effects and no 2nd- or 3rd-order effects. The OOB prediction error will therefore reflect only how well RF can capture an interaction of the Nth order.
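The claim that such a target carries no lower-order signal can be checked directly. The sketch below is an illustration in Python rather than the R used for the simulation, and it assumes a full factorial design (every combination of -1 and 1 exactly once) instead of random sampling, so that the cancellation is exact; the helper name is made up.

```python
from itertools import combinations, product
from math import prod

order = 4
rows = list(product([-1, 1], repeat=order))  # all 16 combinations of four binary variables
y = [prod(row) for row in rows]              # target: product of the four columns

def max_abs_cell_mean(rows, y, size):
    """Largest |mean of y| over all cells obtained by fixing `size` of the variables."""
    worst = 0.0
    for cols in combinations(range(len(rows[0])), size):
        for levels in product([-1, 1], repeat=size):
            cell = [yi for row, yi in zip(rows, y)
                    if all(row[c] == lvl for c, lvl in zip(cols, levels))]
            worst = max(worst, abs(sum(cell) / len(cell)))
    return worst

# Fixing 1, 2 or 3 variables leaves the mean of y at zero in every cell ...
lower_order = [max_abs_cell_mean(rows, y, k) for k in (1, 2, 3)]
# ... and only fixing all 4 variables determines y.
full_order = max_abs_cell_mean(rows, y, 4)
```

Because every cell defined by fewer than four variables has mean zero, a tree gains nothing from its first three splits on this target and only separates the classes once all four variables have been split on along a path.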

Things that help RF capture higher orders of interaction: plenty of observations, few levels in the variables, few variables.

Limiting factors for RF capturing higher orders: the opposite of the above, limited sampsize, limited maxnodes, and redundant/sufficient lower-order information.

The last one means that if RF can find the same information in low-order interactions, there is, so to say, no need to go deeper. The information may not even be redundant; it just has to be sufficient for RF to make correct binary predictions.

Depth of random forest: OOB err.rate vs. observations vs. order of interaction

  rm(list=ls())
  library(randomForest)
  library(parallel)
  library(rgl)

  simulate.a.forest = function(std.pars,ite.pars) {
    #merge standard parameters with iterated parameters
    run.pars = c(std.pars,ite.pars)

    #simulate binary (-1/1) predictors; the target is the product of the first intOrder columns
    X = replicate(run.pars$vars,sample(c(-1,1),run.pars$obs,replace=TRUE))
    y = factor(apply(X[,1:run.pars$intOrder],1,prod))

    #run forest with run.pars; parameters with names randomForest does not use are ignored
    rfo = do.call(randomForest, run.pars)

    #fetch the OOB error rate after the last tree and return it
    out = rev(rfo$err.rate[,1])[1]
    names(out) = paste(ite.pars,collapse="-")[1]
    return(out)
  }

  ## Let's try some situations (you can also pass arguments to randomForest here)
  intOrders = c(2,3,4,5,6,12) #hidden signal is an N-way interaction (Nth order)
  obss = c(100,500,1000,3500,7000,15000) #available observations

  ## Produce list of all possible combinations of parameters
  ite.pars.matrix = expand.grid(intOrder=intOrders,obs=obss)
  n.runs = dim(ite.pars.matrix)[1]
  ite.pars.list   = lapply(1:n.runs, function(i) ite.pars.matrix[i,])

  i=1 ##for test-purposes
  out = mclapply(1:n.runs, function(i){
    #line below can be run alone without mclapply to check for errors before going multicore
    out = simulate.a.forest(std.pars=alist(x=X,y=y,
                                           ntree=250,
                                           vars=12),
                                           #sampsize=min(run.pars$obs,2000)),
                            ite.pars=ite.pars.list[[i]])
    return(out)
  })

  ##view grid results
  persp3d(x = intOrders,xlab="Nth order interaction",
          y = log(obss,base=10),ylab="10log(observations)",
          z = matrix(unlist(out),nrow=length(intOrders)),zlab="OOB prediction error, binary target",
          col=c("grey","black"),alpha=.2)

  rgl.snapshot(filename = "aweSomePlot.png", fmt = "png", top = TRUE)
