Random Forest – Low Explained Variance in R randomForest

r, r-squared, random-forest, regression

I am using randomForest in R for regression. I have many categorical predictors (all with the same 3 categories: 0, 1, 2) and I want to see which of them can predict the response (continuous). I am trying this with many different response variables (one at a time), and all the models have very low explained variance (basically 0, almost always negative).

I checked chi-square tests between pairs of predictors and removed the ones that could be associated (p-value < 0.05), but the result is the same.

My questions are:

1 – Is this possible? Am I doing something very wrong without noticing? If not:

2 – In random forest, do I have to throw everything away, or can I still use the variable importance to rank the predictors? (I don't think so, but since I couldn't find anything about this, I still hope I can get something out of it. By the way, why does the predicted vs. observed plot look good?) If not:

3 – Any suggestions? Also for alternative methods?

In the example below I don't divide the data into training and test sets for simplicity, but I did so in my code, with the same problem. Also, my original data set is much bigger (>500 observations and almost 100 predictors).
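For reference, the split I used looked roughly like this (a sketch; the 70/30 proportion and the seed are arbitrary choices):

## train/test split (sketch)
set.seed(1)
idx = sample(nrow(pred), 0.7 * nrow(pred))
RF.tr = randomForest(pred[idx, ], resp[idx], ntree = 500)
mean((predict(RF.tr, pred[-idx, ]) - resp[-idx])^2)   # test-set MSE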

## predictors
> pred
   X1 X2 X3 X4 X5 X6 X7 X8 X9 X10 X11 X12 X13 X14 X15 X16 X17 X18 X19 X20
1   0  0  0  0  0  0  0  0  0   0   0   0   0   0   0   0   0   0   0   0
2   0  1  2  2  2  0  1  2  0   0   1   0   0   1   1   2   2   1   1   2
3   0  1  0  2  2  1  1  2  1   1   2   1   0   0   1   2   2   2   0   0
4   0  0  1  1  1  1  1  1  1   1   1   1   0   0   2   0   2   2   0   1
5   0  1  1  2  2  0  1  2  2   1   2   0   0   0   1   1   0   2   0   1
6   1  1  0  2  2  1  1  1  2   1   0   1   0   1   1   2   2   2   1   2
7   0  1  1  1  1  1  2  1  2   1   2   1   0   1   1   2   1   2   1   1
8   0  1  2  1  0  1  0  2  1   1   1   2   0   0   1   2   1   2   1   2
........

## response
> resp

[1]  19.416  46.058  39.496  79.752 301.012 746.377 277.721  13.922  15.598  82.195  86.263
[12]  82.522  30.829 101.369  31.496  39.366 133.510

## find optimal value of mtry for randomForest
> bestmtry <- tuneRF(pred, resp, ntreeTry=100,
+                    stepFactor=1.5, improve=0.01, trace=F, plot=F, doBest=FALSE)

## extract optimal value of mtry for randomForest
> ind <- bestmtry[which.min(bestmtry[, 2]), 1]

## Random Forest
> RF <- randomForest(x = pred, y = resp, mtry=ind, ntree=500,
+                    keep.forest=TRUE, importance=TRUE)

> RF

Call:
     randomForest(x = pred, y = resp, ntree = 500, mtry = ind, importance = TRUE,          keep.forest = TRUE) 
               Type of random forest: regression
                     Number of trees: 500
No. of variables tried at each split: 9

          Mean of squared residuals: 32713.86
                    % Var explained: -6.5

## Low explained variance (pseudo R-squared)

> RF.pr = predict(RF,pred)

## the plot doesn't look that bad... though it does if I use the test data set
> plot(RF.pr, resp)
> abline(c(0,1),col=2)

[plot: RF.pr vs. resp with y = x reference line]

> varImpPlot(RF)

[plot: variable importance (varImpPlot)]

I have been stuck on this for a while now… any help is greatly appreciated.

Best Answer


A few changes that may help recover a little signal:

Scaling: RF is scale-invariant only with respect to the features, not to the response. Regression RF uses mean squared error as its loss function and cross-validated squared residuals to assess performance. Try taking the logarithm or square root of your response to lower the leverage of a few 'outliers'.
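For instance, a minimal sketch on the data above (assuming resp is strictly positive, so the log is defined):

## refit on a log-transformed response (sketch)
resp.log = log(resp)                  # or sqrt(resp) for a milder transform
RF.log = randomForest(pred, resp.log, ntree = 500)
print(RF.log)                         # % Var explained is now on the log scale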

Filtering: Use the function rfcv from randomForest to select variables. Otherwise, a linear filter may be useful.
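A sketch of rfcv use (cv.fold and step are arbitrary choices here):

## cross-validated MSE as a function of the number of variables (sketch)
cv = rfcv(trainx = pred, trainy = resp, cv.fold = 5, step = 0.5)
with(cv, plot(n.var, error.cv, type = "b", log = "x",
              xlab = "number of variables", ylab = "CV MSE"))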

Collinearity filtering: "I checked chi-square between pairs of variables and removed the ones that could be associated (p-value < 0.05), but the result is the same." Don't use the specific p-value threshold 0.05. Use whatever threshold, by whatever similarity measure, makes your model work (judged by CV performance). Also, did you remove both members of each pair? Dropping only one of the two is enough.
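A sketch of such a filter with a tunable cut-off (filter.pairs and alpha are illustrative names; assumes pred is a data frame):

## drop one member of each associated pair; vary the threshold, compare CV error
filter.pairs = function(dat, alpha) {
  keep = colnames(dat)
  for (i in seq_along(keep)) for (j in seq_len(i - 1)) {
    if (is.na(keep[i]) || is.na(keep[j])) next
    p = suppressWarnings(chisq.test(table(dat[[keep[i]]], dat[[keep[j]]]))$p.value)
    if (!is.na(p) && p < alpha) keep[i] = NA    # drop only one of the pair
  }
  dat[, keep[!is.na(keep)], drop = FALSE]
}
pred.f = filter.pairs(pred, alpha = 0.05)       # alpha is a knob, not a rule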

Variable importance: The variable importance of a broken model should not be trusted.

Evaluating RF performance: That an RF fits its own training set is irrelevant. The trees of a regression RF are grown almost to maximal depth and will overfit the training set. Only cross-validation (hold-out segmentation, OOB, n-fold, etc.) can be used to assess performance. The following code shows how % Var explained is computed and how the OOB prediction is made.

library(randomForest)
obs = 500
vars = 100
#simulated predictors: 3-level factors, pure-noise response
X = data.frame(replicate(vars, sample(1:3, obs, replace=T)))
X[] = lapply(X, factor)
y = rnorm(obs, sd=5)^2

RF = randomForest(X, y, importance=T, ntree=20, keep.inbag=T)

#var explained as printed by randomForest
print(RF)
cat("% Var explained:\n", 100 * (1 - sum((RF$y - RF$predicted)^2) /
                                     sum((RF$y - mean(RF$y))^2)))

#how out-of-bag predicted values are formed:
#matrix with one row per observation and one column per tree
allTreePred = predict(RF, X, predict.all=T)$individual
#for the i'th sample, average only the trees where it was OOB (inbag==0)
OOBpred = sapply(1:obs, function(i) mean(allTreePred[i, RF$inbag[i,]==0]))

#the values match RF$predicted up to floating-point precision
hist(OOBpred - RF$predicted)

#if using RF to predict its own training data
Ypred = predict(RF, X)

#any observation will be in-bag in ~63% of the trees and thus influences its
#own prediction value; the following "simple" prediction plot therefore
#looks falsely promising
par(mfrow=c(1,2), mar=c(4,4,3,3))
ylims = range(c(Ypred, OOBpred))
plot(y, Ypred, ylim=ylims,
     main=paste0("simple pred\nR^2 = ", round(cor(y, Ypred), 2)))
plot(y, OOBpred, ylim=ylims,
     main=paste0("OOB prediction\nR^2 = ", round(cor(y, OOBpred), 2)))

[plot: training-set prediction vs. OOB prediction]