I have a data set with 10 predictors (both continuous and categorical), while the dependent variable is a factor with levels 0 or 1. The event rate in my data (% of actual 1s) is 10%. However, when I apply random forest, it classifies only 5% of observations as 1 and the remaining 95% as 0. Why would this happen? Is it only related to the kind of variables I have and the transformations I have done, or is it something that can be controlled by tuning parameters of the model?
Solved – Low accuracy in random forest
random forest
Related Solutions
(Edited answer)
A few changes that may help recover a little signal:
Scaling: RF is scaling-invariant in the features, but not in the response. RF regression uses mean squared error as its loss function and CV squared residuals to assess performance. Try taking the logarithm or square root of your response to reduce the leverage of a few 'outliers'.
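As a minimal sketch of the response-transform point (simulated data, assuming the randomForest package is available): fitting on log(y) instead of a heavy-tailed y keeps a few huge residuals from dominating the squared-error loss.

```r
library(randomForest)

set.seed(1)
obs <- 300
X <- data.frame(x = rnorm(obs))
y <- exp(1 + X$x + rnorm(obs))      # heavy right tail: a few very large values

RF.raw <- randomForest(X, y)        # squared error dominated by the outliers
RF.log <- randomForest(X, log(y))   # outlier leverage reduced on the log scale

#compare the OOB "% Var explained" reported on the two scales
print(RF.raw)
print(RF.log)
```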
Filtering: Use the function rfcv from randomForest to select variables. Alternatively, a linear filter may be useful.
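A sketch of rfcv usage on simulated data: it reports cross-validated error as the least important predictors are successively dropped, which suggests how many variables are worth keeping.

```r
library(randomForest)

set.seed(1)
obs <- 150; vars <- 20
X <- data.frame(matrix(rnorm(obs * vars), obs, vars))
y <- 2 * X[, 1] - X[, 2] + rnorm(obs)   # only the first two columns carry signal

cv <- rfcv(X, y, cv.fold = 5)           # CV error at decreasing numbers of predictors
cv$error.cv                             # error should bottom out near 2 predictors
with(cv, plot(n.var, error.cv, type = "b", log = "x",
              xlab = "number of predictors", ylab = "CV MSE"))
```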
Collinearity filtering: "I checked chi-square between pairs of variables and removed the ones that could be associated (p-value < 0.05), but the result is the same." Don't rely on a fixed p-value threshold of 0.05. Use whatever threshold, with whatever similarity measure, makes your model work (as judged by CV performance). Also, did you remove both members of each pair, or only one?
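One hypothetical alternative to a fixed p < 0.05 rule: score each factor pair with Cramer's V and tune the drop threshold against CV performance. The helper below is an illustration, not part of randomForest.

```r
#Cramer's V: association between two factors, on a 0..1 scale
cramers_v <- function(a, b) {
  tab  <- table(a, b)
  chi2 <- suppressWarnings(chisq.test(tab, correct = FALSE)$statistic)
  as.numeric(sqrt(chi2 / (sum(tab) * (min(dim(tab)) - 1))))
}

set.seed(1)
f1 <- factor(sample(1:3, 200, replace = TRUE))
f2 <- f1                                   # perfectly associated copy
f3 <- factor(sample(1:3, 200, replace = TRUE))

cramers_v(f1, f2)   # 1: drop one member of this pair
cramers_v(f1, f3)   # near 0: keep both
```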
Variable importance: The variable importance of a broken model should not be trusted.
Evaluating RF performance: That RF fits its own training set is irrelevant. The trees of RF regression are grown almost to maximal depth and will overfit the training set. Only cross-validation (segmentation, OOB, n-fold, etc.) can be used to assess performance. The following code shows how % var explained is computed and how the OOB prediction is made.
library(randomForest)
obs  = 500
vars = 100
#data.frame keeps the columns as factors (replicate alone would coerce them to integers)
X = data.frame(replicate(vars, factor(sample(1:3, obs, replace = TRUE))))
y = rnorm(obs, sd = 5)^2
RF = randomForest(X, y, importance = TRUE, ntree = 20, keep.inbag = TRUE)
#%var explained printed
print(RF)
cat("% Var explained:\n",
    100 * (1 - sum((RF$y - RF$predicted)^2) /
               sum((RF$y - mean(RF$y))^2)))
#how out-of-bag predicted values are formed
#matrix with one row per observation i and one column per tree j
allTreePred = predict(RF, X, predict.all = TRUE)$individual
#for the i'th sample, take the mean over those trees where it was OOB (inbag == 0)
OOBpred = sapply(1:obs, function(i) mean(allTreePred[i, RF$inbag[i, ] == 0]))
#we can see the values are the same up to float precision
hist(OOBpred - RF$predicted)
#if using RF to predict its own training data
Ypred = predict(RF, X)
#any observation is in-bag in ~63% of the trees and so influences its own
#prediction value; the first plot therefore looks falsely promising
par(mfrow = c(1, 2), mar = c(4, 4, 3, 3))
ylims = range(c(Ypred, OOBpred))
plot(y, Ypred,   ylim = ylims,
     main = paste("simple pred\nR^2 =", round(cor(y, Ypred), 2)))
plot(y, OOBpred, ylim = ylims,
     main = paste("OOB prediction\nR^2 =", round(cor(y, OOBpred), 2)))
There are several components to your question. These include (but are not limited to): 1) constrained variable selection when the number of observations (n) is small relative to the number of predictors (p), 2) heuristic selection vs optimization, 3) dealing with mixtures of distributions among the predictors, 4) comparison of the fit between predicted and actual, 5) finding an appropriate model for y, and 6) separating statistical understanding from pure, machine learning prediction.
I'm not an advocate of optimizing approaches to variable selection, which are wasteful of CPU. Moreover, given the smallish n and a p that is not so big anyway, I think an RF would be methodological overkill. A more useful model-building step (assuming some exploratory work has been done to assess whether transformations improve the fit) would be a heuristic evaluation of the pairwise relationships between y and the candidate predictors. The idea is that if a potential predictor does not have at least a modestly significant relationship with y, it can probably be eliminated. This could be done in an ANOVA-type context, using a relaxed significance level of p <= .15 or so for inclusion. Of course, causal purists would argue that tertiary (masking) and/or interaction effects can be lost this way but, in practice, these tertiary effects are usually small if they are significant at all. Besides, a better guide to including tertiary effects is prior theoretical insight. The advantage of using ANOVA is that it is invariant to the scale (mixture) of distributions, providing a measure (the F-statistic) of relative effect size and leading to a preliminary importance ranking of the candidate predictors. Of course, ANOVA relies on linear assumptions. If you think the relationships are nonlinear, there are now many tools for evaluating nonlinear dependence, but these require a level of sophistication that your question belies.
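The screening step described above might be sketched as follows (simulated data; the p <= 0.15 cutoff is the relaxed threshold mentioned in the text, and the one-way F-test works for both numeric and factor predictors):

```r
set.seed(1)
n <- 100
dat <- data.frame(y  = rnorm(n),
                  x1 = rnorm(n),
                  x2 = factor(sample(letters[1:3], n, replace = TRUE)),
                  x3 = rnorm(n))
dat$y <- dat$y + 0.8 * dat$x1          # only x1 carries real signal

#p-value of the one-way F-test of y against each candidate predictor
pvals <- sapply(setdiff(names(dat), "y"), function(v) {
  fit <- lm(reformulate(v, "y"), data = dat)
  anova(fit)[["Pr(>F)"]][1]
})

keep <- names(pvals)[pvals <= 0.15]    # relaxed inclusion threshold
keep
```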
The under/over-weighting you point out in the scatterplot is a bit of a red herring, since it is benchmarked against the 45-degree identity line. The better comparison would be to the line of best fit, which would be demonstrably balanced.
In terms of appropriate models for your data, I don't see any reason why classic OLS estimation wouldn't provide reasonable insight. Of course, there are other methods such as partial least squares which are designed specifically for situations where p>>n but your mixture of distributions precludes their use.
You haven't indicated what your "high-level" goal is. Are you simply trying to find a good, predictive fit or are you trying to uncover some underlying process in order to gain insight into causality? Either way, by choosing predictors that maximize the predictive fit, you have put a stake in the ground in terms of understanding causality.
Final model variable selection would be based on those variables that passed the threshold of relaxed significance and could be identified using the lasso, a widely available variable selection method.
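A sketch of lasso selection using the glmnet package (assumed available; simulated data): coefficients shrunk exactly to zero at the cross-validated penalty are dropped from the final model.

```r
library(glmnet)

set.seed(1)
n <- 100; p <- 20
X <- matrix(rnorm(n * p), n, p)
y <- X[, 1] - 2 * X[, 2] + rnorm(n)    # only the first two predictors matter

cvfit <- cv.glmnet(X, y)               # penalty strength chosen by cross-validation
beta  <- as.numeric(coef(cvfit, s = "lambda.1se"))[-1]   # drop the intercept
which(beta != 0)                       # indices of the retained predictors
```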
Best Answer
It is quite a common issue when dealing with unbalanced datasets. Try under- or oversampling, and/or choose a different performance measure for training (e.g. ROC AUC).
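In randomForest this can be sketched with a stratified sampsize (a balanced bootstrap for every tree) or by lowering the vote cutoff for the rare class; the data here are simulated with roughly a 10% event rate.

```r
library(randomForest)

set.seed(1)
n <- 2000
X <- data.frame(x1 = rnorm(n), x2 = rnorm(n))
y <- factor(rbinom(n, 1, plogis(-3 + 1.5 * X$x1)))   # rare class "1"

#balanced bootstrap: draw equally many 0s and 1s for every tree
n1 <- sum(y == "1")
RF.bal <- randomForest(X, y, sampsize = c(n1, n1), ntree = 500)
print(RF.bal)   # OOB confusion matrix is far less biased toward class 0

#alternatively, lower the vote fraction needed to predict class "1"
RF.cut <- randomForest(X, y, cutoff = c(0.8, 0.2), ntree = 500)
```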