Solved – Random forest variable importance in h2o (classification problem)

Tags: h2o, r, random forest

I cannot find out how the variable importance for classification problems is calculated in h2o. There is a Stack Overflow question asking the same thing, but the accepted answer does not help (it keeps referring to "squared error" where I would expect "accuracy" or "Gini impurity" to be used; the same goes for the paper linked in that SO thread).

Note that h2o seems to use a different methodology for calculating variable importance than the usual permutation approach; see the h2o documentation:

How is variable importance calculated for DRF? Variable importance is
determined by calculating the relative influence of each variable:
whether that variable was selected during splitting in the tree
building process and how much the squared error (over all trees)
improved as a result.
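Taken literally for a binary response encoded as 0/1 (this encoding is an assumption on my part; the documentation does not spell it out), I read this as saying that each split contributes its squared-error improvement

$$\Delta \text{SSE} = \text{SSE}_{\text{parent}} - \text{SSE}_{\text{left}} - \text{SSE}_{\text{right}}, \qquad \text{SSE} = \sum_i (y_i - \bar{y})^2 = n\,\hat{p}\,(1-\hat{p}),$$

where $\hat{p}$ is the fraction of positive observations in the node, and a variable's importance would be the sum of these improvements over all splits that use it.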

So, I tried to figure out how h2o calculates variable importance myself. Here is a simple single-tree example (using all the data for training)

library(h2o)
data(iris)

h2o.init()

# Collapse to a binary target: "virginica" vs. everything else
irisSimple <- iris
irisSimple$Species <- factor(ifelse(irisSimple$Species == "virginica",
                                    "virginica", "other"))

# Single tree, no row sampling, so the tree sees all 150 observations
mdl <- h2o.randomForest(x = setdiff(colnames(irisSimple), "Species"),
                        y = "Species", training_frame = as.h2o(irisSimple),
                        sample_rate = 1.0, ntrees = 1, seed = 1)

We can look inside the single tree by exporting it to a POJO

pojo <- capture.output(h2o.download_pojo(mdl))

Now extract and print the first split node

pojo[grepl("double pred = ", pojo)]
#double pred = (Double.isNaN(data[3]) || data[3 /* Petal.Width */] <1.75f ?

Calculate the left (true) and right (false) data bins

lBin <- irisSimple[irisSimple$Petal.Width < 1.75, ]
rBin <- irisSimple[irisSimple$Petal.Width >= 1.75, ]

Finally, calculate the accuracy increase

rootCorrect <- max(table(irisSimple$Species))  # 100 ("other" majority at the root)
lCorrect <- max(table(lBin$Species))           # 99 ("other" majority)
rCorrect <- max(table(rBin$Species))           # 45 ("virginica" majority)
accIncrease <- (lCorrect + rCorrect - rootCorrect) / nrow(irisSimple)
accIncrease
#[1] 0.29333

and compare to the h2o result

h2o.varimp(mdl)
#Variable Importances: 
#      variable relative_importance scaled_importance percentage
#1  Petal.Width           28.585253          1.000000   0.857558
#2 Petal.Length            3.081414          0.107797   0.092442
#3  Sepal.Width            1.000000          0.034983   0.030000
#4 Sepal.Length            0.666667          0.023322   0.020000

Summing up the relative importances, sum(h2o.varimp(mdl)$relative_importance), gives 33.33, which suggests that relative_importance refers to the accuracy increase: the naive model assigning "other" to all observations gets 50 of the 150 observations wrong (a fraction of 0.3333), the decision tree gets all 150 right, and 33.33 is exactly 100 times that fraction.

As you can see, my calculated accuracy increase of 0.29333 for the Petal.Width split point is larger than the h2o value of 0.28585.
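Out of curiosity, and since the quoted documentation literally says "squared error", here is a minimal sketch of that calculation for the same split, assuming the binary response is encoded as 0/1 (again an assumption on my part, not confirmed h2o internals)

# Squared error of a node under a 0/1 encoding of Species:
# sum((y - mean(y))^2) simplifies to n * p * (1 - p)
nodeSSE <- function(bin) {
  n <- nrow(bin)
  p <- mean(bin$Species == "virginica")
  n * p * (1 - p)
}

nodeSSE(irisSimple)  # 150 * (1/3) * (2/3) = 33.33, i.e. the varimp total
nodeSSE(irisSimple) - nodeSSE(lBin) - nodeSSE(rBin)
#[1] 27.59546

This lands in the same ballpark as the 28.585 reported for Petal.Width but does not match it exactly either; if the relative importance sums over all splits on a variable, deeper splits on Petal.Width in the same tree might account for the remainder, but that is speculation.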

So, I am wondering what numbers h2o is reporting…

BTW:

packageVersion("h2o")
#[1] ‘3.10.5.3’

Best Answer

This isn't necessarily an answer, more of a question; I just can't comment yet :(

Could it be as simple as where the split decision is made? In other words, if the tree says 1.75 is the split value, is that the exact place it splits, or is it rounded, so that it is actually splitting at 1.7544 or something? I'm not sure whether this would make the difference, but 0.29333 and 0.28585 are close enough that it looks like a rounding error... Just some thoughts. Sorry if it doesn't help.
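One quick way to probe this with the asker's setup (a sketch; irisSimple is the data frame from the question)

# Distinct Petal.Width values around the reported 1.75 threshold
sort(unique(irisSimple$Petal.Width))
# No values fall strictly between 1.7 and 1.8, so any threshold in that
# interval (1.75, 1.7544, ...) partitions the rows identically

If so, the exact threshold alone would not change the accuracy numbers above, and the discrepancy would have to come from somewhere else.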
