What does the z-score mean in Boruta?

boruta, feature selection, random forest, z-score

The Boruta algorithm performs the steps listed below. Can anyone explain to me what exactly the z-score means in this context?

I am referring to points 5 and 6 in the following list. I only know the general formula of the z-score: $z = \frac{x-\mu}{\sigma}$

1. Create duplicate copies of all independent variables. When the original data has fewer than 5 independent variables, create at least 5 copies using the existing variables.

2. Shuffle the values of the duplicate copies to remove their correlations with the target variable. These are called shadow features or permuted copies.

3. Combine the original variables with the shuffled copies.

4. Run a random forest classifier on the combined dataset and compute a variable importance measure (the default is Mean Decrease Accuracy) for each variable, where higher means more important.

5. Compute the Z score of each variable: the mean of the accuracy loss divided by the standard deviation of the accuracy loss.

6. Find the maximum Z score among shadow attributes (MZSA).

7. Tag variables as 'unimportant' when their importance is significantly lower than MZSA, and permanently remove them from the process.

8. Tag variables as 'important' when their importance is significantly higher than MZSA.

9. Repeat the above steps for a predefined number of iterations (random forest runs), or until all attributes are tagged either 'unimportant' or 'important', whichever comes first.
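To make the role of the shadow features and MZSA concrete, here is a minimal Python sketch of steps 1 to 3 and step 6. It is not the Boruta package's implementation: the helper names `add_shadow_features` and `hits_against_mzsa`, the `shadow_` prefix, the toy columns, and the Z scores are all made up for illustration, and the random forest run is replaced by hard-coded numbers. It also skips the "at least 5 copies" rule from step 1 and the significance test from steps 7 and 8, using a plain comparison against MZSA instead.

```python
import random


def add_shadow_features(columns, seed=0):
    """Steps 1-3: append a shuffled copy of every column (hypothetical helper)."""
    rng = random.Random(seed)
    combined = dict(columns)
    for name, values in columns.items():
        shadow = list(values)      # duplicate the column
        rng.shuffle(shadow)        # permute it to break any link to the target
        combined["shadow_" + name] = shadow
    return combined


def hits_against_mzsa(z_scores):
    """Step 6: find MZSA, then list the real features whose Z score exceeds it."""
    mzsa = max(z for name, z in z_scores.items() if name.startswith("shadow_"))
    hits = [name for name, z in z_scores.items()
            if not name.startswith("shadow_") and z > mzsa]
    return mzsa, hits


# Toy data: two original features, four observations each.
columns = {"age": [23, 35, 41, 52], "income": [30, 45, 80, 62]}
combined = add_shadow_features(columns)

# Assumed Z scores, standing in for the output of one random forest run.
z_scores = {"age": 6.2, "income": 1.1, "shadow_age": 1.8, "shadow_income": 2.4}
mzsa, hits = hits_against_mzsa(z_scores)
print(mzsa, hits)
```

With these assumed numbers, only `age` beats the best shadow feature in this run. The actual algorithm repeats such runs and decides from the accumulated results, as steps 7 to 9 describe.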

Best Answer

This is described in the second chapter of the paper "Feature Selection with the Boruta Package" by Kursa and Rudnicki:

Boruta algorithm is a wrapper built around the random forest classification algorithm [...] It is an ensemble method in which classification is performed by voting of multiple unbiased weak classifiers — decision trees. These trees are independently developed on different bagging samples of the training set. The importance measure of an attribute is obtained as the loss of accuracy of classification caused by the random permutation of attribute values between objects. It is computed separately for all trees in the forest which use a given attribute for classification. Then the average and standard deviation of the accuracy loss are computed. Alternatively, the $Z$ score computed by dividing the average loss by its standard deviation can be used as the importance measure. Unfortunately the $Z$ score is not directly related to the statistical significance of the feature importance returned by the random forest algorithm, since its distribution is not $N(0, 1)$ (Rudnicki, Kierczak, Koronacki, and Komorowski 2006). Nevertheless, in Boruta we use $Z$ score as the importance measure since it takes into account the fluctuations of the mean accuracy loss among trees in the forest.
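So the $Z$ score here is not the standardization $z = \frac{x-\mu}{\sigma}$ of a single value. It is the ratio of the mean accuracy loss across trees to the standard deviation of that loss. A small Python sketch of that ratio follows. The per-tree losses are invented numbers, the function name `z_score_importance` is hypothetical, and using the sample standard deviation is an assumption of this sketch rather than something the quoted text specifies.

```python
from statistics import mean, stdev


def z_score_importance(per_tree_losses):
    """Mean accuracy loss over trees divided by its standard deviation.

    Uses the sample standard deviation (n - 1), which is an assumption here.
    """
    return mean(per_tree_losses) / stdev(per_tree_losses)


# Invented accuracy losses of two features, one value per tree.
steady = [0.04, 0.05, 0.06, 0.05, 0.05]    # small, consistent loss
noisy = [0.00, 0.15, -0.05, 0.20, -0.05]   # same mean loss, large spread

z_steady = z_score_importance(steady)
z_noisy = z_score_importance(noisy)
print(round(z_steady, 2), round(z_noisy, 2))
```

Both features have the same mean accuracy loss of 0.05, but the first gets a much higher $Z$ score because its loss barely fluctuates between trees. This is the sense in which the $Z$ score "takes into account the fluctuations of the mean accuracy loss among trees in the forest".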