Solved – Random forest regression produce different importance ranking

importancerrandom forestregression

I've been working a random forest regression model with my own datasets using the randomForest package.

The model produces a % var explained of ~27%. I also used the importance() function to produce a ranking of the relative importance of the variables, but everytime I use a different number for the set.seed() function, the random forest model produces a different ranking. It should be mentioned that I have several variables whose importance are relatively close.

I am wondering is there a consistent way to produce the ranking of the variable importance?

Best Answer

No- no consistent variable importance value is possible in this case.

The variable importance changes because the underlying model itself changes each time you change the seed value. Different models will have different values for variable importance.

One thing that you could do would be run many (say, 10) Random Forest models with different seed values and average the variable importance scores across the models- this would get you a better approximation of what you could expect variable importance to be on average from each individual Random Forest model you train on the dataset.

Related Solutions

Solved – random forest variables importance with continuous and categorical variables and unbalanced output

Random forests for classification might use two kind of variable importance. See the original description of the RF here.

"I know that the standard approach based the Gini impurity index is not suitable for this case due the presence of continuos and categorical input variables"

This is plain wrong. The gini impurity is build using only the proportions of the target/dependent variable, when split by a test which involves either numerical or nominal independent variable. Note that the independent variable plays a role only for building the split test, the computation of gini index is based only on counts on dependent variable after split. Of course, the gini impurity index on each node is used further to compute gini importance.

I do not know if that would count, but my personal experiments revealed taht there are no big differences between gini variable importance and permutation value importance. And I usually prefered the former.

The second problem is the unbalance of the samples labeled with 1 and 0. I think this might play a role on variable importance, but to be honest I would verify if this is the case. Thus I would repeat many times various computations of variable importance with various sampels having different proportions varying gradually from 0.5 ratio to the actual ration. I expect that finding a stable variable importance no matter proportion to not be so unexpected.

[later edit]

It took me some time to compile the document provided by @Donbeo. I agree with the results from that paper and I hope that I would further experiment myself with that. The only think which I do not like about that study is that it does not state which number of trees were used and what would imply the variation of this parameter. The single note regarding that is that the number of trees affects the scaled version for permutation tests.

Solved – Why gradient boosting/random forest generate “unstable” feature importance

A little toy example that might provide some perspective.

Let's create a dataset with a number of features that have the same informative content. What the dataset says, in a nutshell is: all the features for class 1 lie in a specific range. Same holds for class 0. In order to classify correctly the dataset, it would be sufficient to look at only one of the generated features.
Let's feed this to a classifier to extract the calculated feature importance score; and let's repeat this experiment a number of times.
Let's chart the importance of each feature as calculated in each experiment.

Note that the train set is set constant.

We repeat the same steps with a dataset where instead only 3 features are meaningful (equally meaningful).

import pandas as pd
import numpy as np
from sklearn import ensemble
import seaborn as sns
import matplotlib.pyplot as plt

N = 1000
def generate_redundant_features(low, high, class_val, n_feats=9):
    df = pd.DataFrame({
        'ft_'+str(i): np.random.uniform(low=low, high=high, size=N) for i in range(0, n_feats)
    })
    df["C"] = class_val
    return df

c0 = generate_redundant_features(0.0, 0.6, 0.0)
c1 = generate_redundant_features(0.6, 1.0, 1.0)
data_with_redundant_features = c0.append(c1, ignore_index=True)

def calculate_feature_importances(values, classifier, n_feats=9):
    features = [
        "ft_"+str(i) for i in range(0,n_feats)
    ]
    if classifier == "rf":
        clf = ensemble.RandomForestClassifier()
    elif classifier == "gbm":
        clf = ensemble.GradientBoostingClassifier()
    else:
        raise ValueError("I don't work with such a classifier")
    clf.fit(values[features], values.C)
    importances = [
        {
            'feature': 'ft_'+ str(i),
            'value': clf.feature_importances_[i]
        }
        for i in range(0, n_feats)
    ]
    return importances


def run_feature_importance_experiments(data, classifier, number_of_iterations=30):
    feature_importances = []
    for i in range(0, number_of_iterations):
        feature_importances += calculate_feature_importances(data, classifier)
    return pd.DataFrame(feature_importances)


def generate_data_with_three_meaningful_features(n_feats=9):
    df = pd.DataFrame({
        'ft_'+str(i): np.random.uniform(size=N) for i in range(0, n_feats)
    })
    df["C"] = ((df.ft_1 > 0.5) & (df.ft_2 > 0.5) & (df.ft_3 > 0.5)).astype(int)
    return df

data_only_three_meaningful_features = generate_data_with_three_meaningful_features()


def chart_by_classifier(classifier):
    # run the experiments where we calculate the importances
    df_importances_redundant_features = run_feature_importance_experiments(data_with_redundant_features, classifier_type)
    df_only_three_meaningful_feature = run_feature_importance_experiments(data_only_three_meaningful_features, classifier_type)
    # produce the chart
    f, (ax1, ax2) = plt.subplots(2)
    sns.stripplot(x="feature", y="value", data=df_importances_redundant_features, jitter=0.1, ax=ax1)
    sns.stripplot(x="feature", y="value", data=df_only_three_meaningful_feature, jitter=0.1, ax=ax2)
    plt.suptitle ("Classifier: " + classifier)

for classifier_type in ["rf", "gbm"]:
    chart_by_classifier(classifier_type)

This is how the importance features change across the experiments, when we use a random forest classifier (rf). The top chart is the case of redundant features. The latter is the dataset where only three features are meaningful.

and a gradient boosting machine (gbm):

A few notes:

The volatility in the feature importance scores depends on the degree of "redundancy" in the features, where "redundancy" could be measured in many different ways: correlation, mutual information, ..

If we compare the bottom charts for the rf and the gbm, we see a rather common(*) situation: the rf regularization mechanism (the sampling of feature every time a new decision tree is grown) introduces "variance" in the importance scores (but note that the bubbles for the three meaningful features wiggle around 0.3). The RF might also assign non-zero scores to meaningless variables.

On the other hand, the gbm pins down the scores. This is a result of the "boosting". Nevertheless, you've got to be careful: you will have to bootstrap your data (as mentioned earlier). If we sample 5 different sets from the same distribution and calculate the importance scores generated by the gbm for the 3 relevant features:

runs = 5
f, axarr = plt.subplots(1, runs, sharey=True)
for e in range(0, runs):
    a_sample_with_three_meaningful_features = generate_data_with_three_meaningful_features()
    scores_for_this_experiement = run_feature_importance_experiments(a_sample_with_three_meaningful_features, "gbm")
    ftrs_charted = ["ft_" + str(i) for i in range(1, 4)]
    only_meaningful_features = scores_for_this_experiement[scores_for_this_experiement.feature.isin(ftrs_charted)]
    sns.stripplot(x="feature", y="value", data=only_meaningful_features, jitter=0.1, ax=axarr[e])

.. here we go. All it makes sense I guess: in the end, the three relevant features are all "equally important". If we would run the gbm on N samples, the scores for the 3 features would average 0.3.

(*) based on my "practical experience" - it'd be cool to see some formal piece of literature on this

Best Answer

Related Solutions

Solved – random forest variables importance with continuous and categorical variables and unbalanced output

Solved – Why gradient boosting/random forest generate “unstable” feature importance

Related Question