The problem
The RBF kernel function for two vectors $\mathbf{u}$ and $\mathbf{v}$ looks like this:
$$
\kappa(\mathbf{u},\mathbf{v}) = \exp(-\gamma \|\mathbf{u}-\mathbf{v}\|^2).
$$
Essentially, your results indicate that your values for $\gamma$ are far too high. When that happens, the kernel matrix essentially becomes the identity matrix: $\|\mathbf{u}-\mathbf{v}\|^2$ is larger than 0 if $\mathbf{u}\neq \mathbf{v}$ and exactly 0 otherwise, which leads to kernel values of $\approx 0$ and $1$ respectively when $\gamma$ is very large (consider the limit $\gamma=\infty$).
This then leads to an SVM model in which all training instances are support vectors, and this model fits the training data perfectly. Of course, when you predict on a test set, all predictions will be identical to the model's bias $\rho$, because the kernel evaluations are all approximately zero, i.e.:
$$
f(\mathbf{z}) = \underbrace{\sum_{i\in \mathrm{SV}} \alpha_i y_i \kappa(\mathbf{x}_i, \mathbf{z})}_{\approx\, 0} + \rho,
$$
where $\mathbf{x}_i$ is the $i$-th support vector and $\alpha_i$ is its corresponding dual weight.
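To make this concrete, here is a tiny sketch (with made-up data, using sklearn's rbf_kernel) showing how the kernel matrix collapses towards the identity matrix as $\gamma$ grows:

import numpy as np
from sklearn.metrics.pairwise import rbf_kernel

X = np.random.RandomState(0).rand(5, 10)  # 5 made-up training vectors

for gamma in (0.1, 10, 1000):
    K = rbf_kernel(X, gamma=gamma)
    # off-diagonal entries shrink towards 0 as gamma grows, the diagonal stays 1
    print(gamma)
    print(np.round(K, 3))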
The solution
Your search space needs to be expanded to far lower values of $\gamma$. Typically we use exponential grids, e.g. $10^{\mathrm{lb}} \leq \gamma \leq 10^{\mathrm{ub}}$, where the bounds $\mathrm{lb}$ and $\mathrm{ub}$ are data dependent (e.g. $[-8, 2]$).
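For example, such an exponential grid over the example bounds above can be built with numpy:

import numpy as np

gamma_grid = np.logspace(-8, 2, num=11)  # 1e-8, 1e-7, ..., 1e2, one value per decade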
I suspect you're using grid search at the moment, which is a very poor way to optimize hyperparameters because it wastes most of its time investigating regions of the search space that aren't good for your problem.
It's far better to use optimizers that are designed for such problems, available in libraries like Optunity and Hyperopt. I'm the main developer of Optunity; you can find an example that does exactly what you need (i.e., tune an sklearn SVC) in our documentation.
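The Optunity example itself is in its documentation; as a rough sketch of the same idea with Hyperopt (the bounds, and the X and y variables, are assumptions you should adapt to your data):

import numpy as np
from hyperopt import fmin, tpe, hp
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

def objective(params):
    # Hyperopt minimizes, so return the negative cross-validated accuracy
    score = cross_val_score(SVC(C=params['C'], gamma=params['gamma']), X, y, cv=5).mean()
    return -score

space = {
    'C': hp.loguniform('C', np.log(1e-2), np.log(1e3)),
    'gamma': hp.loguniform('gamma', np.log(1e-8), np.log(1e2)),
}
best = fmin(objective, space, algo=tpe.suggest, max_evals=100)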
I have to admit, I initially thought chi2 and f_classif might be the culprits, so I quickly wrote the two functions below.
One looks at the feature importances calculated by a random forest classifier:
def get_rf_feat_importances(X, Y):
    from sklearn.ensemble import RandomForestClassifier
    rf = RandomForestClassifier()
    rf.fit(X, Y)
    return rf.feature_importances_
And the other plots the regularisation path:
def get_LARS_Lasso_path(X, Y):
    import numpy as np
    import matplotlib.pyplot as plt
    from sklearn import linear_model
    # compute the LARS path with the lasso modification
    alphas, _, coefs = linear_model.lars_path(X.values, Y.values, method='lasso', verbose=True)
    # x-axis: fraction of the maximal L1 norm of the coefficients
    xx = np.sum(np.abs(coefs.T), axis=1)
    xx /= xx[-1]
    plt.plot(xx, coefs.T)
    ymin, ymax = plt.ylim()
    plt.vlines(xx, ymin, ymax, linestyle='dashed')
    plt.xlabel('|coef| / max|coef|')
    plt.ylabel('Coefficients')
    plt.title('LASSO Path')
    plt.axis('tight')
    plt.savefig('Lasso_Path.png')
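A minimal usage sketch for both functions, assuming the data come from make_classification (the parameters below are placeholders, not the ones from the question):

import pandas as pd
from sklearn.datasets import make_classification

# placeholder data: 2 informative features out of 20
X, Y = make_classification(n_samples=500, n_features=20, n_informative=2,
                           n_redundant=0, random_state=0)

print(get_rf_feat_importances(X, Y))

# get_LARS_Lasso_path uses .values, so pass pandas objects
get_LARS_Lasso_path(pd.DataFrame(X), pd.Series(Y))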
To my surprise, these show similar results. The feature importances generated by the first and the regularisation path generated by the second sometimes indicate the same number of informative features (especially when that number is 2), but in most cases the number of informative features they indicate is smaller than what was passed to the make_classification function.
Answers:
First, to question 2): from my two functions above, it seems the phenomenon is not specific to the chi2 or f_classif scores. What these two scores do is already explained well here, so I am not going to repeat it.
1) The only thing I can think of here is that all of these methods look at the individual importances of the variables. It is possible that the informative features are correlated among themselves, and accounting for one feature's contribution to predictive performance may render the others redundant. This is explained in this comprehensive (albeit slightly dated) review; a small sketch of the effect follows the quote below.
In Section 4.2, we introduced nested subset methods that provide a
useful ranking of subsets, not of individual variables: some variables
may have a low rank because they are redundant and yet be highly
relevant.
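As a small illustration of that effect (the make_classification settings below are hypothetical), redundant features, which are linear combinations of the informative ones, end up sharing the importance mass with them:

import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# 3 informative features plus 3 redundant (correlated) combinations of them
X, Y = make_classification(n_samples=1000, n_features=10, n_informative=3,
                           n_redundant=3, shuffle=False, random_state=0)

rf = RandomForestClassifier(random_state=0).fit(X, Y)
# with shuffle=False, columns 0-2 are informative, 3-5 redundant, 6-9 noise;
# the importance tends to be spread over the six correlated columns
print(np.round(rf.feature_importances_, 3))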
Best Answer
The answer to this question describes how feature importances are computed in sklearn. Maybe it will help you with your questions #1 and #3.
Regarding question #1: It does not seem that this definition of importance is explicitly related to statistical significance.
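If you want at least a rough sense of how variable those importances are (not a formal significance test), one option is to look at their spread across the individual trees of a fitted forest (the variable name forest below is hypothetical):

import numpy as np

importances = forest.feature_importances_
std = np.std([tree.feature_importances_ for tree in forest.estimators_], axis=0)
for i in np.argsort(importances)[::-1]:
    print(f"feature {i}: {importances[i]:.3f} +/- {std[i]:.3f}")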
Regarding question #2: You could still report the feature importances reported by sklearn, but they would be defined relative to your augmented data set and therefore seem more difficult to interpret. Using this augmented data set would at least change the "proportion of samples reaching that node" portion of the importance score equation (when compared with the original data set). By the way, I've had good results using SMOTE over-sampling of the minority class, combined with under-sampling of the majority class, when dealing with imbalanced data sets.
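A minimal sketch of that resampling combination, assuming the imbalanced-learn (imblearn) package is available; the sampling ratios are placeholders:

from imblearn.over_sampling import SMOTE
from imblearn.under_sampling import RandomUnderSampler
from imblearn.pipeline import Pipeline
from sklearn.ensemble import RandomForestClassifier

model = Pipeline([
    ('oversample', SMOTE(sampling_strategy=0.5)),                 # grow the minority class to 50% of the majority
    ('undersample', RandomUnderSampler(sampling_strategy=1.0)),   # then shrink the majority to match
    ('clf', RandomForestClassifier()),
])
# model.fit(X_train, y_train)  # X_train / y_train are your own training data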
Regarding question #3: I'm not sure, but I don't think this definition of importance justifies the statement you make about 90% of the data being classified correctly by these features.