The problem
The RBF kernel function for two vectors $\mathbf{u}$ and $\mathbf{v}$ looks like this:
$$
\kappa(\mathbf{u},\mathbf{v}) = \exp(-\gamma \|\mathbf{u}-\mathbf{v}\|^2).
$$
Essentially, your results indicate that your values for $\gamma$ are far too high. When that happens, the kernel matrix essentially becomes the identity matrix: $\|\mathbf{u}-\mathbf{v}\|^2$ is larger than 0 if $\mathbf{u}\neq \mathbf{v}$ and exactly 0 otherwise, which leads to kernel values of $\approx 0$ and $1$ respectively when $\gamma$ is very large (consider the limit $\gamma=\infty$).
This then leads to an SVM model in which all training instances are support vectors, and this model fits the training data perfectly. Of course, when you predict on a test set, all predictions will be identical to the model's bias $\rho$, because the kernel evaluations are all approximately zero, i.e.:
$$
f(\mathbf{z}) = \underbrace{\sum_{i\in \mathrm{SV}} \alpha_i y_i \kappa(\mathbf{x}_i, \mathbf{z})}_{\approx\, 0} + \rho,
$$
where $\mathbf{x}_i$ is the $i$-th support vector and $\alpha_i$ is its corresponding dual weight.
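To make this concrete, here is a tiny sketch (with made-up data, using sklearn's rbf_kernel) showing how the kernel matrix collapses towards the identity matrix as $\gamma$ grows:

import numpy as np
from sklearn.metrics.pairwise import rbf_kernel

X = np.random.RandomState(0).rand(5, 10)  # 5 made-up training vectors

for gamma in (0.1, 10, 1000):
    K = rbf_kernel(X, gamma=gamma)
    # off-diagonal entries shrink towards 0 as gamma grows, the diagonal stays 1
    print(gamma)
    print(np.round(K, 3))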
The solution
Your search space needs to be expanded to far lower values of $\gamma$. Typically we use exponential grids, e.g. $10^{\mathrm{lb}} \leq \gamma \leq 10^{\mathrm{ub}}$, where the bounds $\mathrm{lb}$ and $\mathrm{ub}$ are data dependent (e.g. $[-8, 2]$).
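For example, such an exponential grid over the example bounds above can be built with numpy:

import numpy as np

gamma_grid = np.logspace(-8, 2, num=11)  # 1e-8, 1e-7, ..., 1e2, one value per decade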
I suspect you're using grid search at the moment, which is a very poor way to optimize hyperparameters because it wastes most of its time investigating regions of the search space that aren't good for your problem.
It's far better to use optimizers that are designed for such problems, available in libraries like Optunity and Hyperopt. I'm the main developer of Optunity; you can find an example that does exactly what you need (i.e., tune an sklearn SVC) in our documentation.
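The Optunity example itself is in its documentation; as a rough sketch of the same idea with Hyperopt (the bounds, and the X and y variables, are assumptions you should adapt to your data):

import numpy as np
from hyperopt import fmin, tpe, hp
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

def objective(params):
    # Hyperopt minimizes, so return the negative cross-validated accuracy
    score = cross_val_score(SVC(C=params['C'], gamma=params['gamma']), X, y, cv=5).mean()
    return -score

space = {
    'C': hp.loguniform('C', np.log(1e-2), np.log(1e3)),
    'gamma': hp.loguniform('gamma', np.log(1e-8), np.log(1e2)),
}
best = fmin(objective, space, algo=tpe.suggest, max_evals=100)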
I have to admit, I initially thought chi2 and f_classif might be the culprits, so I quickly wrote the two functions below.
One looks at the feature importances calculated by a random forest classifier:
def get_rf_feat_importances(X, Y):
    from sklearn.ensemble import RandomForestClassifier
    rf = RandomForestClassifier()
    rf.fit(X, Y)
    return rf.feature_importances_
And the other plots the regularisation path:
def get_LARS_Lasso_path(X, Y):
    import numpy as np
    import matplotlib.pyplot as plt
    from sklearn import linear_model
    # compute the LARS path with the lasso modification
    alphas, _, coefs = linear_model.lars_path(X.values, Y.values, method='lasso', verbose=True)
    # x-axis: fraction of the maximal L1 norm of the coefficients
    xx = np.sum(np.abs(coefs.T), axis=1)
    xx /= xx[-1]
    plt.plot(xx, coefs.T)
    ymin, ymax = plt.ylim()
    plt.vlines(xx, ymin, ymax, linestyle='dashed')
    plt.xlabel('|coef| / max|coef|')
    plt.ylabel('Coefficients')
    plt.title('LASSO Path')
    plt.axis('tight')
    plt.savefig('Lasso_Path.png')
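A minimal usage sketch for both functions, assuming the data come from make_classification (the parameters below are placeholders, not the ones from the question):

import pandas as pd
from sklearn.datasets import make_classification

# placeholder data: 2 informative features out of 20
X, Y = make_classification(n_samples=500, n_features=20, n_informative=2,
                           n_redundant=0, random_state=0)

print(get_rf_feat_importances(X, Y))

# get_LARS_Lasso_path uses .values, so pass pandas objects
get_LARS_Lasso_path(pd.DataFrame(X), pd.Series(Y))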
To my surprise, these show similar results. The feature importances generated by the first and the regularisation path generated by the second sometimes indicate the same number of informative features (especially when that number is 2), but in most cases the number of informative features they indicate is smaller than what was passed to the make_classification function.
Answers:
First, to question 2): from my two functions above, it seems the phenomenon is not specific to the chi2 or f_classif scores. What these two scores do is already explained well here, so I am not going to repeat it.
1) The only thing I can think of here is that all of these methods look at the individual importances of the variables. It is possible that the informative features are correlated among themselves, and accounting for one feature's contribution to predictive performance may render the others redundant. This is explained in this comprehensive (albeit slightly dated) review; a small sketch of the effect follows the quote below.
In Section 4.2, we introduced nested subset methods that provide a
useful ranking of subsets, not of individual variables: some variables
may have a low rank because they are redundant and yet be highly
relevant.
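As a small illustration of that effect (the make_classification settings below are hypothetical), redundant features, which are linear combinations of the informative ones, end up sharing the importance mass with them:

import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# 3 informative features plus 3 redundant (correlated) combinations of them
X, Y = make_classification(n_samples=1000, n_features=10, n_informative=3,
                           n_redundant=3, shuffle=False, random_state=0)

rf = RandomForestClassifier(random_state=0).fit(X, Y)
# with shuffle=False, columns 0-2 are informative, 3-5 redundant, 6-9 noise;
# the importance tends to be spread over the six correlated columns
print(np.round(rf.feature_importances_, 3))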
Best Answer
The answer to this question describes how feature importances are computed in sklearn. Maybe it will help you with your questions #1 and #3.
Regarding question #1: It does not seem that this definition of importance is explicitly related to statistical significance.
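If you want at least a rough sense of how variable those importances are (not a formal significance test), one option is to look at their spread across the individual trees of a fitted forest (the variable name forest below is hypothetical):

import numpy as np

importances = forest.feature_importances_
std = np.std([tree.feature_importances_ for tree in forest.estimators_], axis=0)
for i in np.argsort(importances)[::-1]:
    print(f"feature {i}: {importances[i]:.3f} +/- {std[i]:.3f}")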
Regarding question #2: You could still report the feature importances reported by sklearn, but they would be defined relative to your augmented data set and therefore seem more difficult to interpret. Using this augmented data set would at least change the "proportion of samples reaching that node" portion of the importance score equation (when compared with the original data set). By the way, I've had good results using SMOTE over-sampling of the minority class, combined with under-sampling of the majority class, when dealing with imbalanced data sets.
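A minimal sketch of that resampling combination, assuming the imbalanced-learn (imblearn) package is available; the sampling ratios are placeholders:

from imblearn.over_sampling import SMOTE
from imblearn.under_sampling import RandomUnderSampler
from imblearn.pipeline import Pipeline
from sklearn.ensemble import RandomForestClassifier

model = Pipeline([
    ('oversample', SMOTE(sampling_strategy=0.5)),                 # grow the minority class to 50% of the majority
    ('undersample', RandomUnderSampler(sampling_strategy=1.0)),   # then shrink the majority to match
    ('clf', RandomForestClassifier()),
])
# model.fit(X_train, y_train)  # X_train / y_train are your own training data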
Regarding question #3: I'm not sure, but I don't think this definition of importance justifies the statement you make about 90% of the data being classified correctly by these features.