How to use the weight vector of an SVM and logistic regression for feature importance

feature selection, logistic, machine learning, svm, t-test

I have trained an SVM and a logistic regression classifier on my dataset for binary classification. Both classifiers provide a weight vector whose length equals the number of features. I would like to use this weight vector to select the 10 most important features. To do that, I turned the weights into t-scores with a permutation test: I permuted the class labels 1000 times and recomputed the weight vector for each permutation. I then subtracted the mean of the permuted weights from the real weights and divided by the standard deviation of the permuted weights, which gives me one t-score per feature.
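The procedure described above could be sketched roughly as follows. This is only an illustration under assumptions: `permutation_scores` and `least_squares_weights` are hypothetical helper names, the data is synthetic, and the least-squares fit is a stand-in for the SVM or logistic regression so that the example needs nothing beyond NumPy.

```python
import numpy as np

def permutation_scores(X, y, fit_weights, n_permutations=1000, seed=0):
    """Standardise each weight against its label-permutation null distribution."""
    rng = np.random.default_rng(seed)
    observed = fit_weights(X, y)                      # weights on the real labels
    null = np.empty((n_permutations, X.shape[1]))
    for i in range(n_permutations):
        # Refit on shuffled labels to see which weights arise by chance alone.
        null[i] = fit_weights(X, rng.permutation(y))
    return (observed - null.mean(axis=0)) / null.std(axis=0, ddof=1)

def least_squares_weights(X, y):
    """Stand-in linear classifier (assumption): least squares on -1/+1 labels.

    This is NOT an SVM or logistic regression; it only supplies one weight
    per feature so the permutation logic can be demonstrated.
    """
    Xc = X - X.mean(axis=0)
    w, *_ = np.linalg.lstsq(Xc, y - y.mean(), rcond=None)
    return w

# Synthetic data: features 0 and 1 carry signal, features 2 and 3 are noise.
rng = np.random.default_rng(42)
X = rng.standard_normal((200, 4))
y = np.where(X[:, 0] - X[:, 1] + 0.5 * rng.standard_normal(200) > 0, 1.0, -1.0)

scores = permutation_scores(X, y, least_squares_weights, n_permutations=200)
top2 = np.argsort(-np.abs(scores))[:2]               # rank by absolute score
print(scores, top2)
```

With scikit-learn, the same function could presumably be called with something like `lambda X, y: LinearSVC().fit(X, y).coef_.ravel()` as `fit_weights`, since linear models there expose their weights through `coef_`.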

Should I use the absolute values of the t-scores, i.e. select the 10 features with the highest absolute values? Say the features have the following t-scores:

feature 1: 1.3
feature 2: -1.7
feature 3: 1.1
feature 4: -0.5

If I select the 2 most important features by highest absolute value, features 1 and 2 win. If I use the signed values instead, features 1 and 3 win.
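The two selection rules can be compared directly on the four example scores. The helper name `top_k` is made up for this sketch.

```python
# t-scores from the example in the question
t_scores = {"feature 1": 1.3, "feature 2": -1.7, "feature 3": 1.1, "feature 4": -0.5}

def top_k(scores, k, key):
    """Return the k feature names with the largest key(score)."""
    return sorted(scores, key=lambda name: key(scores[name]), reverse=True)[:k]

by_abs = top_k(t_scores, 2, abs)                # ranks by |t|
by_signed = top_k(t_scores, 2, lambda s: s)     # ranks by signed t

print(by_abs)      # ['feature 2', 'feature 1']
print(by_signed)   # ['feature 1', 'feature 3']
```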

Second, as I have read, this only works for an SVM with a linear kernel, not with an RBF kernel. With a non-linear kernel the weights are somehow no longer linear. What exactly is the reason that the weight vector cannot be used to determine feature importance for a non-linear kernel SVM?

Best Answer

1) Assuming you have properly pre-processed your data, I would use the absolute value of the weight. A negative value just means that the feature has a negative impact on the outcome; a large negative weight is still significant. (Note that this does not hold if the data is not standardized, because the magnitude of a weight then depends on the scale of its feature.)
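A small sketch of why standardization matters, under the assumption that an unregularized least-squares fit (hypothetical helper `linear_weights`) is an acceptable stand-in for a linear classifier. Both features are equally informative, but one is expressed in units 1000 times larger. For regularized models such as an SVM the relationship is not exactly inverse, but the same effect appears.

```python
import numpy as np

def linear_weights(X, y):
    """Stand-in linear model (assumption): least squares on -1/+1 labels."""
    Xc = X - X.mean(axis=0)
    w, *_ = np.linalg.lstsq(Xc, y - y.mean(), rcond=None)
    return w

rng = np.random.default_rng(0)
X = rng.standard_normal((500, 2))
y = np.where(X[:, 0] + X[:, 1] > 0, 1.0, -1.0)   # both features matter equally

# Express the second feature in different units (1000x larger values).
X_rescaled = X * np.array([1.0, 1000.0])
w_raw = linear_weights(X_rescaled, y)            # second weight shrinks ~1000x

# After standardization the two weights are comparable again.
X_std = (X_rescaled - X_rescaled.mean(axis=0)) / X_rescaled.std(axis=0)
w_std = linear_weights(X_std, y)

print(w_raw, w_std)
```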

2) If you are using a non-linear kernel, the weights only make sense in the higher-dimensional feature space induced by the kernel. For the RBF kernel this space is infinite-dimensional, which makes your life harder. If you were using a polynomial kernel, the weights would still be useful, but some of them would correspond to power terms or interaction terms. Have a look at this post:

How to intuitively explain what a kernel is?
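To make the polynomial case concrete, here is a minimal sketch assuming a homogeneous degree-2 kernel k(x, z) = (x · z)^2 on two input features (the answer does not specify a degree; this choice is only for illustration). The kernel equals an ordinary dot product in a 3-dimensional space whose coordinates are the power terms x1^2 and x2^2 and the interaction term x1*x2, so a weight vector in that space has one entry per such term rather than one per original feature.

```python
import numpy as np

def poly2_kernel(x, z):
    """Homogeneous polynomial kernel of degree 2: (x . z)^2."""
    return float(np.dot(x, z)) ** 2

def poly2_features(x):
    """Explicit feature map for the kernel above, for 2-dimensional inputs."""
    x1, x2 = x
    return np.array([x1 ** 2, np.sqrt(2) * x1 * x2, x2 ** 2])

x = np.array([1.0, 2.0])
z = np.array([3.0, -1.0])

k_value = poly2_kernel(x, z)                                  # kernel trick
dot_in_feature_space = float(poly2_features(x) @ poly2_features(z))

print(k_value, dot_in_feature_space)   # both equal 1.0 up to rounding
```

For the RBF kernel no finite feature map of this kind exists, so there is no finite list of weights to read off.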
