Solved – feature importance using neural network

classification, feature selection, neural networks, python

Is it good practice to estimate feature importance in a neural network by taking the absolute value of the sum of each feature's weights into the first hidden layer?
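For concreteness, here is a minimal sketch of the heuristic being asked about, assuming a dense first layer whose weight matrix has shape (n_features, n_hidden); the matrix values here are made up for illustration:

```python
import numpy as np

# Hypothetical first-layer weight matrix: rows = input features, columns = hidden units.
# (In Keras this would typically come from model.layers[0].get_weights()[0].)
W = np.array([[3.0, 0.0],
              [1.0, 0.0],
              [0.0, 2.0],
              [0.0, 2.0]])

# The heuristic from the question: absolute value of each feature's summed weights.
importance = np.abs(W.sum(axis=1))
print(importance)  # [3. 1. 2. 2.]
```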

Best Answer

Say that all of our features have value 1. Features one and two have weights 3 and 1, respectively; they feed node A, which activates with 1*3 + 1*1 = 4. Features three and four have weights 2 each; they feed node B, which also activates with 1*2 + 1*2 = 4. In the next layer, node A has weight 0.4 and node B has weight 0.6. Is feature one more important than features three and four?
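To make the arithmetic concrete, here is a minimal sketch of that toy two-layer network in NumPy, assuming linear activations for simplicity:

```python
import numpy as np

x = np.ones(4)                      # all four features equal 1
W1 = np.array([[3.0, 0.0],          # feature 1 -> node A
               [1.0, 0.0],          # feature 2 -> node A
               [0.0, 2.0],          # feature 3 -> node B
               [0.0, 2.0]])         # feature 4 -> node B
w2 = np.array([0.4, 0.6])           # weights on nodes A and B in the next layer

hidden = x @ W1                     # [4., 4.]  both hidden nodes activate identically
output = hidden @ w2                # 4*0.4 + 4*0.6 = 4.0
# Feature 1 has the largest first-layer weight (3), yet its path is
# down-weighted by 0.4 in the next layer, while features 3 and 4 go
# through the 0.6 path: first-layer weights alone don't settle importance.
print(hidden, output)
```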

What if there are 7 more layers?

Often, neural networks are used in settings where features interact so heavily that the concept of importance is not really well defined (e.g., pixel data). There is, however, a lot of work on interpreting neural networks.

As far as feature importance goes: if the features truly have distinct importances, it might be worth using a different model that exposes them directly (e.g., LASSO).
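A minimal sketch of that route, using a synthetic dataset and an L1-penalised logistic regression (the LASSO idea applied to classification) purely for illustration:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

# Synthetic data standing in for the real problem (illustration only).
X, y = make_classification(n_samples=500, n_features=10,
                           n_informative=3, random_state=0)
X = StandardScaler().fit_transform(X)   # scale so coefficients are comparable

# Sparse linear model: the L1 penalty shrinks unimportant coefficients to zero.
clf = LogisticRegression(penalty="l1", solver="liblinear", C=0.1)
clf.fit(X, y)

# The surviving (nonzero) coefficients point at the features the model kept.
print(np.abs(clf.coef_).ravel())
```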

With a neural network, one possibility is to shuffle each feature and see what happens to predictive performance; this is how permutation importance is computed for random forests. I have also seen some recent papers where the authors, I think, mask features and check the effect. Another option suggested here is to compute the gradient of the output with respect to the inputs.
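A minimal sketch of the shuffle-and-score (permutation importance) idea, using a small synthetic dataset and scikit-learn's MLPClassifier purely for illustration:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier

# Toy setup so the sketch is self-contained; swap in your own data and model.
X, y = make_classification(n_samples=500, n_features=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
model = MLPClassifier(hidden_layer_sizes=(32,), max_iter=1000,
                      random_state=0).fit(X_train, y_train)

baseline = model.score(X_test, y_test)
rng = np.random.default_rng(0)

for j in range(X_test.shape[1]):
    X_perm = X_test.copy()
    rng.shuffle(X_perm[:, j])                 # break the link between feature j and y
    drop = baseline - model.score(X_perm, y_test)
    print(f"feature {j}: accuracy drop {drop:.3f}")
```

scikit-learn's sklearn.inspection.permutation_importance does the same bookkeeping for you, and repeating the shuffle several times per feature gives more stable estimates.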
