The basic idea when using PCA as a tool for feature selection is to select variables according to the magnitude (from largest to smallest in absolute value) of their coefficients (loadings). You may recall that PCA seeks to replace $p$ (more or less correlated) variables by $k<p$ uncorrelated linear combinations (projections) of the original variables. Let us ignore how to choose an optimal $k$ for the problem at hand. Those $k$ principal components are ranked by importance through their explained variance, and each variable contributes to each component to a varying degree. Using the largest-variance criterion would be akin to feature extraction, where principal components are used as new features instead of the original variables. However, we can decide to keep only the first component and select the $j<p$ variables that have the highest absolute coefficients; the number $j$ might be based on a proportion of the $p$ variables (e.g., keep only the top 10%) or on a fixed cutoff (e.g., a threshold on the normalized coefficients). This approach bears some resemblance to the Lasso operator in penalized regression (or to PLS regression). Neither the value of $j$ nor the number of components to retain is an obvious choice, though.
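To make the idea concrete, here is a minimal sketch in Python/NumPy (rather than R or MATLAB) of selecting the $j$ variables with the largest absolute loadings on the first principal component; the data, the choice $j=3$, and the variable count are purely illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)
# toy data: 100 observations on p = 10 variables
X = rng.normal(size=(100, 10))
Xc = X - X.mean(axis=0)          # centre (standardise in practice)

# loadings of the first principal component via SVD of the centred data
_, _, Vt = np.linalg.svd(Xc, full_matrices=False)
pc1_loadings = Vt[0]             # coefficients of PC1, one per variable

# keep the j variables with the largest absolute loadings on PC1
j = 3
selected = np.argsort(np.abs(pc1_loadings))[::-1][:j]
print(sorted(selected.tolist()))
```

The same ranking could instead be thresholded on the normalized coefficients, as described above.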
The problem with using PCA is that (1) measurements from all of the original variables are used in the projection to the lower-dimensional space, (2) only linear relationships are considered, and (3) PCA- or SVD-based methods, as well as univariate screening methods (t-test, correlation, etc.), do not take into account the potential multivariate nature of the data structure (e.g., higher-order interactions between variables).
About point 1, some more elaborate screening methods have been proposed, for example principal feature analysis or stepwise methods, like the one used for 'gene shaving' in gene expression studies. Also, sparse PCA can be used to perform dimension reduction and variable selection based on the resulting variable loadings. About point 2, kernel PCA (using the kernel trick) can be used if one needs to embed nonlinear relationships into a lower-dimensional space. Decision trees, or better the random forest algorithm, are probably better able to address point 3; the latter allows one to derive Gini- or permutation-based measures of variable importance.
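A short Python/scikit-learn sketch of the random forest route (assuming scikit-learn is available; the synthetic dataset and all parameters below are illustrative, not from the question):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# synthetic data: 5 informative variables hidden among 20
X, y = make_classification(n_samples=300, n_features=20, n_informative=5,
                           n_redundant=0, random_state=0)

rf = RandomForestClassifier(n_estimators=200, random_state=0).fit(X, y)

# Gini-based importances; rank variables from most to least important
ranking = np.argsort(rf.feature_importances_)[::-1]
print(ranking[:5])
```

Permutation-based importances (e.g., `sklearn.inspection.permutation_importance`) are a common alternative when the Gini measure's bias toward high-cardinality variables is a concern.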
A last point: if you intend to perform feature selection before applying a classification or regression model, be sure to cross-validate the whole process (see §7.10.2 of The Elements of Statistical Learning, or Ambroise and McLachlan, 2002).
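Cross-validating the whole process can be sketched in Python with scikit-learn's `Pipeline` (a sketch under the assumption that scikit-learn is available; the noise data, seed, and `k=5` below are illustrative):

```python
import numpy as np
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 50))       # pure noise features
y = rng.integers(0, 2, size=100)     # random labels: true accuracy is 0.5

# WRONG: select features on all of the data, then cross-validate only the model
X_sel = SelectKBest(f_classif, k=5).fit_transform(X, y)
biased = cross_val_score(LogisticRegression(), X_sel, y, cv=5).mean()

# RIGHT: selection sits inside the pipeline and is refit on each training fold
pipe = Pipeline([("select", SelectKBest(f_classif, k=5)),
                 ("clf", LogisticRegression())])
unbiased = cross_val_score(pipe, X, y, cv=5).mean()

# the first estimate is typically optimistic; the second should hover near 0.5
print(biased, unbiased)
```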
As you seem to be interested in an R solution, I would recommend taking a look at the caret package, which includes a lot of handy functions for data preprocessing and variable selection in a classification or regression context.
If you perform feature selection on all of the data and then cross-validate, then the test data in each fold of the cross-validation procedure has also been used to choose the features, and that is what biases the performance analysis.
Consider this example. We generate some target data by flipping a coin 10 times and recording whether it comes down as heads or tails. Next, we generate 20 features by flipping the coin 10 times for each feature and write down what we get. We then perform feature selection by picking the feature that matches the target data as closely as possible and use that as our prediction. If we then cross-validate, we will get an expected error rate slightly lower than 0.5. This is because we have chosen the feature on the basis of a correlation over both the training set and the test set in every fold of the cross-validation procedure. However, the true error rate is going to be 0.5 as the target data is simply random. If you perform feature selection independently within each fold of the cross-validation, the expected value of the error rate is 0.5 (which is correct).
The key idea is that cross-validation is a way of estimating the generalization performance of a process for building a model, so you need to repeat the whole process in each fold. Otherwise, you will end up with a biased estimate, or an under-estimate of the variance of the estimate (or both).
HTH
Here is some MATLAB code that performs a Monte-Carlo simulation of this setup, with 56 features and 259 cases to match your example; the output it gives is:
Biased estimator: erate = 0.429210 (0.397683 - 0.451737)
Unbiased estimator: erate = 0.499689 (0.397683 - 0.590734)
The biased estimator is the one where feature selection is performed prior to cross-validation, the unbiased estimator is the one where feature selection is performed independently in each fold of the cross-validation. This suggests that the bias can be quite severe in this case, depending on the nature of the learning task.
NF = 56;
NC = 259;
NFOLD = 10;
NMC = 1e+4;

% perform Monte-Carlo simulation of biased estimator
erate = zeros(NMC,1);
for i=1:NMC
   y = randn(NC,1) >= 0;
   x = randn(NC,NF) >= 0;
   % perform feature selection on all of the data
   err = mean(repmat(y,1,NF) ~= x);
   [err,idx] = min(err);
   % perform cross-validation
   partition = mod(1:NC, NFOLD)+1;
   y_xval = zeros(size(y));
   for j=1:NFOLD
      y_xval(partition==j) = x(partition==j,idx(1));
   end
   erate(i) = mean(y_xval ~= y);
   plot(erate);
   drawnow;
end
erate = sort(erate);
fprintf(1, '  Biased estimator: erate = %f (%f - %f)\n', mean(erate), erate(ceil(0.025*end)), erate(floor(0.975*end)));

% perform Monte-Carlo simulation of unbiased estimator
erate = zeros(NMC,1);
for i=1:NMC
   y = randn(NC,1) >= 0;
   x = randn(NC,NF) >= 0;
   % perform cross-validation
   partition = mod(1:NC, NFOLD)+1;
   y_xval = zeros(size(y));
   for j=1:NFOLD
      % perform feature selection on the training data for this fold only
      err = mean(repmat(y(partition~=j),1,NF) ~= x(partition~=j,:));
      [err,idx] = min(err);
      y_xval(partition==j) = x(partition==j,idx(1));
   end
   erate(i) = mean(y_xval ~= y);
   plot(erate);
   drawnow;
end
erate = sort(erate);
fprintf(1, 'Unbiased estimator: erate = %f (%f - %f)\n', mean(erate), erate(ceil(0.025*end)), erate(floor(0.975*end)));
Best Answer
I have tried a similar approach and am pretty sure that the k-means clustering step is causing issues. The selection of features depends heavily on the random state, and I'm quite sure that is the problem (at least in my case). It sounds like the feature selection is also slightly random in your case, so that might be causing these weird results. Hope that helps!