I'm new to feature selection and I was wondering how you would use PCA to perform feature selection. Does PCA compute a relative score for each input variable that you can use to filter out noninformative input variables? Basically, I want to be able to order the original features in the data by variance or amount of information contained.
PCA for Feature Selection – Using Principal Component Analysis Effectively
Related Solutions
I suggest reviewing the literature/work by the mixmod group and by Raftery's group. Both have methods for model-based clustering, with and without feature selection. Heuristic methods may be appropriate for your application, but the performance of both the heuristics and the model-based methods tends to be highly influenced by your data inputs and your pre-processing (see the questions below).
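As a concrete illustration, the variable-selection approach from Raftery's group is implemented in the clustvarsel R package (a companion to mclust). Here is a minimal sketch, assuming a hypothetical numeric data frame `X`; the component names are as I recall them, so check `?clustvarsel`:

```r
## Minimal sketch: variable selection for model-based clustering
## (Dean & Raftery approach), assuming a hypothetical numeric data
## frame `X` with one row per observation.
library(clustvarsel)

out <- clustvarsel(X, G = 1:5)  # greedy search over clustering-relevant variables
out$subset                      # variables retained by the search
summary(out$model)              # final mclust model fit on the selected subset
```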
Typically in a business setting you have variables from many different distributions, which poses problems in mixture modeling; you have not specified (a) whether this is the case in your data, and (b) if so, how you wish to deal with it. Another concern is how knowledgeable you are about your data: how confident are you that you can actually select the most important features?
Questions
- What types of variables do you have? What are their distributions?
- What is their correlation structure? (You mentioned poor results from PCA, but without detail.)
- How are you pre-processing your variables?
If you provide additional detail on your data, a more complete answer can be provided.
There is no standard variable selection method for random forests (RF). The absolute variable importance values have no meaning, but their relative sizes can be useful for comparing different predictors. Deciding how many variables to include is somewhat subjective, so several variable selection algorithms have been proposed. A few articles are given below:
For microarray data, Diaz-Uriarte and Alvarez de Andres [1] suggest iteratively fitting RFs, discarding the 20% of variables with the smallest variable importance at each iteration, and choosing the variable set that gives the smallest out-of-bag (OOB) error rate. Genuer et al. [2] recommend a preliminary elimination that removes variables whose importance is below the minimum prediction value given by a CART model; after this preliminary elimination, a nested collection of RF models or a sequence of RF models is used to select the variables (see the paper). Ishwaran et al. [3] propose a new metric called minimal depth, which can be used to select variables since its exact distribution is known. The three aforementioned papers have R packages called varSelRF, VSURF, and randomForestSRC, respectively. These articles are a small subset of the literature addressing variable selection using RFs.
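As a minimal sketch of the first approach, the varSelRF package implements the iterative 20% elimination; `x` (predictor matrix) and `y` (factor response) are hypothetical placeholders here:

```r
## Minimal sketch of Diaz-Uriarte & Alvarez de Andres's backward
## elimination: repeatedly drop the 20% least important variables and
## keep the set with the smallest OOB error rate.
library(varSelRF)

set.seed(42)
rf.sel <- varSelRF(x, y, ntree = 2000, vars.drop.frac = 0.2)
rf.sel$selected.vars  # the selected variables
```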
As a side note, I believe the blog does not use the standard approach to calculating permuted variable importance. I do not know Python that well, but it seems the code permutes each variable in the training sample and compares the permuted and non-permuted prediction errors from the random forest. The standard approach is to permute the variables in the OOB sample and compare the permuted and non-permuted prediction errors in each tree; the final permuted variable importance is the average difference in prediction error. I personally would suggest using R, as there are more tools already available for variable selection with RFs.
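For reference, the standard OOB permutation importance is what the randomForest package reports when the model is fit with `importance = TRUE`; a minimal sketch, with hypothetical `x` and `y`:

```r
## Minimal sketch: OOB permutation importance as implemented in the
## randomForest package (each variable is permuted in each tree's OOB
## sample; the average drop in prediction performance is reported).
library(randomForest)

set.seed(42)
fit <- randomForest(x, y, importance = TRUE, ntree = 1000)
imp <- importance(fit, type = 1)  # type = 1: permutation-based importance
imp[order(imp[, 1], decreasing = TRUE), , drop = FALSE]  # ranked predictors
```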
[1] R. Diaz-Uriarte and S. Alvarez de Andres (2006) Gene selection and classification of microarray data using random forest. BMC Bioinformatics
[2] R. Genuer, J.-M. Poggi, C. Tuleau-Malot (2010) Variable selection using random forests. Pattern Recognition Letters
[3] H. Ishwaran, U.B. Kogalur, E.Z. Gorodeski, A.J. Minn, M.S. Lauer (2010) High-dimensional variable selection for survival data. Journal of the American Statistical Association
Best Answer
The basic idea when using PCA as a tool for feature selection is to select variables according to the magnitude (from largest to smallest in absolute value) of their coefficients (loadings). You may recall that PCA seeks to replace $p$ (more or less correlated) variables by $k<p$ uncorrelated linear combinations (projections) of the original variables. Let us ignore how to choose an optimal $k$ for the problem at hand. Those $k$ principal components are ranked by importance through their explained variance, and each variable contributes to each component to a varying degree. Using the largest variance criterion would be akin to feature extraction, where principal components are used as new features instead of the original variables. However, we can decide to keep only the first component and select the $j<p$ variables that have the highest absolute coefficients; the number $j$ might be based on a proportion of the $p$ variables (e.g., keep only the top 10%) or on a fixed cutoff (e.g., a threshold on the normalized coefficients). This approach bears some resemblance to the Lasso operator in penalized regression (or to PLS regression). Neither the value of $j$ nor the number of components to retain are obvious choices, though.
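A minimal sketch of this loadings-based selection in R, assuming a hypothetical numeric data frame `X` and the top-10% rule mentioned above:

```r
## Minimal sketch: rank variables by |loading| on the first principal
## component and keep the top 10%.
pca <- prcomp(X, scale. = TRUE)           # standardize, then PCA
load1 <- abs(pca$rotation[, 1])           # absolute loadings on PC1
j <- ceiling(0.10 * ncol(X))              # top 10% of the p variables
selected <- names(sort(load1, decreasing = TRUE))[1:j]
selected
```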
The problem with using PCA is that (1) measurements from all of the original variables are used in the projection to the lower-dimensional space, (2) only linear relationships are considered, and (3) PCA- or SVD-based methods, as well as univariate screening methods (t-test, correlation, etc.), do not take into account the potential multivariate nature of the data structure (e.g., higher-order interactions between variables).
About point 1, some more elaborate screening methods have been proposed, for example principal feature analysis or stepwise methods, like the one used for 'gene shaving' in gene expression studies. Also, sparse PCA might be used to perform dimension reduction and variable selection based on the resulting variable loadings. About point 2, it is possible to use kernel PCA (using the kernel trick) if one needs to embed nonlinear relationships into a lower-dimensional space. Decision trees, or better yet the random forest algorithm, are probably better able to address point 3; the latter allows one to derive Gini- or permutation-based measures of variable importance.
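For illustration, sparse PCA is available in the elasticnet package and kernel PCA in kernlab; the following sketch assumes a hypothetical numeric matrix `X` with column names, and the tuning values (number of nonzero loadings, RBF sigma) are arbitrary choices:

```r
## Minimal sketch: sparse PCA (point 1) and kernel PCA (point 2).
library(elasticnet)
library(kernlab)

## Sparse PCA: 2 components, at most 5 nonzero loadings each; the
## variables with nonzero loadings are the "selected" ones.
sp <- spca(X, K = 2, para = c(5, 5), type = "predictor", sparse = "varnum")
rownames(sp$loadings)[sp$loadings[, 1] != 0]  # variables kept by the first component

## Kernel PCA with an RBF kernel to capture nonlinear structure.
kp <- kpca(X, kernel = "rbfdot", kpar = list(sigma = 0.1), features = 2)
head(rotated(kp))  # observations projected on the kernel components
```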
A last point: if you intend to perform feature selection before applying a classification or regression model, be sure to cross-validate the whole process (see §7.10.2 of The Elements of Statistical Learning, or Ambroise and McLachlan, 2002).
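Concretely, the selection step must be repeated inside every fold, never run once on the full data; a minimal sketch with hypothetical `X` (numeric matrix with column names), `y` (factor), LDA as an arbitrary classifier, and the PC1 top-variables rule from above:

```r
## Minimal sketch: cross-validate the *whole* pipeline; the variables
## are re-selected from each training fold only.
library(MASS)  # lda()

set.seed(42)
K <- 10
folds <- sample(rep(1:K, length.out = nrow(X)))
err <- numeric(K)
for (k in 1:K) {
  tr <- folds != k
  load1 <- abs(prcomp(X[tr, ], scale. = TRUE)$rotation[, 1])
  keep <- names(sort(load1, decreasing = TRUE))[1:10]  # selection inside the fold
  fit <- lda(X[tr, keep], grouping = y[tr])
  err[k] <- mean(predict(fit, X[!tr, keep])$class != y[!tr])
}
mean(err)  # honest error estimate for selection + classification combined
```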
As you seem to be interested in an R solution, I would recommend taking a look at the caret package, which includes a lot of handy functions for data preprocessing and variable selection in a classification or regression context.
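For example, caret's recursive feature elimination wraps the selection and the resampling together; a minimal sketch, with hypothetical `X` and `y` and arbitrary subset sizes:

```r
## Minimal sketch: cross-validated recursive feature elimination with
## random forest importance (rfFuncs) in caret.
library(caret)

set.seed(42)
ctrl <- rfeControl(functions = rfFuncs, method = "cv", number = 10)
prof <- rfe(X, y, sizes = c(5, 10, 20), rfeControl = ctrl)
predictors(prof)  # the selected variables
```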