Solved – Testing predictive power of a set of features

bioinformatics, correlation, feature-selection, predictive-models, regression

This is perhaps a typical setup in bioinformatics: we need to build a model to predict a dependent biological variable (say $y$) from a large set of (usually genomic) features $X$, not all of which may be relevant. The size of the training set, $n$, is usually much smaller than the number of potential features, $|X|$. The trick is then to find the subset of features that is most likely to be predictive of the response variable $y$.

Now suppose that, using some prior biological knowledge and pre-processing, I have selected a subset of $X$, say $X'$, deemed to contain the features most relevant for predicting $y$ (note that we still have $|X'| > n$). I want to know how predictive $X'$ is of $y$ on the training set. Of course, I can pick a particular statistical/machine-learning method, say stepwise regression or the LASSO (or any other regression method that can handle $|X'| > n$), and do cross-validation within the training set to get an idea of how predictive $X'$ is of $y$.

However, I was wondering whether I can do this in a non-model-specific way, i.e., with a general metric of how predictive $X'$ is regardless of the model. I don't think simple correlation captures what I need, since there can be many highly correlated variables that are not actually good predictors (spurious correlation).
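For concreteness, here is a minimal sketch of the model-specific baseline I have in mind (cross-validated LASSO on the pre-selected features $X'$, using scikit-learn); the data here are random placeholders, not real genomic features:

```python
import numpy as np
from sklearn.linear_model import LassoCV
from sklearn.model_selection import KFold, cross_val_score

# Placeholder data: n samples, p >> n pre-selected features X' and response y.
rng = np.random.default_rng(0)
n, p = 80, 500
X_prime = rng.normal(size=(n, p))
y = X_prime[:, :5] @ rng.normal(size=5) + rng.normal(scale=0.5, size=n)

# Inner CV (inside LassoCV) picks the penalty; outer CV estimates
# out-of-sample R^2, i.e. how predictive X' is of y under this model.
model = LassoCV(cv=5, max_iter=10000)
outer_cv = KFold(n_splits=5, shuffle=True, random_state=0)
scores = cross_val_score(model, X_prime, y, cv=outer_cv, scoring="r2")
print(f"Mean out-of-sample R^2: {scores.mean():.3f} (+/- {scores.std():.3f})")
```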

Best Answer

I can point you to fast feature selection based on $R^2$ values, i.e. on the squared Pearson correlation coefficient. The method was introduced for multidimensional MEG/EEG data, but it suits any binary machine-learning problem. Put simply: it computes the correlation of each feature with the labels, sorts the resulting scores, and keeps only the best-scoring features.

I implemented the approach for MATLAB and LIBSVM; the code can be found on GitHub. In my implementation you can choose the number of features to select, e.g. 100, 1,000, or everything scoring above the mean of the scores.
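A minimal Python sketch of the same filter idea (this is not the MATLAB/LIBSVM code itself; the top-$k$ and above-the-mean thresholds mirror the two options described above, and the data are placeholders):

```python
import numpy as np

def correlation_filter(X, y, k=None):
    """Score each feature by its squared Pearson correlation with the labels
    and return the indices of the selected features.

    If k is given, keep the k best-scoring features; otherwise keep every
    feature whose score lies above the mean score."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    Xc = X - X.mean(axis=0)
    yc = y - y.mean()
    # Pearson correlation of each column with y, then squared (R^2-style score).
    denom = np.linalg.norm(Xc, axis=0) * np.linalg.norm(yc)
    r = (Xc.T @ yc) / np.where(denom == 0, np.inf, denom)
    scores = r ** 2
    if k is not None:
        return np.argsort(scores)[::-1][:k]        # top-k features
    return np.flatnonzero(scores > scores.mean())  # everything above the mean

# Example with placeholder binary labels (the method targets binary problems).
rng = np.random.default_rng(1)
X = rng.normal(size=(60, 300))
y = (X[:, 0] + 0.5 * X[:, 1] + rng.normal(size=60) > 0).astype(float)
print(correlation_filter(X, y, k=10))   # indices of the 10 best features
print(len(correlation_filter(X, y)))    # how many score above the mean
```

The selected columns of `X` can then be passed to any downstream classifier (LIBSVM in the original implementation), since the filter itself is model-agnostic.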