High Dimensional Data – Correct Workflow for Logistic Regression and Feature Selection

dimensionality reduction, feature selection, logistic, machine learning, r

I have a cancer classification problem (type A vs type B) on radiological images, from which I have generated 756 texture-based predictive features (wavelet transform followed by texture analysis, i.e., features described by Haralick, Amadasun, etc.) and 8 semantic features based on subjective assessment by an expert radiologist. This is entirely for research and publication, to show that these predictive features may be useful in this particular problem. I do not intend to deploy the model for practitioners.

I have 107 cases: 60% are type A and 40% type B (in keeping with their natural proportions in the population). I have done several iterations of model development with varying results. One particular method is giving me 80% training and 80% test classification accuracy, but I am suspicious that it will not stand up to critical analysis. I will outline my method and a few alternatives below, and I would be grateful if someone could point out whether it is flawed. I have used R for this:

Step 1: Split into 71 training and 36 test cases.
Step 2: Remove correlated features from the training dataset (766 -> 240) using the findCorrelation function in R (caret package).
Step 3: Rank the training-data features using the Gini index (CORElearn package).
Step 4: Train multivariate logistic regression models on the top 10 ranked features, using subsets of sizes 3, 4, 5, and 6 in all possible combinations (10C3 = 120, 10C4 = 210, 10C5 = 252, 10C6 = 210). So in total 792 multivariate logistic regression models were trained using 10-fold cross-validation and then tested on the test dataset.
Step 5: Of these I selected the model that gave the best combination of training and test accuracy, i.e., a 3-feature model with 80% training / 80% test accuracy (a rough R sketch of steps 2-5 follows below).
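For concreteness, here is a rough R sketch of steps 2-5; the data-frame names train_df / test_df, the binary factor outcome column type, and the 0.9 correlation cutoff are placeholders rather than my exact code:

    library(caret)      # findCorrelation(), train()
    library(CORElearn)  # attrEval() for Gini-based ranking

    x_train <- train_df[, setdiff(names(train_df), "type")]

    ## Step 2: drop highly correlated features (cutoff is a tuning choice)
    drop_idx <- findCorrelation(cor(x_train), cutoff = 0.9)
    if (length(drop_idx) > 0) x_train <- x_train[, -drop_idx]

    ## Step 3: rank the remaining features by Gini index, keep the top 10
    gini  <- attrEval(type ~ ., data = cbind(x_train, type = train_df$type),
                      estimator = "Gini")
    top10 <- names(sort(gini, decreasing = TRUE))[1:10]

    ## Step 4: one 10-fold cross-validated logistic model per subset of
    ## size 3-6 drawn from the top 10 features
    ctrl    <- trainControl(method = "cv", number = 10)
    subsets <- unlist(lapply(3:6, function(k) combn(top10, k, simplify = FALSE)),
                      recursive = FALSE)
    fits <- lapply(subsets, function(vars)
      train(reformulate(vars, response = "type"), data = train_df,
            method = "glm", family = binomial, trControl = ctrl))

    ## Step 5: compare cross-validated accuracy with held-out test accuracy
    cv_acc   <- sapply(fits, function(f) f$results$Accuracy)
    test_acc <- sapply(fits, function(f)
      mean(predict(f, newdata = test_df) == test_df$type))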

Somehow running nearly 800 model combinations seems quite dodgy to me and seems likely to have introduced some false discovery. I just want to confirm whether this is a valid ML technique, or whether I should skip step 4 and only train on the top 5 ranked features without running any combinations.

Thanks.

PS: I experimented a bit with naive Bayes and random forests but got rubbish test-set accuracy, so I dropped them.

====================

UPDATE

Following discussion with SO members, I have changed the model drastically and have therefore moved the more recent questions regarding model optimisation into a new post: Is my LASSO regularised classification method correct?

Best Answer

I see 3 potential problems with this approach. First, if you intend to use your model for classifying new cases, your variable-selection procedure might lead to a choice of variables too closely linked to peculiarities of this initial data set. Second, the training/test set approach might not be making the most efficient use of the data you have. Third, you might want to reconsider your metric for evaluating models.

First, variable selection tends to find variables that work well for a particular data set but don't generalize well. It's fascinating and frightening to take a variable selection scheme (best subset as you have done, or even LASSO) and see how much the set of selected variables differs just among bootstrap re-samples from the same data set, particularly when many predictors are inter-correlated.
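If you want to see this on your own data, one quick check is to refit, say, a LASSO on a handful of bootstrap resamples and tabulate which features survive each time. A rough sketch, assuming a numeric feature matrix x and a 0/1 outcome y (placeholder names):

    library(glmnet)

    set.seed(1)
    selected <- replicate(20, {
      idx <- sample(nrow(x), replace = TRUE)                            # bootstrap resample
      cv  <- cv.glmnet(x[idx, ], y[idx], family = "binomial", alpha = 1)  # LASSO
      cf  <- as.matrix(coef(cv, s = "lambda.min"))
      setdiff(rownames(cf)[cf[, 1] != 0], "(Intercept)")                # surviving features
    }, simplify = FALSE)

    ## How often each feature is selected across the 20 resamples
    sort(table(unlist(selected)), decreasing = TRUE)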

For this application, where many of your predictors seem to be correlated, you might be better off taking an approach like ridge regression that treats correlated predictors together. Some initial pruning of your 766 features might still be wise (maybe better based on subject-matter knowledge than on automated selection), or you could consider an elastic net hybrid of LASSO with ridge regression to get down to a reasonable number of predictors. But when you restrict yourself to a handful of predictors you risk throwing out useful information from other potential predictors in future applications.
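In glmnet this is just the alpha mixing parameter: alpha = 0 is ridge, alpha = 1 is LASSO, and anything in between is an elastic net. A minimal sketch with the same placeholder x and y:

    library(glmnet)

    ## Elastic net: correlated predictors are shrunk together (ridge-like)
    ## while some coefficients are still set exactly to zero (LASSO-like)
    cv_enet <- cv.glmnet(x, y, family = "binomial", alpha = 0.5)
    coef(cv_enet, s = "lambda.min")   # nonzero rows are the retained features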

Second, you may be better off using the entire data set to build the model and then using bootstrapping to estimate its generalizability. For example, you could use cross-validation on the entire data set to find the best choice of penalty for ridge regression, then apply that choice to the entire data set. You would then test the quality of your model on bootstrap samples of the data set. That approach tends to maximize the information that you extract from the data, while still documenting its potential future usefulness.
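With glmnet, that workflow might look roughly like the sketch below (again, x is a numeric matrix of all 107 cases and y is the 0/1 outcome; both are placeholder names, and 200 bootstrap replicates is just an illustrative number):

    library(glmnet)

    ## Choose the ridge penalty by 10-fold cross-validation on the full data
    cv_ridge <- cv.glmnet(x, y, family = "binomial", alpha = 0)

    ## Crude bootstrap check of generalizability: repeat the whole procedure
    ## (including penalty selection) on each resample, then score the refitted
    ## model on the original data
    set.seed(1)
    boot_dev <- replicate(200, {
      idx <- sample(nrow(x), replace = TRUE)
      cv  <- cv.glmnet(x[idx, ], y[idx], family = "binomial", alpha = 0)
      p   <- predict(cv, newx = x, s = "lambda.min", type = "response")
      -2 * mean(y * log(p) + (1 - y) * log(1 - p))   # mean binomial deviance
    })
    mean(boot_dev)   # compare with the apparent (full-data) deviance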

Third, your focus on classification accuracy makes the hidden assumption that both types of classification errors have the same cost and that both types of classification successes have the same benefit. If you have thought hard about this issue and that is your expert opinion, OK. Otherwise, you might consider a different metric for, say, choosing the ridge-regression penalty during cross-validation. Deviance might be a more generally useful metric, so that you get the best estimates of predicted probabilities and then can later consider the cost-benefit tradeoffs in the ultimate classification scheme.
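In cv.glmnet terms this is just the type.measure argument: "deviance" (the default for family = "binomial") picks the penalty by binomial deviance, while "class" would pick it by misclassification error, i.e., the accuracy-style metric discussed above:

    ## Penalty chosen by deviance (default for binomial models) vs. by
    ## misclassification error
    cv_dev   <- cv.glmnet(x, y, family = "binomial", alpha = 0,
                          type.measure = "deviance")
    cv_class <- cv.glmnet(x, y, family = "binomial", alpha = 0,
                          type.measure = "class")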

In terms of avoiding overfitting, the penalty in ridge regression means that the effective number of variables in the model can be far fewer than the number nominally included. With only 42 cases of the less-common class, you were right to end up with only 3 features (about 14 such cases per selected feature). The penalization provided by ridge regression, if chosen well by cross-validation, will let you combine information from more features in a way that is less dependent on the peculiarities of your present data set, while still avoiding overfitting and generalizing to new cases.
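To make "effective number of variables" concrete: for linear ridge regression with penalty lambda in ||y - Xb||^2 + lambda * ||b||^2, the effective degrees of freedom are sum_i d_i^2 / (d_i^2 + lambda), where d_i are the singular values of the centered and scaled design matrix. For penalized logistic regression this is only an approximation, and note that glmnet's lambda is on a different, per-observation scale, but the idea carries over:

    d <- svd(scale(x))$d                          # singular values of the design
    df_ridge <- function(lambda) sum(d^2 / (d^2 + lambda))
    df_ridge(1); df_ridge(100)                    # effective df shrinks as lambda grows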