Solved – How to tell if the sample size is large enough for reliable feature selection using LASSO regression

lasso, small-sample

I have a gene expression dataset with 20 samples, and I will not be getting any more. There are ~28,000 genes and four clinical covariates associated with each sample. The gene expression values have been Blom transformed and therefore follow a standard normal distribution. The clinical covariates are normally distributed.

I am attempting to identify a sparse set of biologically relevant genes whose expression predicts each clinical covariate. The LASSO and its extensions seem attractive and popular, but I noticed that the selected variables are highly unstable between cross-validation runs. To quantify this, I generated 1000 bootstrap samples and applied the LASSO using R's glmnet package, recording how often each gene was selected in the optimal model (a stripped-down version of the procedure is shown below). For one of my covariates, the most frequently selected gene appeared in only 58% of the bootstrap models, whereas for another covariate the top gene appeared 98% of the time. The best models predict the covariates with very high accuracy ($R^2 > 0.9$), but as far as I can tell the top genes make no biological sense.
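For concreteness, here is roughly what I am doing. The simulated `X` and `y` are stand-ins for my real expression matrix and one clinical covariate, and the dimensions here are illustrative, not my actual ones.

```r
## Bootstrap selection frequencies for the LASSO (sketch; simulated data).
library(glmnet)

set.seed(1)
n <- 20; p <- 1000                 # stand-ins for 20 samples x ~28,000 genes
X <- matrix(rnorm(n * p), n, p)    # Blom-transformed expression would go here
y <- X[, 1] - X[, 2] + rnorm(n)    # one clinical covariate

n_boot <- 1000                     # reduce for a quicker run
sel_count <- numeric(p)

for (b in seq_len(n_boot)) {
  idx  <- sample(n, replace = TRUE)                    # bootstrap resample
  fit  <- cv.glmnet(X[idx, ], y[idx], alpha = 1)       # LASSO, CV-chosen lambda
  beta <- as.numeric(coef(fit, s = "lambda.min"))[-1]  # drop intercept
  sel_count <- sel_count + (beta != 0)                 # tally selected genes
}

sel_freq <- sel_count / n_boot     # per-gene selection frequency
head(sort(sel_freq, decreasing = TRUE), 10)
```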

I am wondering: how can I tell whether my sample size is large enough to make reliable inferences about which genes are primarily associated with my clinical covariates? I've read that the variables selected by the LASSO can be unstable when there is multicollinearity among the features, which I believe is the case with my data. Does this mean that the LASSO is unsuitable for my purposes?

Best Answer

With so many variables, multicollinearity is almost guaranteed. On top of that, with $p \gg n$ the LASSO will select at most $n$ variables (see "If p > n, the lasso selects at most n variables").
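You can see that bound empirically with glmnet on simulated data (a smaller $p$ than yours, just for speed):

```r
## The LASSO path never contains more than n nonzero coefficients when p > n.
library(glmnet)

set.seed(3)
n <- 20; p <- 1000
X <- matrix(rnorm(n * p), n, p)
y <- rnorm(n)

fit <- glmnet(X, y, alpha = 1)
max(fit$df)   # nonzero coefficients along the path; never exceeds n = 20
```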


What I'm currently doing with $p \gg n$ biomedical data is what's called Attribute Bagging [1].

It's the same idea as bagging, except that the classifiers are built on random subsets of the features rather than of the samples, and their predictions are aggregated. The aggregation also yields a feature-importance measure as a by-product: how often each feature is selected, or how strongly it is weighted, across the ensemble.

Coincidentally, I'm also using a linear model (an L2-regularized L1-loss SVM, to be precise), and it's giving me decent results on a regression problem for which I had little hope.

That way you can keep the LASSO and its advantages and still get a much more stable feature-importance metric.
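A minimal sketch of that idea with the LASSO as the base learner; the subset size `k`, the number of bags, and the simulated data are placeholders (in my own setup the base learner is the SVM mentioned above):

```r
## Attribute bagging: fit the LASSO on random feature subsets and
## aggregate how often each feature is selected when it is offered.
library(glmnet)

set.seed(2)
n <- 20; p <- 1000
X <- matrix(rnorm(n * p), n, p)
y <- X[, 1] - X[, 2] + rnorm(n)

n_bags <- 500
k      <- 50                        # features per bag (placeholder choice)
seen   <- numeric(p)                # times each feature was offered
chosen <- numeric(p)                # times each feature was selected

for (b in seq_len(n_bags)) {
  feats <- sample(p, k)                             # random feature subset
  fit   <- cv.glmnet(X[, feats], y, alpha = 1)      # LASSO on the subset
  beta  <- as.numeric(coef(fit, s = "lambda.min"))[-1]
  seen[feats]   <- seen[feats] + 1
  chosen[feats] <- chosen[feats] + (beta != 0)
}

importance <- ifelse(seen > 0, chosen / seen, 0)    # selection rate when offered
head(order(importance, decreasing = TRUE), 10)      # top-ranked features
```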


[1] Bryll, R., Gutierrez-Osuna, R., & Quek, F. (2003). Attribute bagging: Improving accuracy of classifier ensembles by using random feature subsets. Pattern Recognition, 36(6), 1291-1302.