Is Individual Coefficient Significance with Ridge or Lasso Possible When the Number of Variables Exceeds the Number of Observations?

high-dimensional, lasso, ridge-regression, selective-inference, statistical-significance

First, to introduce my situation: I have a dataset containing n = 16 observations and p = 17 variables. The variable set consists of 16 independent variables (14 variables I'm interested in and two serving as control variables) and one outcome variable. I want to perform a regression analysis to see which of the 14 independent variables are statistically significant with respect to the outcome variable.

The well-known problem here is that p > n, which makes OLS regression unusable. During my research I looked into regularization, in particular shrinkage methods (e.g. ridge/lasso regression) and dimension reduction, as these approaches tackle the problem of having too few observations for too many variables. But they are usually used for model fitting or prediction, which is not quite my focus in this scenario.
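For illustration, here is a small toy example with simulated data (not my real dataset) of what I mean by OLS being unusable: with as many predictors as observations, least squares interpolates the outcome perfectly, leaving nothing for the usual inference.

```python
# Toy illustration with simulated data (not my real dataset): once the number of
# predictors reaches n, ordinary least squares fits the sample perfectly and the
# coefficients are useless for inference.
import numpy as np

rng = np.random.default_rng(1)
n, p = 16, 16                        # 16 observations, 16 predictors
X = rng.normal(size=(n, p))
y = rng.normal(size=n)               # pure-noise outcome

beta, _, rank, _ = np.linalg.lstsq(X, y, rcond=None)
print("rank of X:", rank)                                       # at most n
print("max |in-sample residual|:", np.abs(X @ beta - y).max())  # ~ 0
```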

In their 2013 book An Introduction to Statistical Learning, James et al. mention that in a p > n (high-dimensional) setting the multicollinearity problem is severe: it inflates the variances of the coefficients, making the usual statistical inference (tests, R² statistics, etc.) inapplicable, even when ridge, lasso, or similar methods are applied.

Searching through older posts here, some (e.g. Ridge regression in R with p values and goodness of fit) show that there are methods to calculate p-values from ridge or lasso for the individual significance of the variables, while others (https://stats.stackexchange.com/q/276266) argue against the idea.

I'm now confused about whether or not I can use ridge or lasso for my problem. And if not, are there other methods for assessing individual variable significance in a high-dimensional setting that I haven't thought of? I'm thankful for any advice I can get.

Best Answer

The main problem is that you don't have enough cases to accomplish what you seek. If the outcome variable is continuous, so that you might have used ordinary least squares in the $n>p$ situation, you generally need about 15 cases per predictor that you are evaluating to get reliable results that might apply to another data sample from the same population. With two predictors already serving as "control variables," you would be pushing the limit even if you had $n>p$ with $n=16$.

Yes, you can in principle use the penalization provided by either ridge or LASSO to deal with the $p>n$ problem. But you should be very careful in trying to evaluate "statistical significance" of any coefficient estimates you get. Chapter 6 of Statistical Learning with Sparsity goes into detail about the issues with LASSO. You can get significance estimates for a fixed choice of penalty parameter, but it's not clear to me how to account for the use of the data to choose that penalty. My answer here, which you cite as supporting significance testing for ridge and LASSO, did suggest a particular form of bootstrapping for ridge (not LASSO) as a possibility, but mostly argued against trying to evaluate anything other than predictive ability.
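For concreteness, here is a minimal sketch of such a penalized fit with $p \ge n$, using simulated data of the same shape as in the question (scikit-learn here, though any equivalent R package would do). Note that neither fit produces p-values on its own; the penalty is simply chosen by cross-validation.

```python
# Minimal sketch with simulated data (same shape as the question's): ridge and
# lasso both fit fine with p >= n, but they report coefficients only, no p-values.
import numpy as np
from sklearn.linear_model import RidgeCV, LassoCV

rng = np.random.default_rng(0)
n, p = 16, 16
X = rng.normal(size=(n, p))
y = 2.0 * X[:, 0] + rng.normal(size=n)    # toy outcome driven by one predictor

alphas = np.logspace(-3, 3, 50)           # candidate penalty strengths
ridge = RidgeCV(alphas=alphas).fit(X, y)  # leave-one-out CV by default
lasso = LassoCV(cv=5, random_state=0).fit(X, y)

print("ridge coefficients:", np.round(ridge.coef_, 3))
print("lasso coefficients:", np.round(lasso.coef_, 3))
```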

With so few cases your best bet would be ridge, which keeps all the predictors and thus avoids the variable-selection problem of LASSO, whose particular set of selected predictors is likely to be unstable across data samples. You should repeat the modeling with multiple bootstrap samples to show yourself how variable the results can be, depending on the data sample. But with so few cases you shouldn't put much faith in the results of any "significance" test; instead, use these data as a pilot study to inform further work.
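To make the bootstrap suggestion concrete, here is a rough sketch, reusing the simulated `X`, `y`, `alphas`, and `rng` from the snippet above; it only shows how much the ridge coefficients move across resamples and is not a formal significance test.

```python
# Rough sketch of the bootstrap check: refit ridge on resampled cases and look
# at how widely each coefficient varies. Reuses X, y, alphas, rng from above;
# this visualizes instability, it is not a formal significance test.
import numpy as np
from sklearn.linear_model import RidgeCV

n_boot = 200
boot_coefs = np.empty((n_boot, X.shape[1]))
for b in range(n_boot):
    idx = rng.integers(0, len(y), size=len(y))   # resample cases with replacement
    boot_coefs[b] = RidgeCV(alphas=alphas).fit(X[idx], y[idx]).coef_

lo, hi = np.percentile(boot_coefs, [2.5, 97.5], axis=0)  # spread per coefficient
for j in range(X.shape[1]):
    print(f"x{j:02d}: 2.5% = {lo[j]:+.2f}, 97.5% = {hi[j]:+.2f}")
```

Wide intervals across refits are the point of the exercise: they show directly how little any single-sample coefficient estimate can be trusted with $n = 16$.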

For further guidance, read Frank Harrell's course notes, with particular attention to Chapter 4 on Multivariable Modeling Strategies.
