Adaptive LASSO – Confidence Interval and Sample Size

categorical-data, glmnet, lasso, regression, sample-size

I have almost no experience with math or statistics, but I am trying to run an adaptive lasso on a continuous outcome with around 200 cases and around 19 candidate variables. Some of these variables have 3 categories. My questions are:

  1. Is the sample large enough to use an adaptive lasso?
  2. Do I need separate training and test data, or can I run the adaptive lasso on all 200 cases?
  3. How does adaptive lasso handle variables with more than 2 categories?
  4. How can we get confidence intervals for coefficients? Does that even make sense?

Best Answer

Penalizing coefficients with methods like lasso, adaptive lasso, and ridge regression means that you can model data even when the number of predictors exceeds the number of observations. You certainly have enough cases to use adaptive lasso, although this doesn't mean that the results will necessarily be as good as you might get with a larger data set.

If you had 100 times as many cases you might consider a train/test split. With a data set of this size, splitting only leads to trouble: it leaves fewer cases for fitting and gives an imprecise estimate of test performance. Instead, you can validate your model-building process by repeating it on multiple bootstrap samples of your data and evaluating those models on the full data set.
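The bootstrap check described above can be sketched roughly as follows. This is a minimal illustration, not a full recipe: the data are synthetic, and `fit_line` (a hypothetical helper, not from any library) stands in for the entire model-building process, which in your case would be the adaptive lasso including the choice of penalty.

```python
import random

def fit_line(rows):
    """Least-squares fit of y = a + b*x. Stand-in for the full model-building process."""
    n = len(rows)
    mean_x = sum(x for x, _ in rows) / n
    mean_y = sum(y for _, y in rows) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in rows)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in rows)
    slope = sxy / sxx if sxx > 0 else 0.0
    return mean_y - slope * mean_x, slope

def mse(model, rows):
    """Mean squared prediction error of a fitted (intercept, slope) pair on rows."""
    a, b = model
    return sum((y - (a + b * x)) ** 2 for x, y in rows) / len(rows)

def bootstrap_optimism(rows, n_boot=200, seed=0):
    """Average gap between full-data error and bootstrap-sample error."""
    rng = random.Random(seed)
    gaps = []
    for _ in range(n_boot):
        sample = [rng.choice(rows) for _ in rows]  # resample cases with replacement
        model = fit_line(sample)                   # repeat the whole fitting process
        # error on the full data minus error on the sample the model was fit to
        gaps.append(mse(model, rows) - mse(model, sample))
    return sum(gaps) / n_boot

# Synthetic data with 200 cases (assumed values, for illustration only)
rng = random.Random(1)
xs = [rng.uniform(0, 10) for _ in range(200)]
data = [(x, 2.0 + 0.5 * x + rng.gauss(0, 1)) for x in xs]

apparent = mse(fit_line(data), data)  # error of the model on the data it was fit to
optimism = bootstrap_optimism(data)   # estimated over-optimism of that number
corrected = apparent + optimism       # optimism-corrected error estimate
print(apparent, optimism, corrected)
```

All 200 cases are used both for fitting and for validation, so nothing is held out. The important part is that every step of model building is repeated inside the loop, not just the final fit.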

Categorical predictors have to be handled carefully in penalized regressions, although there might be some simplification with adaptive lasso.

First, with standard lasso and ridge regression you want all predictors to be on comparable scales, because you penalize all regression coefficients equally according to their magnitudes (lasso) or squared magnitudes (ridge). For continuous predictors that's accomplished by scaling to unit variance. But there's no single simple way to put categorical predictors on scales comparable to each other or to continuous predictors. Adaptive lasso additionally weights each coefficient's penalty inversely to the magnitude of its initial estimate, which might reduce that problem.
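To make the scaling and weighting concrete, here is a dependency-free sketch of adaptive lasso on synthetic data. Several things are assumptions for illustration: the data, the penalty `lam=0.1`, the exponent `gamma=1`, the use of an unpenalized least-squares fit as the initial estimate (a ridge fit is another common choice), and the helper names, which are not from any library.

```python
import math
import random

def standardize(columns):
    """Center each predictor column and scale it to unit variance."""
    out = []
    for col in columns:
        mean = sum(col) / len(col)
        sd = math.sqrt(sum((v - mean) ** 2 for v in col) / len(col))
        out.append([(v - mean) / sd for v in col])
    return out

def soft_threshold(z, t):
    """Shrink z toward zero by t; return exactly 0 when |z| <= t."""
    if z > t:
        return z - t
    if z < -t:
        return z + t
    return 0.0

def weighted_lasso(columns, y, lam, weights, n_iter=200):
    """Coordinate descent for RSS/(2n) + lam * sum_j weights[j] * |beta_j|.
    Assumes y is centered and columns are standardized (mean 0, variance 1)."""
    n = len(y)
    beta = [0.0] * len(columns)
    resid = list(y)
    for _ in range(n_iter):
        for j, col in enumerate(columns):
            # covariance of column j with the residual, with beta_j's own effect added back
            rho = sum(c * r for c, r in zip(col, resid)) / n + beta[j]
            new = soft_threshold(rho, lam * weights[j])
            if new != beta[j]:
                delta = new - beta[j]
                resid = [r - delta * c for r, c in zip(resid, col)]
                beta[j] = new
    return beta

# Synthetic data: 200 cases, 4 predictors, only the first two matter
rng = random.Random(0)
n = 200
raw = [[rng.gauss(0, 1) for _ in range(n)] for _ in range(4)]
raw[1] = [10 * v for v in raw[1]]  # put one predictor on a very different scale
y_raw = [3.0 * raw[0][i] + 0.2 * raw[1][i] + rng.gauss(0, 1) for i in range(n)]

X = standardize(raw)
y_mean = sum(y_raw) / n
y = [v - y_mean for v in y_raw]

# Step 1: initial estimates (lam=0 reduces coordinate descent to least squares)
beta_init = weighted_lasso(X, y, lam=0.0, weights=[1.0] * len(X))

# Step 2: penalty weights inversely proportional to the initial magnitudes
gamma = 1.0
adaptive_weights = [1.0 / max(abs(b), 1e-8) ** gamma for b in beta_init]

# Step 3: lasso with per-coefficient weights
beta_adaptive = weighted_lasso(X, y, lam=0.1, weights=adaptive_weights)
print(beta_init)
print(beta_adaptive)
```

Coefficients with small initial estimates get large weights and are pushed to exactly zero more readily, while large initial estimates are penalized only lightly. In glmnet, per-coefficient weights of this kind are usually supplied through the `penalty.factor` argument.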

Second, simple lasso by itself doesn't know that multiple regression coefficients correspond to the same multi-category predictor. You can specify that with the group lasso. So if you want all coefficients associated with a multi-category predictor to be retained or excluded together, you would need to use an adaptive group lasso.
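A small sketch may help show why this matters for a 3-category variable. The variable `severity`, the coefficient values, and the helper functions are all hypothetical, and the square-root-of-group-size factor is one common convention for the group penalty rather than the only one.

```python
import math

def dummy_code(values, reference):
    """Treatment (dummy) coding: one 0/1 column per non-reference level."""
    levels = sorted(set(values) - {reference})
    return {level: [1.0 if v == level else 0.0 for v in values] for level in levels}

def lasso_penalty(beta):
    """Sum of absolute values: each coefficient is penalized on its own."""
    return sum(abs(b) for b in beta)

def group_lasso_penalty(beta, groups):
    """Sum over groups of sqrt(group size) * Euclidean norm of the group's coefficients."""
    total = 0.0
    for idx in groups:
        norm = math.sqrt(sum(beta[j] ** 2 for j in idx))
        total += math.sqrt(len(idx)) * norm
    return total

# A 3-category predictor becomes 2 separate columns in the design matrix
severity = ["low", "mid", "high", "mid", "low", "high"]
columns = dummy_code(severity, reference="low")
print(sorted(columns))  # the two non-reference levels

# Hypothetical coefficients: [continuous x, severity=high, severity=mid]
beta = [0.8, 0.0, 0.5]
groups = [[0], [1, 2]]  # the two dummy columns form one group

print(lasso_penalty(beta))
print(group_lasso_penalty(beta, groups))
```

Because the plain lasso penalty is a sum over individual coefficients, it can set the coefficient for one level to zero while keeping the other, as in the `beta` above. The group penalty depends only on the norm of the whole group, so the group drops out of the model only when all of its coefficients are zero together.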
