R – Get Odds Ratios with Confidence Intervals from a Lasso Regression Model

lasso, r, regression, regression coefficients

I am trying to understand lasso regression. So far, I understand that it can be used to shrink regression coefficients when there are few events; the coefficients of some covariates are even shrunk to zero. By shrinking the coefficients, the issue of overfitting is addressed. That is how I understand this paper: BMJ. 2016 Jun 8;353:i3235.

I want to fit a logistic regression model to predict a future event. Unfortunately, the available data are sparse and we have only 40 events. This time I thought I would use lasso regression instead of stepwise backward selection. However, I do not know how to get odds ratios with their 95% CIs for the covariates retained in the lasso regression model.

I found a working example on this web page, which I modified for my question:
http://www.sthda.com/english/articles/36-classification-methods-essentials/149-penalized-logistic-regression-essentials-in-r-ridge-lasso-and-elastic-net/

Here is my example code:

library(dplyr)
library(caret)
library(glmnet)

set.seed(123) 

#####################################
# Load the data and remove NAs
#####################################
data("PimaIndiansDiabetes2", package = "mlbench")
PimaIndiansDiabetes2 <- na.omit(PimaIndiansDiabetes2)

# Inspect the data
sample_n(PimaIndiansDiabetes2, 3)

# Split the data into training and test set
set.seed(123)
training.samples <- PimaIndiansDiabetes2$diabetes %>% 
  createDataPartition(p = 0.8, list = FALSE)
train.data  <- PimaIndiansDiabetes2[training.samples, ]
test.data <- PimaIndiansDiabetes2[-training.samples, ]

# Dummy code categorical predictor variables
x <- model.matrix(diabetes~., train.data)[,-1]
head(x)

# Convert the outcome (class) to a numerical variable
y <- ifelse(train.data$diabetes == "pos", 1, 0)

#####################################
# Find the best lambda using cross-validation
#####################################
cv.lasso <- cv.glmnet(x, y, alpha = 1, family = "binomial")
cv.lasso
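
# Optional: inspect the cross-validation curve. Besides lambda.min,
# cv.glmnet also reports lambda.1se (the largest lambda within one
# standard error of the minimum CV error), a more conservative choice
plot(cv.lasso)
cv.lasso$lambda.min
cv.lasso$lambda.1se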

#####################################
# Final model with lambda.min
#####################################
lasso.model <- glmnet(x, y, alpha = 1, family = "binomial", lambda = cv.lasso$lambda.min)
lasso.model
coef(cv.lasso, cv.lasso$lambda.min)
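
# Optional: since the lasso model is primarily a prediction tool, evaluate
# it on the held-out test set (0.5 classification cutoff assumed here)
x.test <- model.matrix(diabetes ~ ., test.data)[, -1]
probabilities <- predict(lasso.model, newx = x.test, type = "response")
predicted.classes <- ifelse(probabilities > 0.5, "pos", "neg")
mean(predicted.classes == test.data$diabetes)  # test-set accuracy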

#####################################
# Standard model
#####################################
mylogit <- glm(diabetes ~ pregnant + glucose + pressure + triceps + insulin + mass + pedigree + age, data = train.data, family = "binomial")
summary(mylogit)

#                   lasso     standard
# (Intercept) -7.57530592     -9.503717
# pregnant     0.02027797     0.045710
# glucose      0.03350748     0.042303
# pressure     .              -0.007004      
# triceps      0.01349090     0.018578
# insulin      .              -0.001592
# mass         0.02172361     0.045017
# pedigree     0.57491081     0.968452
# age          0.03302719     0.042557

#####################################
# Get OR plus 95% CI from standard model
#####################################
exp(cbind(OR = coef(mylogit), confint(mylogit)))

Best Answer

I think there is a misunderstanding here: regularized regression methods like the lasso and ridge regression are meant for predicting individual data points (ones not used in fitting the coefficients), not for inference on the coefficients. The very purpose of regularization in the lasso is to bias the coefficients towards zero in order to obtain better predictive performance. See this discussion on the difference between predictive and inferential statistics.
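
To make this concrete with the code from the question: you can exponentiate the lasso coefficients to obtain odds-ratio point estimates, but glmnet does not report standard errors or confidence intervals for them, precisely because the penalized estimates are biased and the usual sampling theory does not apply. A minimal sketch, reusing lasso.model from above:

# Odds-ratio point estimates from the lasso fit; no standard errors or
# CIs are provided by glmnet for the penalized coefficients
lasso.coefs <- as.matrix(coef(lasso.model))
exp(lasso.coefs)  # covariates shrunk exactly to zero appear as OR = 1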

If your research goal is to interpret the coefficients, you may resort to regular logistic regression. Given the problem of a small sample size and a large number of potential predictors, I see two established solutions:

  1. Conclude that your inferences are exploratory (meant for guiding follow-up research, not for drawing firm conclusions about the phenomenon in question). You may pre-select predictors based on theory, but you should say that you did so when reporting your results.
  2. Get more data or, even better, a separate validation sample to test the robustness of your initial model (see the bootstrap sketch after this list).
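
As a rough, purely descriptive check on point 2 when extra data are not available, one could bootstrap the entire selection step and record how often each covariate is retained. This is only a sketch (the number of replications and the use of lambda.min are my own choices), the resulting selection frequencies are not confidence intervals, and it reuses x and y from the question:

# Bootstrap the lasso selection step and record how often each covariate
# is retained (descriptive stability check, not formal inference)
set.seed(123)
B <- 100                                    # number of bootstrap resamples
selected <- matrix(0, nrow = B, ncol = ncol(x),
                   dimnames = list(NULL, colnames(x)))
for (b in 1:B) {
  idx <- sample(nrow(x), replace = TRUE)    # resample rows with replacement
  cv.b <- cv.glmnet(x[idx, ], y[idx], alpha = 1, family = "binomial")
  beta.b <- coef(cv.b, s = "lambda.min")[-1, 1]   # drop the intercept
  selected[b, ] <- as.numeric(beta.b != 0)
}
colMeans(selected)   # proportion of resamples in which each covariate is kept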

PS: In any case, avoid stepwise procedures at all costs.
