LASSO Regression Using glmnet for Binary Outcome – Practical Example

lassorself-study

I am starting to dabble with the use of glmnet with LASSO Regression where my outcome of interest is dichotomous. I have created a small mock data frame below:

age     <- c(4, 8, 7, 12, 6, 9, 10, 14, 7) 
gender  <- c(1, 0, 1, 1, 1, 0, 1, 0, 0)
bmi_p   <- c(0.86, 0.45, 0.99, 0.84, 0.85, 0.67, 0.91, 0.29, 0.88)
m_edu   <- c(0, 1, 1, 2, 2, 3, 2, 0, 1)
p_edu   <- c(0, 2, 2, 2, 2, 3, 2, 0, 0)
f_color <- c("blue", "blue", "yellow", "red", "red", "yellow", "yellow", 
             "red", "yellow")
asthma  <- c(1, 1, 0, 1, 0, 0, 0, 1, 1)
# df is a data frame for further use!
df <- data.frame(age, gender, bmi_p, m_edu, p_edu, f_color, asthma)

The columns (variables) in the above dataset are as follows:

age (age of child in years) – continuous
gender – binary (1 = male; 0 = female)
bmi_p (BMI percentile) – continuous
m_edu (mother highest education level) – ordinal (0 = less than high school; 1 = high school diploma; 2 = bachelors degree; 3 = post-baccalaureate degree)
p_edu (father highest education level) – ordinal (same as m_edu)
f_color (favorite primary color) – nominal ("blue", "red", or "yellow")
asthma (child asthma status) – binary (1 = asthma; 0 = no asthma)

The goal of this example is to make use of LASSO to create a model predicting child asthma status from the list of 6 potential predictor variables (age, gender, bmi_p, m_edu, p_edu, and f_color). Obviously the sample size is an issue here, but I am hoping to gain more insight into how to handle the different types of variables (i.e., continuous, ordinal, nominal, and binary) within the glmnet framework when the outcome is binary (1 = asthma; 0 = no asthma).

As such, would anyone being willing to provide a sample R script along with explanations for this mock example using LASSO with the above data to predict asthma status? Although very basic, I know I, and likely many others on CV, would greatly appreciate this!

Best Answer

library(glmnet)

age     <- c(4, 8, 7, 12, 6, 9, 10, 14, 7) 
gender  <- as.factor(c(1, 0, 1, 1, 1, 0, 1, 0, 0))
bmi_p   <- c(0.86, 0.45, 0.99, 0.84, 0.85, 0.67, 0.91, 0.29, 0.88) 
m_edu   <- as.factor(c(0, 1, 1, 2, 2, 3, 2, 0, 1))
p_edu   <- as.factor(c(0, 2, 2, 2, 2, 3, 2, 0, 0))
f_color <- as.factor(c("blue", "blue", "yellow", "red", "red", "yellow", 
                       "yellow", "red", "yellow"))
asthma <- c(1, 1, 0, 1, 0, 0, 0, 1, 1)

xfactors <- model.matrix(asthma ~ gender + m_edu + p_edu + f_color)[, -1]
x        <- as.matrix(data.frame(age, bmi_p, xfactors))

# Note alpha=1 for lasso only and can blend with ridge penalty down to
# alpha=0 ridge only.
glmmod <- glmnet(x, y=as.factor(asthma), alpha=1, family="binomial")

# Plot variable coefficients vs. shrinkage parameter lambda.
plot(glmmod, xvar="lambda")

enter image description here

Categorical variables are usually first transformed into factors, then a dummy variable matrix of predictors is created and along with the continuous predictors, is passed to the model. Keep in mind, glmnet uses both ridge and lasso penalties, but can be set to either alone.

Some results:

# Model shown for lambda up to first 3 selected variables.
# Lambda can have manual tuning grid for wider range.

glmmod
# Call:  glmnet(x = x, y = as.factor(asthma), family = "binomial", alpha = 1) 
# 
#        Df    %Dev   Lambda
#   [1,]  0 0.00000 0.273300
#   [2,]  1 0.01955 0.260900
#   [3,]  1 0.03737 0.249000
#   [4,]  1 0.05362 0.237700
#   [5,]  1 0.06847 0.226900
#   [6,]  1 0.08204 0.216600
#   [7,]  1 0.09445 0.206700
#   [8,]  1 0.10580 0.197300
#   [9,]  1 0.11620 0.188400
#  [10,]  3 0.13120 0.179800
#  [11,]  3 0.15390 0.171600
# ...

Coefficients can be extracted from the glmmod. Here shown with 3 variables selected.

coef(glmmod)[, 10]
#   (Intercept)           age         bmi_p       gender1        m_edu1 
#    0.59445647    0.00000000    0.00000000   -0.01893607    0.00000000 
#        m_edu2        m_edu3        p_edu2        p_edu3    f_colorred 
#    0.00000000    0.00000000   -0.01882883    0.00000000    0.00000000 
# f_coloryellow 
#   -0.77207831

Lastly, cross validation can also be used to select lambda.

cv.glmmod <- cv.glmnet(x, y=asthma, alpha=1)
plot(cv.glmmod)

enter image description here

(best.lambda <- cv.glmmod$lambda.min)
# [1] 0.2732972

Related Solutions

Solved – Using glmnet to solve the LASSO problem

glmnet is more setup to solve the first formulation of the L1 penalty instead of the size constraint. glmnet has a fairly unique method of solving the problem, but is probably the best out there. For most applications, you just feed it your design matrix in the x parameter and the response in the y parameter. The x needs to already have factors built into dummies/contrasts. I would usually recommend doing full dummy coding instead of the default treatment coding unless you have a true reference category to shrink towards.

You need to then set your alpha to choose the balance between Lasso vs Ridge. Alpha = 1 will give you full LASSO. Usually you would want alpha=0.95 to have a touch of ridging to handle any lingering multicollinearity and just give a bit better conditioning to the problem in general. If you're more focused on predictive accuracy than parsimony then setting alpha much further towards the ridge side of things is usually better (Alpha ~ 0.05).

glmnet will then fit a whole string of lambda values. It starts with the smallest lambda that produces an intercept only model and then slowly lowers the lambda by ~8 orders of magnitude, making a handful of discrete estimates along the way. You can control how deep it goes and how many stops via some lambda specific parameters.

Lastly, you'll probably want to use the cv.glmnet function to pick a good value for lambda. The cross validation will give you a good estimate of lambda to use when making predictions on new data. Make sure the resulting lambda sequence reaches a minimum. It usually produces a nice visual of the bias-variance tradeoff, but sometimes you'll have to create a longer sequence of lambdas if not much penalty is really needed. Just be sure your custom sequence starts with an intercept-only model and progresses smoothly like the default sequence.

Of minor notes, do be sure you choose the correct distribution for your response. And do know that the design matrix can be passed in sparse format if you have some factors with many levels. And read the useful vignettes available on the CRAN glmnet webpage. Good luck!

Update - Dummy Coding Contrast snippet:

There is likely a better way, but this should get the job done. This changes the default factor encoding in formulas to be full dummy coding:

> orig.contrasts <- options('contrasts')[[1]]

> contr.dummy <- function(...,contrasts=captured) contr.treatment(...,contrasts=FALSE)

> cust.contrasts <- c('contr.dummy','contr.poly')

> names(cust.contrasts) <- names(orig.contrasts)

> options(contrasts=cust.contrasts)

> options('contrasts')

Solved – model selection through shrinkage (Lasso) using glmnet

Try replacing

x <- as.matrix(subset(d, select=Cs(rx, sex, age, obstruct, perfor, adhere, surg)))

with

x <- model.matrix( ~ rx + sex + age + obstruct + perfor + adhere + surg - 1, d)

The latter will convert the convert the factor rx to numeric dummy variables in the matrix x.

Best Answer

Related Solutions

Solved – Using glmnet to solve the LASSO problem

Solved – model selection through shrinkage (Lasso) using glmnet

Related Question