Solved – How to analyze elastic net fitted model coefficients

elastic netfeature selectionlogisticr

SOLVED: an elastic net model, as any other logistic regression model, will not generate more coefficients than input variables. Check Zach's answer to understand how from an (apparent) low number of inputs, more coefficients can be generated. The cause of this question was a code bug, as the users pointed out.

This is a simple question. I've fitted a model with 1334 variables using elastic net to perform feature selection and regularization. I'm now trying to interpret the obtained coefficients in order to find correlations between the input variables and the output. The only problem is that instead of the (expected) 1335 coefficients (intercept+1334), extracting the coefficients through coef(model,s="lambda.min") yields around 1390 coefficients. This seems highly counterintuitive and stops me from mapping a single coefficient to a single input variable, so I suppose I'm not understanding some of the insides of the elastic net. Any idea would be very helpful. Thanks in advance.

PS: just in case someone wonders so, I've not included interaction terms nor any synthetic variable, just the original 1334 ones.

PS2: elastic net references:

PS3: about the code used to fit the model:

it is a 250 line script, so unless you specifically need it, I think it'd only clutter the question. Basically, the algorithm takes as an input a data frame of 1393 colums, where the last one is the target variable and the first 1392 are the input variables. So, after separating those into two matrices, input and output, the actual model fitting is done in this call:

cv.glmnet(x=input_matrix,y=output_matrix,family="binomial",type.measure="auc")

If you need to, I can actually generate a reproducible file with the data I use and the whole script.

Best Answer

It's very likely you have some categorical variables in your input dataset that are being converted to dummy variables. For example, here is a model with four x variables: "Sepal.Width", "Petal.Length", "Petal,Width", and "Species", and one y variable: "Sepal.Length"

library(glmnet)
data(iris)
x <- model.matrix(Sepal.Length~., iris)[,-1]
y <- iris$Sepal.Length
m <- cv.glmnet(x, y)

Surprisingly, we get 6 coefficients, instead of the expected 5:

>coef(m, m$lambda.min)
6 x 1 sparse Matrix of class "dgCMatrix"
                           1
(Intercept)        2.1670759
Sepal.Width        0.5032347
Petal.Length       0.8137398
Petal.Width       -0.3127065
Speciesversicolor -0.6763395
Speciesvirginica  -0.9595409

However, after some investigation we realize the number of coefficients matches the number of columns in our input matrix:

> nrow(coef(m, m$lambda.min)) == ncol(x) + 1
[1] TRUE

Furthermore, this is the same number of coefficients we get from a glm model:

> t(t(coef(glm(Sepal.Length~., data=iris))))
                        [,1]
(Intercept)        2.1712663
Sepal.Width        0.4958889
Petal.Length       0.8292439
Petal.Width       -0.3151552
Speciesversicolor -0.7235620
Speciesvirginica  -1.0234978

/edit: Here is and example with binary data. Note that the glmnet model produces the same number of coefficients as a glm model, and both of them produce the expected number of coefficients. Check your code with a glm. Also note that both of these examples were reproducible: if there WAS a bug present in glmnet (or worse yet glm) my example code would provide the package authors (or the core R team) the first step in identifying and fixing the bug.

#Setup a glmnet problem
set.seed(42)
ncol <- 10
nrow <- 1000
x <- matrix(sample(c(TRUE, FALSE), ncol*nrow, replace=TRUE), ncol=ncol)
cf <- runif(ncol(x)) * sample(0:1, ncol(x), replace=TRUE)
y <- rowSums(x %*% cf) + runif(nrow(x))/10
y <- matrix(as.integer(y>=mean(y)), ncol=1)

#Fit the model
m <- cv.glmnet(x=x,y=y,family="binomial",type.measure="auc")
coef(m, m$lambda.min)

Check the number of coefficients

> nrow(coef(m, m$lambda.min)) == ncol(x) + 1
[1] TRUE

Check the number of coefficients from a glm

> nrow(coef(m, m$lambda.min)) == length(coef(glm(y~., data=data.frame(y,x), family='binomial')))
[1] TRUE
Related Question