SOLVED: an elastic net model, like any other logistic regression model, will not generate more coefficients than there are columns in its input matrix. See Zach's answer for how an apparently low number of input variables can still produce more coefficients: categorical variables get expanded into dummy columns. As the commenters pointed out, the discrepancy in this question was caused by a code bug.
This is a simple question. I've fitted a model with 1334 variables using elastic net to perform feature selection and regularization. I'm now trying to interpret the resulting coefficients in order to find correlations between the input variables and the output. The only problem is that instead of the expected 1335 coefficients (intercept + 1334), extracting the coefficients with coef(model, s="lambda.min")
yields around 1390 coefficients. This seems highly counterintuitive and stops me from mapping each coefficient to a single input variable, so I suppose I'm misunderstanding some of the internals of the elastic net. Any idea would be very helpful. Thanks in advance.
PS: in case anyone wonders, I've not included interaction terms or any synthetic variables, just the original 1334.
PS2: elastic net references:
- Mathematical paper: http://web.stanford.edu/~hastie/Papers/B67.2%20(2005)%20301-320%20Zou%20&%20Hastie.pdf
- R package tutorial: http://web.stanford.edu/~hastie/glmnet/glmnet_alpha.html
PS3: about the code used to fit the model:
It is a 250-line script, so unless you specifically need it, I think it would only clutter the question. Basically, the algorithm takes as input a data frame of 1393 columns, where the last one is the target variable and the first 1392 are the input variables. After separating those into two matrices, input and output, the actual model fitting is done in this call:
cv.glmnet(x = input_matrix, y = output_matrix, family = "binomial", type.measure = "auc")
If you need to, I can actually generate a reproducible file with the data I use and the whole script.
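Edit: here is a minimal toy sketch (made-up data, not my real frame) showing how a mismatch like mine can arise when a data frame contains factor columns, since model.matrix() widens the matrix beyond the number of variables:

```r
# Two variables, but x2 is a factor with three levels, so model.matrix()
# expands it into two dummy columns (plus the intercept column).
df <- data.frame(x1 = rnorm(10),
                 x2 = factor(rep(c("a", "b", "c"), length.out = 10)))
mm <- model.matrix(~ ., df)  # (Intercept), x1, x2b, x2c

ncol(df)  # 2 variables
ncol(mm)  # 4 columns
```

If the matrix passed to cv.glmnet() was built this way, the coefficient count follows the column count of the matrix, not the variable count of the original data frame.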
Best Answer
It's very likely you have some categorical variables in your input dataset that are being converted to dummy variables. For example, consider a model with four x variables ("Sepal.Width", "Petal.Length", "Petal.Width", and "Species") and one y variable ("Sepal.Length").
Surprisingly, we get 6 coefficients, instead of the expected 5:
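The answer's original code blocks did not survive; the check can be reconstructed in base R with the built-in iris data. "Species" is a factor with three levels, so the design matrix expands it into two dummy columns:

```r
# Build the design matrix a glmnet-style fit would see. The three-level
# factor Species becomes two treatment-contrast dummy columns.
x <- model.matrix(Sepal.Length ~ Sepal.Width + Petal.Length +
                    Petal.Width + Species, data = iris)

colnames(x)
# "(Intercept)" "Sepal.Width" "Petal.Length" "Petal.Width"
# "Speciesversicolor" "Speciesvirginica"
ncol(x)  # 6, not the 5 (intercept + 4 variables) we might expect
```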
However, after some investigation we realize the number of coefficients matches the number of columns in our input matrix:
Furthermore, this is the same number of coefficients we get from a glm model:
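A base-R sketch of that glm comparison (the answer's original snippet was lost): the factor expansion happens identically inside glm(), so it reports the same six coefficients.

```r
# Same formula through glm(): intercept + 3 numeric predictors
# + 2 Species dummies = 6 coefficients.
fit <- glm(Sepal.Length ~ Sepal.Width + Petal.Length + Petal.Width + Species,
           data = iris)

length(coef(fit))  # 6
names(coef(fit))   # includes Speciesversicolor and Speciesvirginica
```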
/edit: Here is an example with binary data. Note that the glmnet model produces the same number of coefficients as a glm model, and both of them produce the expected number of coefficients. So check your code with a glm, and compare the number of coefficients it returns against the number of columns in your input matrix. Also note that both of these examples are reproducible: if there WAS a bug present in glmnet (or, worse yet, glm), my example code would give the package authors (or the R core team) a first step toward identifying and fixing it.
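The lost binary-data example can be sketched with base R alone; the binary response here (whether a sepal is longer than the median) is a hypothetical stand-in, and the coefficient count is what matters:

```r
# Hypothetical binary response built from iris. With family = binomial,
# glm() again yields intercept + 3 numeric predictors + 2 Species
# dummies = 6 coefficients, the same count glmnet would report for an
# identically built design matrix.
iris2 <- iris
iris2$long_sepal <- as.integer(iris2$Sepal.Length > median(iris2$Sepal.Length))

fit <- glm(long_sepal ~ Sepal.Width + Petal.Length + Petal.Width + Species,
           data = iris2, family = binomial)

length(coef(fit))  # 6
```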