Solved – How to select a subset of variables from the original long list in order to perform logistic regression analysis

Tags: feature selection, logistic, r, spss

My situation:

  • small sample size: 116
  • binary outcome variable
  • long list of explanatory variables: 44
  • explanatory variables did not come from the top of my head; their choice was based on the literature.

Statistical test chosen: logistic regression

I need to find the variables that best explain variations in the outcome variable (I am not interested in making predictions).

The problem: This question is a follow-up on the 2 questions listed below. From them, I gathered that automated stepwise regression has its downsides, and in any case my sample seems too small for it. My sample also seems too small to enter all variables at once (using the SPSS 'Enter' method). This leaves my issue unresolved: how can I select a subset of variables from my original long list in order to perform multivariable logistic regression analysis?

UPDATE1: I am not a statistician, so I would appreciate it if jargon could be kept to a minimum. I am working with SPSS and am not familiar with other packages, so options that can be run in that software would be highly preferable.

UPDATE2: It seems that SPSS does not support LASSO for logistic regression. So, following one of your suggestions, I am now struggling with R. I have worked through the basics and managed to run a univariate logistic regression successfully using glm. But when I tried glmnet with the same dataset, I got an error message. How can I fix it? Below is the code I used, followed by the error message:

data1 <- read.table("C:\\data1.csv",header=TRUE,sep=";",na.string=99:9999)

y <- data1[,1]

x <- data1[,2:45]

glmnet(x,y,family="binomial",alpha=1)  

**in lognet(x, is.sparse, ix, jx, y, weights, offset, alpha, nobs,  : 
(list) object cannot be coerced to type 'double'**

UPDATE3: I got another error message, now related to missing values. My question concerning that matter is here.

Best Answer

You can perform variable selection and logistic regression simultaneously using the LASSO or Elastic Net. The basic idea behind the LASSO is to solve the $l_1$-penalized optimization problem $$\min_{\beta} \{ l(\beta) + \lambda||\beta||_1 \},$$ where $l(\cdot)$ is the negative log-likelihood and $\lambda \ge 0$ controls the strength of the penalty: the larger $\lambda$ is, the more coefficients are shrunk to exactly zero, which is what performs the selection. Popular implementations, e.g. glmnet, efficiently solve the problem over a grid of $\lambda$ values. This is useful because we usually don't know $\lambda$ a priori and need to choose it by some form of cross-validation. If you have correlated features, it helps to add some $l_2$ (ridge) penalty, which is the idea behind the Elastic Net.
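For reference, the Elastic Net objective can be written in the same notation, taking $l(\beta)$ as the negative log-likelihood. This follows the parameterization glmnet uses for its `alpha` argument, up to glmnet's scaling of the likelihood term by the sample size; the mixing weight $\alpha$ is not part of the formula above and is introduced here only for illustration:

$$\min_{\beta} \left\{ l(\beta) + \lambda \left[ \alpha ||\beta||_1 + \frac{1-\alpha}{2} ||\beta||_2^2 \right] \right\}, \qquad 0 \le \alpha \le 1.$$

Setting $\alpha = 1$ recovers the LASSO problem (this is the `alpha=1` in the question's glmnet call), while $\alpha = 0$ gives pure ridge regression.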

Since you don't have a lot of data, I think this is probably your best bet. If you want to use a separate variable selection stage you will need to choose a metric (e.g. deviance of single-variable regression) and also a threshold. The LASSO gives you only one parameter to tune and operates within the context of multivariable logistic regression models directly.
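Regarding the error in UPDATE2: glmnet expects `x` to be a numeric matrix, whereas `data1[,2:45]` is a data frame (a list internally), hence "(list) object cannot be coerced to type 'double'". Converting with `as.matrix()` should resolve it, provided all columns are numeric. Below is a minimal sketch of the whole procedure on simulated data, since I don't have your dataset; the seed, the simulated variables and the object names are all assumptions made for illustration. Note that glmnet also does not accept missing values, so rows with NAs have to be dealt with beforehand.

```r
# Minimal sketch: LASSO logistic regression with a cross-validated lambda.
# The data are simulated; sizes mimic the question (116 rows, 44 predictors).
library(glmnet)

set.seed(1)
n <- 116
p <- 44
x_df <- as.data.frame(matrix(rnorm(n * p), nrow = n))  # columns V1..V44
# Hypothetical truth: only V1 and V2 influence the binary outcome
y <- rbinom(n, size = 1, prob = plogis(x_df[[1]] - x_df[[2]]))

# glmnet needs a numeric matrix, not a data frame
x <- as.matrix(x_df)

# 10-fold cross-validation over glmnet's default grid of lambda values
cv_fit <- cv.glmnet(x, y, family = "binomial", alpha = 1, nfolds = 10)

# Coefficients at the lambda with the lowest cross-validated deviance;
# variables with a non-zero coefficient are the ones the LASSO retained
beta <- as.matrix(coef(cv_fit, s = "lambda.min"))
selected <- setdiff(rownames(beta)[beta[, 1] != 0], "(Intercept)")
print(selected)
```

With your own data you would replace the simulated `x_df` and `y` by `data1[,2:45]` and `data1[,1]`. Using `s = "lambda.1se"` instead of `"lambda.min"` usually gives a sparser model, which may be preferable with only 116 observations.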

EDIT: The question now specifically requests an approach that is implemented in SPSS. As I don't have/use that software I don't know whether lasso logistic regression is implemented. Perhaps someone can let us know in the comments.