Apparently the stepwise procedure in R is not a good way to automatically select the best GLM. Several sources suggest using the lasso instead. I had a look at the glmnet package, but I do not understand how to include 2-way interaction terms in this package. Also, if somebody knows a web page that explains the lasso approach in a not-too-mathematical way, I would appreciate it a lot.
Solved – GLM Interaction Lasso
generalized linear model, interaction, lasso
Related Solutions
This is not a solution but some reflections on the possibilities and difficulties that I know of.
Whenever it is possible to specify a time-series model as $$Y_{t+1} = \mathbf{x}_t \beta + \epsilon_{t+1}$$ with $\mathbf{x}_t$ computable from covariates and time-lagged observations, it is also possible to compute the least-squares elastic net penalized estimator of $\beta$ using glmnet in R. It requires that you write code to compute $\mathbf{x}_t$ and form the model matrix that is passed to glmnet. This works for AR models, say, but not directly for ARMA models. Moreover, the cross-validation procedures of glmnet are not sensible per se for time-series data.
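For concreteness, here is a minimal sketch of the AR case, assuming a univariate series and using embed() to build the lagged model matrix for glmnet; the simulated series, the lag order, and the lambda value are made up purely for illustration:

```r
library(glmnet)

set.seed(1)
y <- as.numeric(arima.sim(list(ar = 0.5), n = 200))  # toy AR(1) series

p <- 5                             # maximal lag order to consider
Z <- embed(y, p + 1)               # row t holds (y_t, y_{t-1}, ..., y_{t-p})
x <- Z[, -1]                       # lagged covariates, i.e. x_t above
resp <- Z[, 1]                     # response, i.e. Y_{t+1} above

fit <- glmnet(x, resp, alpha = 1)  # alpha = 1 gives the pure lasso penalty
coef(fit, s = 0.1)                 # coefficients at one (arbitrary) lambda
```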
For more general models
$$Y_{t+1} = f(\mathbf{x}_t, \beta) + \epsilon_{t+1}$$
an implementation of an algorithm for computing the non-linear least-squares elastic net penalized estimator of $\beta$ is needed. To the best of my knowledge there is no such implementation in R. I am currently writing an implementation to solve the case where
$$Y_{t+1} = \mathbf{x}_t g(\beta) + \epsilon_{t+1}$$
the point being that it is paramount for model selection that the lasso penalization is on $\beta$ and not on $g(\beta)$. If I recall the ARIMA parametrization correctly, it also takes this form, but I cannot offer any code at the moment. It is (will be) based on "A coordinate gradient descent method for nonsmooth separable minimization" (Tseng & Yun, 2009).
Another issue is the selection of the amount of penalization (the tuning parameters). It will generally require a form of cross-validation suited to time series, but I hope to be able to work out some less computationally demanding methods for specific models.
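As a rough illustration of what such a scheme could look like, here is a sketch of rolling-origin (forward-chaining) validation for the lasso penalty, reusing the AR design from the sketch above; the 70% split point is an arbitrary choice:

```r
library(glmnet)

set.seed(1)
y <- as.numeric(arima.sim(list(ar = 0.5), n = 200))
Z <- embed(y, 6); x <- Z[, -1]; resp <- Z[, 1]    # lagged design as above

lambdas <- glmnet(x, resp, alpha = 1)$lambda      # candidate penalty values
origins <- seq(floor(nrow(x) * 0.7), nrow(x) - 1) # forecast origins

## one-step-ahead squared forecast errors: one row per lambda, one column per origin
err <- sapply(origins, function(t0) {
  fit <- glmnet(x[1:t0, ], resp[1:t0], alpha = 1, lambda = lambdas)
  pred <- predict(fit, newx = x[t0 + 1, , drop = FALSE])
  (pred - resp[t0 + 1])^2
})

lambdas[which.min(rowMeans(err))]                 # lambda minimizing mean error
```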
Solved – How to handle missing values when preparing data for feature selection with LASSO
When a continuous predictor $x$ contains 'not applicable' values it's often useful to code it using two variables:
$$ x_1 = \begin{cases} c & \text{when $x$ is not applicable}\\ x & \text{otherwise} \end{cases} $$
where $c$ is a constant, &
$$ x_2 = \begin{cases} 1 & \text{when $x$ is not applicable}\\ 0 & \text{otherwise} \end{cases} $$
Suppose the linear predictor for the response is given by
$$\eta = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \ldots$$
which resolves to
$$\eta = \beta_0 + \beta_1 x_1 + \ldots$$
when $x$ is measured, or to
$$\eta = \beta_0 + \beta_1 c + \beta_2 + \ldots$$
when $x$ is 'not applicable'. The choice of $c$ is arbitrary, & does not affect the estimates of the intercept $\beta_0$ or the slope $\beta_1$; $\beta_2$ describes the effect of $x$'s being 'not applicable' compared to when $x=c$.
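In R, assuming the 'not applicable' values are stored as NA, the coding might look like this (with $c$ arbitrarily chosen as 0):

```r
x  <- c(1.2, 3.4, NA, 2.1, NA)   # toy predictor; NA marks 'not applicable'
c0 <- 0                          # the arbitrary constant c
x1 <- ifelse(is.na(x), c0, x)    # x where measured, c where not applicable
x2 <- as.numeric(is.na(x))       # indicator: 1 when x is 'not applicable'
```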
This isn't a suitable approach when the response varies according to an unknown value of $x$: the variability of the 'missing' group will be inflated, & estimates of other predictors' coefficients biased owing to confounding. Better to impute missing values.
Use of LASSO introduces two problems:
- The choice of $c$ affects the results as the amount of shrinkage applied depends on the magnitudes of the coefficient estimates.
- You need to ensure that $x_1$ & $x_2$ are either both in or both out of the model selected.
You can solve both of these by instead using the group LASSO with a group comprising $x_1$ & $x_2$: the $L_1$-norm penalty is applied to the $L_2$-norm of the orthonormalized matrix $\left[\vec{x_1}\ \vec{x_2}\right]$. (Categorical predictors are the poster child for group LASSO—you'd just code 'not applicable' as a separate level, as often done in unpenalized regression.) See Meier et al. (2008), "The group lasso for logistic regression", JRSS B, 70(1), & the grplasso package.
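A minimal sketch with the grplasso package, assuming a binary response and the $x_1$/$x_2$ coding above; the data, group labels, and lambda value are invented for illustration. The index vector marks the unpenalized intercept with NA and gives $x_1$ & $x_2$ the same group number, so they enter or leave the model together:

```r
library(grplasso)

set.seed(1)
n <- 100
x <- rnorm(n); x[sample(n, 20)] <- NA   # predictor with 'not applicable' values
z <- rnorm(n)                           # a second, ungrouped predictor
y <- rbinom(n, 1, 0.5)                  # toy binary response

x1 <- ifelse(is.na(x), 0, x)            # the coding above, with c = 0
x2 <- as.numeric(is.na(x))

X     <- cbind(Intercept = 1, x1, x2, z)
index <- c(NA, 1, 1, 2)                 # NA: unpenalized; x1 & x2 share group 1

fit <- grplasso(X, y, index = index, lambda = 10, model = LogReg())
coef(fit)                               # x1 & x2 are zero or nonzero jointly
```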
Best Answer
I don't think the two comments are wrong, but I think the $\textit{best}$ discussion is in Ch. 6 of 'Introduction to Statistical Learning with R'. The book discusses both best-subset selection and regularization techniques (ridge, lasso, etc.), and the chapter ends with an R lab that walks through the code for both methods. If by 'two-way' interactions you mean simply a regression with quadratic terms in the variables, that is also explained in the book. The book's website has links to a PDF of the book, the code from the book, and various other things: http://www-bcf.usc.edu/~gareth/ISL/
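To address the original question directly: with glmnet you build the expanded design matrix yourself, e.g. with model.matrix(), where the formula shorthand (x1 + x2 + x3)^2 gives all main effects plus all two-way interactions. A minimal sketch with made-up data:

```r
library(glmnet)

set.seed(1)
df <- data.frame(y = rnorm(100), x1 = rnorm(100),
                 x2 = rnorm(100), x3 = rnorm(100))

## (x1 + x2 + x3)^2 expands to all main effects + all 2-way interactions
X <- model.matrix(y ~ (x1 + x2 + x3)^2, data = df)[, -1]  # drop intercept

fit <- cv.glmnet(X, df$y, alpha = 1)   # lasso, lambda chosen by 10-fold CV
coef(fit, s = "lambda.min")            # coefficients at the CV-chosen lambda
```

One caveat: the plain lasso may keep an interaction while dropping its main effects; if you want that hierarchy respected, you need extra machinery (e.g. the hierNet package).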