Solved – Elastic net: dealing with wide data with outliers

cooks-distance, elastic-net, high-dimensional, outliers, regularization

Recently I was working on a dataset with ~300 observations and 1500 predictors. I used the glmnet package in R to fit an elastic net model, which gave me a cross-validated (regularised) R-squared of 99%. Subject matter experts suggested that the data might contain influential/leverage points that were distorting the model fit. To test this, I refit the model on an 80% subsample, using the remaining 20% as a validation dataset. Sure enough, my R-squared on the validation data dropped to 10%.

What are the suggested strategies for detecting/handling outliers and leverage points in wide datasets? The standard definitions for leverage and Cook's distance involve calculating the hat matrix; does this still make sense for a regularised model with $p \gg n$?
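For concreteness, the standard definitions I have in mind: with the OLS hat matrix

$$H = X(X^\top X)^{-1} X^\top, \qquad \hat{y} = Hy,$$

the leverage of observation $i$ is $h_{ii}$, and its Cook's distance is

$$D_i = \frac{e_i^2}{k\,s^2} \cdot \frac{h_{ii}}{(1 - h_{ii})^2},$$

where $e_i$ is the residual, $k$ the number of parameters, and $s^2$ the residual variance estimate. With $p \gg n$, $X^\top X$ is singular, so $H$ is not even defined without some form of regularisation, which is why I am unsure these definitions carry over.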

Also, is there any R package that robustifies the basic elastic net algorithm to handle outliers and influential points? (I realise that it may be hard to do this sensibly for a 1500-dimensional problem.)

Best Answer

A couple of things I thought I would mention. It is hard to be specific without actually looking at the results, but I hope this is helpful. Most of these things you probably already know, but just in case you missed something:

1) Regarding the 99% vs. 10% discrepancy: check whether you used glmnet or cv.glmnet. You should be using cv.glmnet; the former fits the coefficient path on the entire dataset without cross-validating the penalty parameter, so 99% looks like an overfit estimate. You may not have made a mistake, but there is no harm in asking. A minimal sketch follows.
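A minimal sketch of the cv.glmnet workflow (the data here are simulated as a stand-in for the real dataset, and alpha = 0.5 is an illustrative choice):

```r
library(glmnet)

set.seed(1)
n <- 300; p <- 1500                         # dimensions from the question
x <- matrix(rnorm(n * p), n, p)             # simulated stand-in for the real data
y <- drop(x[, 1:5] %*% rnorm(5)) + rnorm(n)

# cv.glmnet cross-validates the penalty lambda; plain glmnet() only fits the
# coefficient path and leaves the choice of lambda entirely to you.
cvfit <- cv.glmnet(x, y, alpha = 0.5, nfolds = 10)  # alpha = 0.5: elastic net
cvfit$lambda.min                         # lambda minimising the CV error
sum(coef(cvfit, s = "lambda.1se") != 0)  # nonzero coefficients at the 1-SE rule
```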

2) Since p >> N, every point is technically an outlier (curse of dimensionality), so I am not quite sure what you mean. Nevertheless, there is a technique called the Bolasso, or bootstrapped lasso. In essence it runs sub-sampling experiments like the one you tried and retains only those predictors that appear in more than 80% of the lasso fits. Needless to say it is slow, but the predictors it selects have some nice asymptotic properties; a rough sketch follows the reference below.

See F. Bach, "Bolasso: Model Consistent Lasso Estimation through the Bootstrap" (ICML 2008): http://www.di.ens.fr/~fbach/fbach_bolasso_icml2008.pdf
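A rough sketch of the idea using glmnet (the bootstrap count, the 80% threshold, and the use of lambda.min are illustrative assumptions on my part, not the exact procedure from the paper):

```r
library(glmnet)

# Bolasso-style selection: fit the lasso on bootstrap resamples and keep
# only predictors selected in more than `threshold` of the fits.
bolasso_select <- function(x, y, n_boot = 100, threshold = 0.8) {
  hits <- matrix(FALSE, n_boot, ncol(x))
  for (b in seq_len(n_boot)) {
    idx  <- sample(nrow(x), replace = TRUE)             # bootstrap resample
    fit  <- cv.glmnet(x[idx, ], y[idx], alpha = 1)      # alpha = 1: lasso
    beta <- as.vector(coef(fit, s = "lambda.min"))[-1]  # drop the intercept
    hits[b, ] <- beta != 0
  }
  which(colMeans(hits) > threshold)  # indices of stably selected predictors
}

keep <- bolasso_select(x, y)  # x, y as in the sketch above
```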

3) As far as influential points go, again I am a little confused. What is an influential point here? Techniques like the lasso use an L1 penalty to generate sparsity, so what you end up with are influential PREDICTORS, not influential POINTS. If, on the other hand, you work with something like a support vector machine (SVM), you do get influential POINTS (support vectors, which I think are essentially what you are describing).
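To illustrate the contrast (using the e1071 package, which is my assumption for the SVM fit; any implementation that exposes its support vectors would do):

```r
library(e1071)

# An SVM exposes its influential POINTS directly as the support vectors.
svm_fit <- svm(x, y, kernel = "linear")  # eps-regression for numeric y
svm_fit$index  # row indices of the support vectors in x
```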

Sorry for not being able to be more specific; I hope these tips help.