Solved – Elastic net: dealing with wide data with outliers

cooks-distance, elastic-net, high-dimensional, outliers, regularization

Recently I was working on a dataset with ~300 observations and 1500 predictors. I used the glmnet package in R to fit an elastic net model, which gave me a cross-validated (regularised) R-squared of 99%. Subject matter experts suggested that the data might contain influential/leverage points that were distorting the model fit. To test this, I refit the model on an 80% subsample, using the remaining 20% as a validation dataset. Sure enough, my R-squared on the validation data dropped to 10%.

What are the suggested strategies for detecting/handling outliers and leverage points in wide datasets? The standard definitions for leverage and Cook's distance involve calculating the hat matrix; does this still make sense for a regularised model with $p \gg n$?
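For concreteness, the standard definitions I have in mind: with the OLS hat matrix

$$H = X(X^\top X)^{-1} X^\top, \qquad \hat{y} = Hy,$$

the leverage of observation $i$ is $h_{ii}$, and its Cook's distance is

$$D_i = \frac{e_i^2}{k\,s^2} \cdot \frac{h_{ii}}{(1 - h_{ii})^2},$$

where $e_i$ is the residual, $k$ the number of parameters, and $s^2$ the residual variance estimate. With $p \gg n$, $X^\top X$ is singular, so $H$ is not even defined without some form of regularisation, which is why I am unsure these definitions carry over.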

Also, is there any R package that robustifies the basic elastic net algorithm to handle outliers and influential points? (I realise that it may be hard to do this sensibly for a 1500-dimensional problem.)

Best Answer

A couple of things I thought I would mention. It is hard to be specific without actually looking at the results, but I hope this is helpful. Most of these things you probably already know, but just in case you missed something:

1) Regarding the 99% vs. 10% discrepancy: check whether you used glmnet or cv.glmnet. You should be using cv.glmnet; the former fits the coefficient path on the entire dataset without cross-validating the penalty parameter, so 99% looks like an overfit estimate. You may not have made a mistake, but there is no harm in asking. A minimal sketch follows.
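A minimal sketch of the cv.glmnet workflow (the data here are simulated as a stand-in for the real dataset, and alpha = 0.5 is an illustrative choice):

```r
library(glmnet)

set.seed(1)
n <- 300; p <- 1500                         # dimensions from the question
x <- matrix(rnorm(n * p), n, p)             # simulated stand-in for the real data
y <- drop(x[, 1:5] %*% rnorm(5)) + rnorm(n)

# cv.glmnet cross-validates the penalty lambda; plain glmnet() only fits the
# coefficient path and leaves the choice of lambda entirely to you.
cvfit <- cv.glmnet(x, y, alpha = 0.5, nfolds = 10)  # alpha = 0.5: elastic net
cvfit$lambda.min                         # lambda minimising the CV error
sum(coef(cvfit, s = "lambda.1se") != 0)  # nonzero coefficients at the 1-SE rule
```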

2) Since p >> N, every point is technically an outlier (curse of dimensionality), so I am not quite sure what you mean. Nevertheless, there is a technique called the Bolasso, or bootstrapped lasso. In essence it runs sub-sampling experiments like the one you tried and retains only those predictors that appear in more than 80% of the lasso fits. Needless to say it is slow, but the predictors it selects have some nice asymptotic properties; a rough sketch follows the reference below.

See F. Bach, "Bolasso: Model Consistent Lasso Estimation through the Bootstrap" (ICML 2008): http://www.di.ens.fr/~fbach/fbach_bolasso_icml2008.pdf
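A rough sketch of the idea using glmnet (the bootstrap count, the 80% threshold, and the use of lambda.min are illustrative assumptions on my part, not the exact procedure from the paper):

```r
library(glmnet)

# Bolasso-style selection: fit the lasso on bootstrap resamples and keep
# only predictors selected in more than `threshold` of the fits.
bolasso_select <- function(x, y, n_boot = 100, threshold = 0.8) {
  hits <- matrix(FALSE, n_boot, ncol(x))
  for (b in seq_len(n_boot)) {
    idx  <- sample(nrow(x), replace = TRUE)             # bootstrap resample
    fit  <- cv.glmnet(x[idx, ], y[idx], alpha = 1)      # alpha = 1: lasso
    beta <- as.vector(coef(fit, s = "lambda.min"))[-1]  # drop the intercept
    hits[b, ] <- beta != 0
  }
  which(colMeans(hits) > threshold)  # indices of stably selected predictors
}

keep <- bolasso_select(x, y)  # x, y as in the sketch above
```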

3) As far as influential points go, again I am a little confused. What is an influential point here? Techniques like the lasso use an L1 penalty to generate sparsity, so what you end up with are influential PREDICTORS, not influential POINTS. If, on the other hand, you work with something like a support vector machine (SVM), you do get influential POINTS (support vectors, which I think are essentially what you are describing).
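To illustrate the contrast (using the e1071 package, which is my assumption for the SVM fit; any implementation that exposes its support vectors would do):

```r
library(e1071)

# An SVM exposes its influential POINTS directly as the support vectors.
svm_fit <- svm(x, y, kernel = "linear")  # eps-regression for numeric y
svm_fit$index  # row indices of the support vectors in x
```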

Sorry for not being able to be more specific; I hope these tips help.