Solved – Variable selection: combining stepwise regression and cross-validation

Tags: aic, cross-validation, feature selection, r, stepwise regression

I'm currently working on a dataset and using the AIC criterion with the step function in R to perform variable selection. Doing so reduced my model from 48 variables to 24:

Residuals:
    Min      1Q  Median      3Q     Max 
-2576.8  -190.4   -93.8    33.4 15548.2 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)  270.472     16.482  16.410  < 2e-16 ***
order        382.376     33.294  11.485  < 2e-16 ***
report       232.535     25.863   8.991  < 2e-16 ***
`000`        152.409     27.783   5.486 4.36e-08 ***
font          59.870      8.333   7.185 7.94e-13 ***
re           -33.898      8.701  -3.896 9.95e-05 ***
receive      248.213     51.176   4.850 1.28e-06 ***
people       113.395     29.255   3.876 0.000108 ***
you          -25.046      5.503  -4.551 5.48e-06 ***
george       -17.225      5.100  -3.377 0.000738 ***
money         66.370     21.565   3.078 0.002100 ** 
addresses    131.646     40.453   3.254 0.001146 ** 
remove       -77.934     23.188  -3.361 0.000784 ***
`650`        -34.678     19.511  -1.777 0.075575 .  
meeting      -34.344     11.319  -3.034 0.002427 ** 
make          77.576     30.773   2.521 0.011743 *  
project      -30.727     13.822  -2.223 0.026266 *  
edu          -20.688      9.757  -2.120 0.034028 *  
pm           -36.348     20.704  -1.756 0.079223 .  
`1999`        51.903     23.057   2.251 0.024430 *  
original     -70.763     41.520  -1.704 0.088396 .  
labs         -39.932     23.732  -1.683 0.092524 .  
mail          23.066     13.974   1.651 0.098894 .  
your         -12.496      8.503  -1.470 0.141754    
free         -16.432     11.457  -1.434 0.151583    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 575.9 on 4185 degrees of freedom
Multiple R-squared:  0.1385,    Adjusted R-squared:  0.1336 
F-statistic: 28.04 on 24 and 4185 DF,  p-value: < 2.2e-16
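
For reference, the selection above was produced by a call along these lines (a sketch; `full`, `spam`, and `y` are placeholder names for my 48-variable fit, my data frame, and my response):

full <- lm(y ~ ., data = spam)                     # fit on all 48 variables
fit  <- step(full, direction = "both", trace = 0)  # AIC-guided stepwise search
summary(fit)                                       # the 24-variable model shown above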

Now, the variables free and your don't appear to be significant. To decide whether to keep them in my model, I compared the models with and without your, and with and without free, by repeating a 10-fold cross-validation 20 times.
Here's my code:

# ---- testing if "your" is useful to the model ----
library(DAAG)  # provides cv.lm(); its first argument is the data frame (here named `model`)
me <- 0
mf <- 0
for (i in 1:20) {
  e <- cv.lm(model, fit3, m = 10, printit = FALSE, seed = i)  # fit3 is the model without "your"
  f <- cv.lm(model, fit,  m = 10, printit = FALSE, seed = i)  # fit is the unrestricted model
  me <- me + attr(e, "ms")  # "ms" is the cross-validated mean squared error
  mf <- mf + attr(f, "ms")
}
me <- me / 20  # result: 334718
mf <- mf / 20  # result: 334659 ==> we keep "your"

# ---- testing if "free" is useful to the model ----

me <- 0
mf <- 0
for (i in 1:20) {
  e <- cv.lm(model, fit2, m = 10, printit = FALSE, seed = i)  # fit2 is the model without "free"
  f <- cv.lm(model, fit,  m = 10, printit = FALSE, seed = i)  # fit is the unrestricted model
  me <- me + attr(e, "ms")
  mf <- mf + attr(f, "ms")
}
me <- me / 20  # result: 334726
mf <- mf / 20  # result: 334659 ==> we keep "free"

Now my questions are the following:

  • Is that a good way to proceed? (Meaning: should I use CV to determine whether to keep those variables in my model or not?)
  • Or should I perform some kind of hypothesis test to assess the relevance of those two variables?

Best Answer

First, as always, be careful with any kind of automated variable selection. Start with subject-matter expertise if you can, and be sure to do exploratory data analysis (EDA).

Second, is your goal primarily good predictive performance rather than subject-matter understanding? You seem to have thousands of observations and only tens of variables, so you may not really need to drop any: even if some are spurious, leaving them in should barely affect performance. You may also benefit more from variable transformations, interaction terms, or something other than plain linear regression.
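
For instance (a sketch, not a recipe: `spam` is a placeholder data frame with response `y`, and `order`, `report`, `font` are borrowed from your summary output; the lasso via glmnet is one concrete "something other than plain linear regression"):

library(glmnet)

# transformations and an interaction term in an ordinary linear model
fit_tx <- lm(y ~ log1p(order) + report * font, data = spam)

# or let the lasso shrink/select coefficients, with the penalty
# chosen by cross-validation rather than a stepwise search
X     <- model.matrix(y ~ ., data = spam)[, -1]  # drop the intercept column
cvfit <- cv.glmnet(X, spam$y, alpha = 1)         # alpha = 1 => lasso
coef(cvfit, s = "lambda.1se")                    # sparse coefficient vector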

However, to answer your question directly... if you really do need to have a sparse, interpretable model, and if a linear model really is suitable, and if the dataset is large, with a high true signal-to-noise ratio... then your procedure might not be that bad, by ideal-case statistical theory. As sample sizes get very large, AIC tends to select models that are a little too big (too many variables). K-fold cross-validation tends to pick models which are still too big, but not as big as AIC's. So there's some justification for using AIC as a "cheap" first pass to whittle down your model, then using CV as an "expensive" second pass to cut it down further. A theoretician could have some fun studying your procedure further. But for statistical practice, where the ideal case never holds, this is really quite a stretch!
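
Concretely, the two-pass idea might look like this (a sketch reusing DAAG::cv.lm from your question; `spam` and `y` are again placeholder names):

library(DAAG)

full  <- lm(y ~ ., data = spam)
pass1 <- step(full, trace = 0)               # cheap first pass: AIC-based stepwise

# expensive second pass: repeated 10-fold CV on a further-reduced candidate
pass2 <- update(pass1, . ~ . - your - free)  # drop the borderline terms
cv_ms <- function(mod) mean(sapply(1:20, function(i)
  attr(cv.lm(spam, mod, m = 10, printit = FALSE, seed = i), "ms")))
c(pass1 = cv_ms(pass1), pass2 = cv_ms(pass2))  # keep whichever has the lower CV mean square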

Finally, definitely don't do a hypothesis test. The p-values above are all wrong, because they do not account for the fact that you chose this model by a stepwise procedure. There is some quite recent work on legitimate p-values for stepwise regression, but not yet on combining it with AIC.
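
(For what it's worth, the selectiveInference R package is one implementation of that recent work: it computes p-values that account for forward-stepwise selection. A minimal sketch, with `X` and `y` as placeholders for the design matrix and response:)

library(selectiveInference)

fsfit <- fs(X, y)  # forward stepwise path
fsInf(fsfit)       # selection-adjusted p-values for the selected variables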
