I'm currently working on a dataset and using the AIC criterion with R's step function to perform variable selection. This has reduced my model from 48 variables to 24:
Residuals:
Min 1Q Median 3Q Max
-2576.8 -190.4 -93.8 33.4 15548.2
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 270.472 16.482 16.410 < 2e-16 ***
order 382.376 33.294 11.485 < 2e-16 ***
report 232.535 25.863 8.991 < 2e-16 ***
`000` 152.409 27.783 5.486 4.36e-08 ***
font 59.870 8.333 7.185 7.94e-13 ***
re -33.898 8.701 -3.896 9.95e-05 ***
receive 248.213 51.176 4.850 1.28e-06 ***
people 113.395 29.255 3.876 0.000108 ***
you -25.046 5.503 -4.551 5.48e-06 ***
george -17.225 5.100 -3.377 0.000738 ***
money 66.370 21.565 3.078 0.002100 **
addresses 131.646 40.453 3.254 0.001146 **
remove -77.934 23.188 -3.361 0.000784 ***
`650` -34.678 19.511 -1.777 0.075575 .
meeting -34.344 11.319 -3.034 0.002427 **
make 77.576 30.773 2.521 0.011743 *
project -30.727 13.822 -2.223 0.026266 *
edu -20.688 9.757 -2.120 0.034028 *
pm -36.348 20.704 -1.756 0.079223 .
`1999` 51.903 23.057 2.251 0.024430 *
original -70.763 41.520 -1.704 0.088396 .
labs -39.932 23.732 -1.683 0.092524 .
mail 23.066 13.974 1.651 0.098894 .
your -12.496 8.503 -1.470 0.141754
free -16.432 11.457 -1.434 0.151583
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 575.9 on 4185 degrees of freedom
Multiple R-squared: 0.1385, Adjusted R-squared: 0.1336
F-statistic: 28.04 on 24 and 4185 DF, p-value: < 2.2e-16
Now, the variables "free" and "your" don't seem to be significant. To decide whether to keep them in my model, I compared the models with and without "your", and with and without "free", by repeating a 10-fold cross-validation 20 times.
Here's my code:
# --------- testing if "your" is useful to the model ---------
me <- 0
mf <- 0
for (i in 1:20) {
  e <- cv.lm(model, fit3, m = 10, printit = FALSE, seed = i)  # fit3 is the model without "your"
  f <- cv.lm(model, fit,  m = 10, printit = FALSE, seed = i)  # fit is the unrestricted model
  me <- me + attr(e, "ms")
  mf <- mf + attr(f, "ms")
}
me <- me / 20  # result: 334718
mf <- mf / 20  # result: 334659 ==> we keep "your"

# --------- testing if "free" is useful to the model ---------
me <- 0
mf <- 0
for (i in 1:20) {
  e <- cv.lm(model, fit2, m = 10, printit = FALSE, seed = i)  # fit2 is the model without "free"
  f <- cv.lm(model, fit,  m = 10, printit = FALSE, seed = i)  # fit is the unrestricted model
  me <- me + attr(e, "ms")
  mf <- mf + attr(f, "ms")
}
me <- me / 20  # result: 334726
mf <- mf / 20  # result: 334659 ==> we keep "free"
Now my questions are the following:
- Is that a good way to proceed? (Meaning: should I use CV to determine whether to keep those variables in my model or not?)
- Or should I perform some kind of hypothesis test to assess the relevance of those two variables?
Best Answer
First, as always, be careful with any kind of automated variable selection. Start with subject-matter expertise if you can, and be sure to do EDA (exploratory data analysis).
Second, is your goal primarily good predictive performance rather than subject-matter understanding? You seem to have thousands of observations and only tens of variables, so you may not really need to drop any: even if some are spurious, leaving them in should barely affect performance. You may also get more benefit from variable transformations, interaction terms, or something other than linear regression.
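To make that last point concrete, here is a rough sketch of the kinds of alternatives I mean. The names `dat` and `y` are hypothetical stand-ins for your data frame and response (I don't know what yours are called), and glmnet is just one example of a penalized-regression package:

```r
## Sketch only: `dat`/`y` are hypothetical names; `fit` is your step() model.

# A variance-stabilizing transform of a skewed response
fit_log <- update(fit, log1p(y) ~ .)

# An interaction between two predictors that plausibly act together
fit_int <- update(fit, . ~ . + free:money)

# Or replace OLS entirely, e.g. the lasso via glmnet,
# which does variable selection and shrinkage in one step
library(glmnet)
x <- model.matrix(fit)[, -1]            # design matrix without the intercept
cvfit <- cv.glmnet(x, dat$y, alpha = 1) # lasso with built-in 10-fold CV
coef(cvfit, s = "lambda.1se")           # sparse coefficient vector
```

Any of these could easily buy you more than agonizing over two borderline terms.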
However, to answer your question directly... if you really do need to have a sparse, interpretable model, and if a linear model really is suitable, and if the dataset is large, with a high true signal-to-noise ratio... then your procedure might not be that bad, by ideal-case statistical theory. As sample sizes get very large, AIC tends to select models that are a little too big (too many variables). K-fold cross-validation tends to pick models which are still too big, but not as big as AIC's. So there's some justification for using AIC as a "cheap" first pass to whittle down your model, then using CV as an "expensive" second pass to cut it down further. A theoretician could have some fun studying your procedure further. But for statistical practice, where the ideal case never holds, this is really quite a stretch!
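If you do go this route, the whole two-pass procedure can be written more compactly than the loops in the question. Again `dat` and `y` are hypothetical names for your data frame and response; `DAAG::cv.lm` is the same function you are already using:

```r
## Sketch of the cheap-AIC-pass-then-expensive-CV-pass idea.
library(DAAG)

full    <- lm(y ~ ., data = dat)                    # all 48 candidates
reduced <- step(full, direction = "backward", trace = 0)  # pass 1: AIC

# Pass 2: mean CV mean-square over 20 repeats of 10-fold CV
cv_ms <- function(fml) {
  mean(sapply(1:20, function(i)
    attr(cv.lm(dat, fml, m = 10, printit = FALSE, seed = i), "ms")))
}

cv_ms(formula(reduced))                       # the AIC-selected model
cv_ms(update(formula(reduced), . ~ . - free)) # the same model without "free"
```

Whichever formula gives the smaller CV mean square wins, which is exactly the comparison your loops perform.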
Finally, definitely don't do a hypothesis test. The p-values above are all wrong, because they do not account for the fact that you chose this model by a stepwise procedure. There is some quite recent work on legitimate p-values for stepwise regression, but not yet on combining it with AIC.
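For reference, one implementation of that recent work is the selectiveInference package. If I recall its interface correctly, it only covers its own forward-stepwise path (not step()'s AIC search), so this is a substitute procedure rather than a fix for the p-values above; `dat` and `y` are again hypothetical names:

```r
## Sketch only: selection-adjusted p-values for forward stepwise regression.
library(selectiveInference)

x <- model.matrix(y ~ ., data = dat)[, -1]  # predictors as a numeric matrix
fsfit <- fs(x, dat$y)                       # forward stepwise path
fsInf(fsfit)                                # p-values valid conditional on
                                            # the selection that was made
```

These p-values are computed conditional on the selection event, which is what makes them legitimate where the naive summary() p-values are not.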