Solved – Variable selection: combining stepwise regression and cross-validation

Tags: aic, cross-validation, feature selection, r, stepwise regression

I'm currently working on a dataset and using the AIC criterion with the step function in R to perform variable selection. Doing so reduced my model from 48 variables to 24:

Residuals:
    Min      1Q  Median      3Q     Max 
-2576.8  -190.4   -93.8    33.4 15548.2 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)  270.472     16.482  16.410  < 2e-16 ***
order        382.376     33.294  11.485  < 2e-16 ***
report       232.535     25.863   8.991  < 2e-16 ***
`000`        152.409     27.783   5.486 4.36e-08 ***
font          59.870      8.333   7.185 7.94e-13 ***
re           -33.898      8.701  -3.896 9.95e-05 ***
receive      248.213     51.176   4.850 1.28e-06 ***
people       113.395     29.255   3.876 0.000108 ***
you          -25.046      5.503  -4.551 5.48e-06 ***
george       -17.225      5.100  -3.377 0.000738 ***
money         66.370     21.565   3.078 0.002100 ** 
addresses    131.646     40.453   3.254 0.001146 ** 
remove       -77.934     23.188  -3.361 0.000784 ***
`650`        -34.678     19.511  -1.777 0.075575 .  
meeting      -34.344     11.319  -3.034 0.002427 ** 
make          77.576     30.773   2.521 0.011743 *  
project      -30.727     13.822  -2.223 0.026266 *  
edu          -20.688      9.757  -2.120 0.034028 *  
pm           -36.348     20.704  -1.756 0.079223 .  
`1999`        51.903     23.057   2.251 0.024430 *  
original     -70.763     41.520  -1.704 0.088396 .  
labs         -39.932     23.732  -1.683 0.092524 .  
mail          23.066     13.974   1.651 0.098894 .  
your         -12.496      8.503  -1.470 0.141754    
free         -16.432     11.457  -1.434 0.151583    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 575.9 on 4185 degrees of freedom
Multiple R-squared:  0.1385,    Adjusted R-squared:  0.1336 
F-statistic: 28.04 on 24 and 4185 DF,  p-value: < 2.2e-16
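
For reference, the selection above was produced by a call along these lines (a sketch; `full`, `spam`, and `y` are placeholder names for my 48-variable fit, my data frame, and my response):

full <- lm(y ~ ., data = spam)                     # fit on all 48 variables
fit  <- step(full, direction = "both", trace = 0)  # AIC-guided stepwise search
summary(fit)                                       # the 24-variable model shown above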

Now, the variables free and your don't appear to be significant. To decide whether to keep them in my model, I compared the models with and without your, and with and without free, by repeating a 10-fold cross-validation 20 times.
Here's my code:

# ---- testing if "your" is useful to the model ----
library(DAAG)  # provides cv.lm(); its first argument is the data frame (here named `model`)
me <- 0
mf <- 0
for (i in 1:20) {
  e <- cv.lm(model, fit3, m = 10, printit = FALSE, seed = i)  # fit3 is the model without "your"
  f <- cv.lm(model, fit,  m = 10, printit = FALSE, seed = i)  # fit is the unrestricted model
  me <- me + attr(e, "ms")  # "ms" is the cross-validated mean squared error
  mf <- mf + attr(f, "ms")
}
me <- me / 20  # result: 334718
mf <- mf / 20  # result: 334659 ==> we keep "your"

# ---- testing if "free" is useful to the model ----

me <- 0
mf <- 0
for (i in 1:20) {
  e <- cv.lm(model, fit2, m = 10, printit = FALSE, seed = i)  # fit2 is the model without "free"
  f <- cv.lm(model, fit,  m = 10, printit = FALSE, seed = i)  # fit is the unrestricted model
  me <- me + attr(e, "ms")
  mf <- mf + attr(f, "ms")
}
me <- me / 20  # result: 334726
mf <- mf / 20  # result: 334659 ==> we keep "free"

Now my questions are the following:

  • Is that a good way to proceed? (Meaning: should I use CV to determine whether to keep those variables in my model or not?)
  • Or should I perform some kind of hypothesis test to assess the relevance of those two variables?

Best Answer

First, as always, be careful with any kind of automated variable selection. Start with subject-matter expertise if you can, and be sure to do exploratory data analysis (EDA).

Second, is your goal primarily good predictive performance rather than subject-matter understanding? You seem to have thousands of observations and only tens of variables, so you may not really need to drop any: even if some are spurious, leaving them in should barely affect performance. You may also benefit more from variable transformations, interaction terms, or something other than plain linear regression.
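
For instance (a sketch, not a recipe: `spam` is a placeholder data frame with response `y`, and `order`, `report`, `font` are borrowed from your summary output; the lasso via glmnet is one concrete "something other than plain linear regression"):

library(glmnet)

# transformations and an interaction term in an ordinary linear model
fit_tx <- lm(y ~ log1p(order) + report * font, data = spam)

# or let the lasso shrink/select coefficients, with the penalty
# chosen by cross-validation rather than a stepwise search
X     <- model.matrix(y ~ ., data = spam)[, -1]  # drop the intercept column
cvfit <- cv.glmnet(X, spam$y, alpha = 1)         # alpha = 1 => lasso
coef(cvfit, s = "lambda.1se")                    # sparse coefficient vector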

However, to answer your question directly... if you really do need to have a sparse, interpretable model, and if a linear model really is suitable, and if the dataset is large, with a high true signal-to-noise ratio... then your procedure might not be that bad, by ideal-case statistical theory. As sample sizes get very large, AIC tends to select models that are a little too big (too many variables). K-fold cross-validation tends to pick models which are still too big, but not as big as AIC's. So there's some justification for using AIC as a "cheap" first pass to whittle down your model, then using CV as an "expensive" second pass to cut it down further. A theoretician could have some fun studying your procedure further. But for statistical practice, where the ideal case never holds, this is really quite a stretch!
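
Concretely, the two-pass idea might look like this (a sketch reusing DAAG::cv.lm from your question; `spam` and `y` are again placeholder names):

library(DAAG)

full  <- lm(y ~ ., data = spam)
pass1 <- step(full, trace = 0)               # cheap first pass: AIC-based stepwise

# expensive second pass: repeated 10-fold CV on a further-reduced candidate
pass2 <- update(pass1, . ~ . - your - free)  # drop the borderline terms
cv_ms <- function(mod) mean(sapply(1:20, function(i)
  attr(cv.lm(spam, mod, m = 10, printit = FALSE, seed = i), "ms")))
c(pass1 = cv_ms(pass1), pass2 = cv_ms(pass2))  # keep whichever has the lower CV mean square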

Finally, definitely don't do a hypothesis test. The p-values above are all wrong, because they do not account for the fact that you chose this model by a stepwise procedure. There is some quite recent work on legitimate p-values for stepwise regression, but not yet on combining it with AIC.
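
(For what it's worth, the selectiveInference R package is one implementation of that recent work: it computes p-values that account for forward-stepwise selection. A minimal sketch, with `X` and `y` as placeholders for the design matrix and response:)

library(selectiveInference)

fsfit <- fs(X, y)  # forward stepwise path
fsInf(fsfit)       # selection-adjusted p-values for the selected variables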
