Solved – Evaluate prediction model – K Fold Cross Validation

cross-validation, predictive-models

I have a dataset of 240 samples with 12 independent variables. From these 12 variables, I would like to identify the significant ones for prediction. I perform Gamma analysis, since the data is highly positively skewed.

Here's what I am doing currently:
1. Split the dataset into 70% training data and 30% test data.
2. Use the 70% training data (168 samples) to build the prediction model with Gamma analysis. I run the analysis a few times, excluding one variable at a time, to arrive at the best final model (sketched below).
3. Then validate the final model using the remaining 30% test data.
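For concreteness, here is a minimal sketch of that workflow, assuming the data sits in a pandas DataFrame and that "Gamma analysis" means a Gamma GLM (fitted here with statsmodels, using a log link). The placeholder data, column names, and seed are illustrative only, not the actual dataset:

```python
import numpy as np
import pandas as pd
import statsmodels.api as sm
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)

# Placeholder data standing in for the real 240 x 12 dataset
X = pd.DataFrame(rng.gamma(2.0, 1.0, size=(240, 12)),
                 columns=[f"x{i}" for i in range(1, 13)])
y = rng.gamma(2.0, 1.0, size=240)          # positively skewed response

# Step 1: 70% training, 30% test
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.30, random_state=42)

# Step 2: Gamma GLM with a log link, fitted on the training data only
model = sm.GLM(y_train, sm.add_constant(X_train),
               family=sm.families.Gamma(link=sm.families.links.Log()))
result = model.fit()
print(result.summary())   # p-values guide which variable to drop next
```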

My problem: When should I use K-fold cross-validation? Is it when building the prediction model on the 70% training data, or after I get the final model, applying K-fold CV to the remaining 30% test data?

Best Answer

Traditionally, you go through three stages of model building and tuning, so you need to split your sample into three parts: a training set, a cross-validation set and a test set.

Training (~60%): In training, you simply estimate your model. You don't make any changes to the model based on its results (accuracy, goodness of fit) on the training data, to avoid overfitting the training set.

Cross-validation (~20%): After training your model, you can tune it - vary hyperparameters, remove features, or even select between different models - based on its performance on the cross-validation set.

As an example, let's say you want to test which variables to include and which to leave out: You specify three different variable combinations (three different models). You train all of them using your training set. Then you evaluate all of them on the cross-validation set and select the one that performs best on the CV set.
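A minimal sketch of that selection step, assuming scikit-learn's GammaRegressor as the model and mean Gamma deviance as the comparison metric; the placeholder data, column subsets, and split proportions below are illustrative, not the asker's actual variables:

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import GammaRegressor
from sklearn.metrics import mean_gamma_deviance
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(1)
X = pd.DataFrame(rng.gamma(2.0, 1.0, size=(240, 12)),
                 columns=[f"x{i}" for i in range(1, 13)])
y = rng.gamma(2.0, 1.0, size=240)

# 60% train, 20% CV; the remaining 20% is kept back as the test set
X_train, X_rest, y_train, y_rest = train_test_split(X, y, test_size=0.40, random_state=0)
X_cv, X_test, y_cv, y_test = train_test_split(X_rest, y_rest, test_size=0.50, random_state=0)

# Three hypothetical variable combinations (three candidate models)
candidates = {
    "model_A": ["x1", "x2", "x3"],
    "model_B": ["x1", "x4", "x5", "x6"],
    "model_C": ["x2", "x7"],
}

scores = {}
for name, cols in candidates.items():
    fit = GammaRegressor().fit(X_train[cols], y_train)                 # estimate on training set only
    scores[name] = mean_gamma_deviance(y_cv, fit.predict(X_cv[cols]))  # compare on the CV set

best = min(scores, key=scores.get)   # lower mean deviance = better on the CV set
print(scores, "best:", best)
```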

K-fold CV: If you are interested in doing K-fold validation, you repeat exactly what's written above, with one major difference: instead of hard-selecting 60% and 20% of the data for your training and CV sets, you run the training and validation procedures K times, each time using a different fold of the data for cross-validation and the rest for training. You then get a set of K results (accuracy, goodness of fit) that you can average to obtain a more robust estimate of your model's performance.

E.g., if you do 10-fold CV, you partition the data into 10 folds and run the procedure 10 times; each time a different fold (10% of your data) serves as the cross-validation set, with the remaining 90% as the training set.
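A minimal sketch of how those 10 folds are generated with scikit-learn's KFold; the 240-row array is just a stand-in for the real data:

```python
import numpy as np
from sklearn.model_selection import KFold

X = np.arange(240).reshape(-1, 1)          # placeholder for the 240 rows
kf = KFold(n_splits=10, shuffle=True, random_state=0)

# Each iteration holds out a different ~10% of the rows as the CV fold
for i, (train_idx, cv_idx) in enumerate(kf.split(X), start=1):
    print(f"fold {i}: train on {len(train_idx)} rows, validate on {len(cv_idx)} rows")
```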

Test set (~20%): After tuning the model and/or selecting the best one, you can test it using the test set. This is data the model has not seen yet, and you shouldn't make any changes to the model based on it. This is the very last stage of building the model, used only to evaluate your final model, not to tune it any further (you don't want to overfit your test set).

If doing K-fold CV, you still have to leave out a test set that is separate from the training/CV data you draw the folds from.

Putting it all together: In your case, you have $N=240$ samples and $12$ variables. The first split of the data would be training/CV (70-80%) and test (20-30%), which in your case means $168-192$ samples for training/CV and $48-72$ for test. Then, to select which variables to include, do K-fold CV for each model (combination of variables) as follows:

  1. Split your training/CV set into K equal (random) subsets.
  2. Estimate your model K times, each time leaving out one of the K subsets.
  3. Cross-validate each estimate with the subset that was left out.
  4. Pool your cross-validation results across all the K estimates.

Then pick the model that performs best in CV (on average). Evaluate it on the test set. Don't change it any more.
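Here is a rough sketch of the whole recipe in code, under the same assumptions as above (scikit-learn's GammaRegressor, Gamma deviance as the score, placeholder data and candidate variable subsets; the 80/20 pool/test split and K=5 are arbitrary choices for illustration):

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import GammaRegressor
from sklearn.metrics import mean_gamma_deviance
from sklearn.model_selection import KFold, cross_val_score, train_test_split

rng = np.random.default_rng(2)
X = pd.DataFrame(rng.gamma(2.0, 1.0, size=(240, 12)),
                 columns=[f"x{i}" for i in range(1, 13)])
y = rng.gamma(2.0, 1.0, size=240)

# Step 1: hold back the test set before any tuning
X_pool, X_test, y_pool, y_test = train_test_split(X, y, test_size=0.20, random_state=0)

# Hypothetical candidate variable combinations (candidate models)
candidates = {
    "model_A": ["x1", "x2", "x3"],
    "model_B": ["x1", "x4", "x5", "x6"],
    "model_C": ["x2", "x7"],
}

# Steps 1-4 of the list above: K-fold CV on the training/CV pool for each model
kf = KFold(n_splits=5, shuffle=True, random_state=0)
cv_scores = {}
for name, cols in candidates.items():
    scores = cross_val_score(GammaRegressor(), X_pool[cols], y_pool,
                             cv=kf, scoring="neg_mean_gamma_deviance")
    cv_scores[name] = scores.mean()        # pooled (averaged) CV performance

# Pick the model that performs best in CV on average
best = max(cv_scores, key=cv_scores.get)   # scores are negated deviance: higher is better

# Final step: refit the chosen model on the whole pool, evaluate once on the test set
final = GammaRegressor().fit(X_pool[candidates[best]], y_pool)
test_dev = mean_gamma_deviance(y_test, final.predict(X_test[candidates[best]]))
print(f"selected {best}; test mean gamma deviance = {test_dev:.3f}")
```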