Solved – Final Model Prediction using K-Fold Cross-Validation and Machine Learning Methods

Tags: cross-validation, machine-learning, regression

Similar threads:

Feature selection for "final" model when performing cross-validation in machine learning

How to choose a predictive model after k-fold cross-validation?


My question is quite simple and is definitely related to the similar threads above, but what I am looking for is a concrete yes/no answer to the question below:

I am working on a regression problem in which I am trying to predict a single response variable using 5 explanatory variables. I have 1200 examples of the response and explanatory data. I split these 1200 examples into a calibration set of 1000 examples and a test set of 200 examples. The calibration set is used to train my model, and the test set is kept completely independent.
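A minimal sketch of this split, assuming the data sit in NumPy arrays (the synthetic `X` and `y` below are placeholders standing in for the real 1200 examples):

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Placeholder data: 1200 examples, 5 explanatory variables, 1 response.
rng = np.random.default_rng(0)
X = rng.normal(size=(1200, 5))
y = X @ rng.normal(size=5) + rng.normal(scale=0.1, size=1200)

# Hold out 200 examples as a completely independent test set;
# the remaining 1000 examples form the calibration set.
X_cal, X_test, y_cal, y_test = train_test_split(X, y, test_size=200, random_state=42)
print(X_cal.shape, X_test.shape)  # (1000, 5) (200, 5)
```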

Let's say I am using a neural network of a particular configuration/parametrization, and I am looking for the network weights and biases that give the best performance on my test set.

To do this I have chosen to perform k-fold cross-validation on the calibration data. Let's say I opt for 10 folds. I thus produce 10 differently calibrated models (one per fold, each trained on that fold's training split and validated on its held-out split), all using the same neural network configuration described above. I now want to use the neural network to produce an output on my test set using the parameters (weights and biases) determined from the k-fold cross-validation. To produce the estimates on the test set do I simply average the weights and biases from each of the 10 different calibrated models and use this parametrization to produce outputs to compare with my test set for the target function?
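For concreteness, the 10-fold setup described here could look like the sketch below, continuing from the split above and using scikit-learn's `KFold` and `MLPRegressor` purely as stand-ins for whatever network implementation is actually in use:

```python
from sklearn.model_selection import KFold
from sklearn.neural_network import MLPRegressor

kf = KFold(n_splits=10, shuffle=True, random_state=0)
fold_models, fold_scores = [], []

for train_idx, val_idx in kf.split(X_cal):
    # The same network configuration for every fold; only the data split changes.
    net = MLPRegressor(hidden_layer_sizes=(10,), alpha=1e-3, max_iter=2000, random_state=0)
    net.fit(X_cal[train_idx], y_cal[train_idx])
    fold_models.append(net)
    fold_scores.append(net.score(X_cal[val_idx], y_cal[val_idx]))  # R^2 on the validation fold
```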

Thank you, everyone, for your help!

Best Answer

"To produce the estimates on the test set do I simply average the weights and biases from each of the 10 different calibrated models and use this parametrization to produce outputs to compare with my test set for the target function?"

No. Cross-validation is a procedure for estimating the test performance of a method of producing a model, rather than of the model itself. So the best thing to do is to perform k-fold cross-validation to determine the best hyper-parameter settings, e.g. the number of hidden units, the values of the regularisation parameters, etc. Then train a single network on the whole calibration set (or train several and pick the one with the best value of the regularised training criterion, to guard against local minima). Evaluate the performance of that final model on the test set.
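As a hedged sketch of that workflow (again using scikit-learn's `GridSearchCV` and `MLPRegressor` as stand-ins, and continuing from the calibration/test split above): cross-validation on the calibration set chooses the hyper-parameters, a single network is then refit on all 1000 calibration examples, and only that final network is evaluated on the test set.

```python
from sklearn.model_selection import GridSearchCV
from sklearn.neural_network import MLPRegressor

# Step 1: 10-fold cross-validation on the calibration set only, used to choose
# hyper-parameters (architecture, regularisation strength), not the final weights.
param_grid = {
    "hidden_layer_sizes": [(5,), (10,), (20,)],  # illustrative candidates
    "alpha": [1e-4, 1e-3, 1e-2],                 # L2 regularisation strength
}
search = GridSearchCV(
    MLPRegressor(max_iter=2000, random_state=0),
    param_grid,
    cv=10,
    scoring="neg_mean_squared_error",
)
search.fit(X_cal, y_cal)

# Step 2: a single network with the chosen hyper-parameters, refit on the whole
# calibration set (GridSearchCV does this automatically because refit=True by default).
final_model = search.best_estimator_

# Step 3: the held-out test set is used once, to evaluate that final model.
print(search.best_params_, final_model.score(X_test, y_test))
```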

In the case of neural networks, averaging the weights and biases of individual models won't work, because different models learn different internal representations, so the corresponding hidden units of different networks represent different (distributed) concepts. If you average their weights, the mean of these concepts will be meaningless.
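The permutation symmetry behind this can be made concrete with a small NumPy sketch (an illustration added here, not part of the original answer): relabelling the hidden units of a one-hidden-layer network leaves its predictions unchanged, so two equally good networks need not have aligned weight matrices, and averaging them element-wise mixes unrelated units.

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(4, 5))  # 4 example inputs with 5 features

# A tiny 1-hidden-layer network: 5 inputs -> 3 tanh hidden units -> 1 output.
W1, b1 = rng.normal(size=(5, 3)), rng.normal(size=3)
W2, b2 = rng.normal(size=(3, 1)), rng.normal(size=1)

def predict(W1, b1, W2, b2, X):
    return np.tanh(X @ W1 + b1) @ W2 + b2

# Permuting the hidden units (and their outgoing weights) gives an equivalent
# network: identical predictions, but completely rearranged weight matrices.
perm = [2, 0, 1]
W1p, b1p, W2p = W1[:, perm], b1[perm], W2[perm, :]
print(np.allclose(predict(W1, b1, W2, b2, X), predict(W1p, b1p, W2p, b2, X)))  # True

# Element-wise averaging of these two *equivalent* networks mixes unrelated
# hidden units and produces different (generally useless) predictions.
W1a, b1a, W2a = (W1 + W1p) / 2, (b1 + b1p) / 2, (W2 + W2p) / 2
print(np.allclose(predict(W1, b1, W2, b2, X), predict(W1a, b1a, W2a, b2, X)))  # False
```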