Machine Learning – Bonferroni Correction and Its Applications in Machine Learning

bonferroni, cross-validation, machine-learning, overfitting, svm

In psychology studies I learned that we should use the Bonferroni method to adjust the significance level when testing several hypotheses on a single dataset.

Currently I am working with machine learning methods such as Support Vector Machines and Random Forests for classification. Here I have a single dataset which is used in cross-validation to find the parameters (such as kernel parameters for an SVM) that yield the best accuracy.
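For concreteness, here is a minimal sketch of the kind of cross-validated parameter search described above, using scikit-learn's GridSearchCV with an RBF-kernel SVM. The synthetic dataset and the particular grid values are illustrative assumptions, not part of the original question:

```python
# Minimal sketch: cross-validated parameter search for an SVM (assumed setup).
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

# Synthetic data standing in for the single dataset mentioned in the question.
X, y = make_classification(n_samples=300, n_features=20, random_state=0)

# Candidate kernel parameters; every combination is scored by 5-fold cross-validation.
param_grid = {"C": [0.1, 1, 10, 100], "gamma": [1e-3, 1e-2, 1e-1, 1]}

search = GridSearchCV(SVC(kernel="rbf"), param_grid, cv=5)
search.fit(X, y)

print(search.best_params_, search.best_score_)
```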

My intuition says (and may be completely off) that this is a similar issue. If I test too many possible parameter combinations, the chance is high that I find one which yields great results, yet this could be just chance.

To sum up my question:

In machine learning we use cross-validation to find the right parameters for a classifier. The more parameter combinations we try, the higher the chance of finding a great one by accident (overfitting?).
Does the concept behind the Bonferroni correction also apply here?
Or is it a different issue? If so, why?

Best Answer

There is a degree to which what you are talking about with p-value correction is related, but there are some details that make the two cases very different. The big one is that in parameter selection there is no independence among the parameter settings you are evaluating, nor in the data you are evaluating them on. For ease of discussion, I will take choosing k in a K-Nearest-Neighbors regression model as an example, but the concept generalizes to other models as well.

Let's say we have a validation instance $V$ whose value we predict for various values of $k$ in order to estimate the model's accuracy. To do this we find the $n$ closest points in the training set, ordered by distance, which we will call $T_1, \dots, T_n$. For $k = 1$ our prediction $P_1$ equals $T_1$; for $k = 2$ the prediction $P_2$ is $(T_1 + T_2)/2 = P_1/2 + T_2/2$; for $k = 3$ it is $(T_1 + T_2 + T_3)/3 = 2P_2/3 + T_3/3$. In fact, for any value of $k$ we can write the prediction as $$P_k = \frac{k-1}{k}\,P_{k-1} + \frac{T_k}{k}.$$ The predictions are therefore not independent of each other, so the accuracies of the predictions won't be either. In fact, as $k$ grows the prediction approaches the mean of the sample. As a result, in most cases testing $k = 1, \dots, 20$ will select the same value of $k$ as testing $k = 1, \dots, 10{,}000$, unless the best fit you can get out of your model is just the mean of the data.
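To make that dependence concrete, here is a small sketch that implements the running-mean recursion above on toy data (the 1-D data and the use of NumPy are illustrative assumptions). It shows that successive predictions are almost perfectly correlated and converge to the sample mean:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy targets of the n nearest neighbours of a single validation instance V,
# ordered by distance (T_1 is the closest).
T = rng.normal(loc=5.0, scale=2.0, size=1000)

# Build the k-NN regression predictions with the running-mean recursion
# P_k = P_{k-1} * (k-1)/k + T_k / k from above.
P = np.empty_like(T)
P[0] = T[0]
for k in range(2, len(T) + 1):
    P[k - 1] = P[k - 2] * (k - 1) / k + T[k - 1] / k

# Successive predictions overlap heavily, so they are strongly correlated,
# and for large k they approach the plain sample mean.
print(np.corrcoef(P[:-1], P[1:])[0, 1])  # close to 1
print(P[-1], T.mean())                   # essentially equal
```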

This is why it is OK to test a bunch of different parameter values on your data without worrying too much about multiple hypothesis testing. Since the effect of the parameters on the prediction isn't random, your prediction accuracy is much less likely to look good due solely to chance. You still have to worry about overfitting, but that is a separate problem from multiple hypothesis testing.

To clarify the difference between multiple hypothesis testing and overfitting, this time imagine building a linear model. If we repeatedly resample data to fit our linear model (the multiple lines in the figure below) and evaluate each one on the same testing data (the dark points), by chance one of the lines will make a good model (the red line). This is not because it is actually a great model, but rather because if you sample enough models, some of them will happen to work on that test set. The important thing to note is that the accuracy looks good on the held-out testing data only because of all the models tested; in fact, since we are picking the "best" model based on the testing data, the model may actually fit the testing data better than the training data.

[figure: multiple hypothesis testing]
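A small simulation sketch of this effect, under assumed pure-noise data: many lines are fit to resampled training sets, and the "best" one is then selected using the same held-out test set, which inflates its apparent accuracy.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)

# Pure-noise data: there is no real relationship between x and y.
x_test = rng.uniform(-1, 1, size=(30, 1))
y_test = rng.normal(size=30)

scores = []
for _ in range(500):
    # Repeatedly resample a small training set and fit a line to it.
    x_train = rng.uniform(-1, 1, size=(10, 1))
    y_train = rng.normal(size=10)
    line = LinearRegression().fit(x_train, y_train)
    # Evaluate every candidate line on the same held-out test data.
    scores.append(line.score(x_test, y_test))

# A typical line is useless, but the best of 500 looks far better than it
# should, purely because we picked it using the test data.
print("mean test R^2:", np.mean(scores))
print("best test R^2:", np.max(scores))
```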

Overfitting, on the other hand, is when you build a single model but contort its parameters so that it fits the training data beyond what is generalizable. In the example below, the model (the line) perfectly fits the training data (empty circles), but when it is evaluated on the testing data (filled circles) the fit is far worse.

[figure: overfitting]
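And a corresponding sketch of overfitting with a single, overly flexible model: a degree-9 polynomial on assumed synthetic data fits the training points essentially perfectly but does much worse on fresh test points drawn from the same underlying line.

```python
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(1)

# Small noisy training set (the empty circles above) and a separate test set
# (the filled circles), both drawn from the same underlying linear relationship.
def make_data(n):
    x = rng.uniform(-1, 1, size=(n, 1))
    y = 2 * x.ravel() + rng.normal(scale=0.3, size=n)
    return x, y

x_train, y_train = make_data(10)
x_test, y_test = make_data(10)

# A degree-9 polynomial has enough parameters to pass through every training
# point, i.e. it is contorted to fit the training data almost exactly.
wiggly = make_pipeline(PolynomialFeatures(degree=9), LinearRegression())
wiggly.fit(x_train, y_train)

print("train R^2:", wiggly.score(x_train, y_train))  # ~1.0
print("test  R^2:", wiggly.score(x_test, y_test))    # much worse
```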
