Solved – Find out if using k-fold cross-validation helped to overcome overfitting (Machine Learning standard)

cross-validation · machine-learning · overfitting

One of the main ways to overcome overfitting is to use $K$-fold cross-validation, and as this paragraph from the Wikipedia page on cross-validation says:

The goal of cross-validation is to estimate the expected level of fit of a model to a data set that is independent of the data that were used to train the model. It can be used to estimate any quantitative measure of fit that is appropriate for the data and model. For example, for binary classification problems, each case in the validation set is either predicted correctly or incorrectly. In this situation the misclassification error rate can be used to summarize the fit, although other measures like positive predictive value could also be used. When the value being predicted is continuously distributed, the mean squared error, root mean squared error or median absolute deviation could be used to summarize the errors.

Somewhere I read that if all $K$ repeats of training and evaluating the model give nearly the same performance, this indicates that we have overcome the overfitting issue, which otherwise, without k-fold CV, would be a major drawback of having about as many features as observations.

I have this general question: when using cross-validation, how can I find out whether it actually helped or not?

In more detail:

  1. How much variance is a good indicator that we no longer have overfitting? (I mean as a concrete number, e.g. < 0.2.)
  2. Maybe the best way to answer this question is to decide which of these two approaches, performance analysis or error analysis, is better? (In more detail: which of mean squared error, root mean squared error, median absolute deviation, or performance-variance measurement is best?)

Note 1: maybe my question is somewhat repetitive, but I have tried my best to read all the related questions asked in the past, and none of them discuss this problem in terms of giving a concrete number. I want to report my result that way (for a problem I previously asked about here), and when I want to say that I overcame overfitting and report a number to indicate this, I don't know exactly what value counts as good.

Note 2: in my project I have 1440 observations and 1000 features, and I used LDA for classification with 10-fold cross-validation. If the accuracy measured on all 10 test sets is nearly the same, does that indicate that overfitting is no longer a problem?

Note 3: I don't know whether this helps, but my work is a supervised learning problem.

Note 4: first of all, I want to thank everyone who answers my question. There is a problem I have whenever I read this site: I work in the machine learning field, so most of the time I have to check the terms used in a given answer to see whether they follow the statistics convention or the machine learning convention. So I would appreciate it if you could say which point of view or convention you use in your answer.

Best Answer

if all $K$ repeats of training and evaluating the model give nearly the same performance, this indicates that we have overcome the overfitting issue

That is not true, see below.

How much variance is a good indicator that we no longer have overfitting? (I mean as a concrete number, e.g. < 0.2.)

Variance may be one of the symptoms of overfitting, but

  • there are more direct indicators of overfitting
  • and you may observe testing variance also for non-overfit models, simply because only a few test cases are available.
  • However, model instability (variance between the surrogate models and between their predictions for the same case) almost always comes with overfitting. Depending on the modeling algorithm, instability may be measured by comparing the fitted model parameters across the $k$ surrogate models (though there can be "spurious" variance between equivalent models that have different parameter sets). Instability can also be measured more directly by comparing the predictions for the same test case across models from different runs/iterations/repetitions of k-fold cross validation; see the sketch after this list.
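
As an illustration, here is a minimal sketch of the prediction-based instability check, assuming a scikit-learn workflow with LDA (as in the question). The data are a small synthetic stand-in, and all sizes and variable names are illustrative assumptions, not part of the original question:

```python
# Sketch: measure model instability across repeated k-fold cross validation.
# Synthetic stand-in data; in the question's setting X would be ~1440 x 1000.
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.model_selection import RepeatedStratifiedKFold

rng = np.random.default_rng(0)
X = rng.normal(size=(300, 50))
y = rng.integers(0, 2, size=300)

n_splits, n_repeats = 10, 20
cv = RepeatedStratifiedKFold(n_splits=n_splits, n_repeats=n_repeats, random_state=0)

pred = np.full((len(y), n_repeats), np.nan)  # predicted label per case and repetition
coefs = []                                   # fitted coefficients of each surrogate model

for i, (train, test) in enumerate(cv.split(X, y)):
    rep = i // n_splits                      # splits are generated repetition by repetition
    model = LinearDiscriminantAnalysis().fit(X[train], y[train])
    pred[test, rep] = model.predict(X[test])
    coefs.append(model.coef_.ravel())

# Prediction instability: fraction of cases whose predicted label changes
# between repetitions (0 = perfectly stable predictions).
flip_rate = np.mean(pred.std(axis=1) > 0)
print(f"cases with unstable predictions: {flip_rate:.3f}")

# Parameter instability: spread of the fitted coefficients across the surrogate models.
print(f"mean coefficient SD across surrogate models: {np.std(coefs, axis=0).mean():.3f}")
```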

which of mean squared error, root mean squared error, median absolute deviation, or performance-variance measurement is best?

The figure of merit doesn't matter at all. It just needs to be a sensible metric of performance for your application.

Note that variance can be calculated for each of the errors / loss functions.

How to measure overfitting

I think the most straightforward way of measuring overfitting is to compare the model's internal estimate of error (e.g. training error, or the inner cross-validation error when data-driven optimization/selection is done) with an external, independent error estimate. A large discrepancy between those two estimates indicates overfitting.
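
A minimal sketch of this comparison, again assuming scikit-learn and LDA; the synthetic data (an assumption purely for illustration) deliberately have far more features than cases, so the gap between the internal and external estimates becomes obvious:

```python
# Sketch: compare the internal (resubstitution/training) error estimate with an
# external estimate from 10-fold cross validation. A large gap signals overfitting.
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.model_selection import StratifiedKFold, cross_val_score

rng = np.random.default_rng(1)
X = rng.normal(size=(200, 1000))   # deliberately n << p, pure noise
y = rng.integers(0, 2, size=200)

model = LinearDiscriminantAnalysis()

# Internal estimate: accuracy on the very data the model was fitted to
train_acc = model.fit(X, y).score(X, y)

# External estimate: 10-fold cross-validated accuracy
cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=1)
cv_acc = cross_val_score(model, X, y, cv=cv).mean()

print(f"training accuracy:        {train_acc:.3f}")  # typically ~1.0 here
print(f"cross-validated accuracy: {cv_acc:.3f}")     # around 0.5 for pure noise
print(f"gap (overfitting signal): {train_acc - cv_acc:.3f}")
```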

"Large" should be judged relative to the variance that results from testing only the finite number of cases you have at hand.
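
For a proportion-type figure of merit such as accuracy, a rough idea of this finite-test-set variance can be obtained from the binomial standard error. The sketch below plugs in the fold size implied by the question (1440 cases, 10 folds) and a hypothetical true accuracy of 0.80, which is purely an assumption for illustration:

```python
# Sketch: expected fold-to-fold scatter of accuracy from finite testing alone
# (binomial approximation).
import math

n_test = 1440 // 10   # cases per test fold with 10-fold CV
p = 0.80              # hypothetical true accuracy (assumption for illustration)

se = math.sqrt(p * (1 - p) / n_test)
print(f"SD of per-fold accuracy expected from finite testing alone: {se:.3f}")
# About 0.033, i.e. per-fold accuracies scattering by a few percentage points
# would be expected even for a perfectly stable, non-overfit model.
```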