To answer this question, you should ask what you want your classifier to be able to do.
If I understand correctly, when you train your classifier on the 'measurement level' you would 'teach' the classifier to distinguish (classify) a set of features of a single subject. This is different from training it to classify any set of features independent of what subject it came from.
Assuming you want your classifier to be able to classify any set of features, independent of what subject those features came from, you should not do any cross-validation on the 'measurement level'.
In this same setting, I do not quite understand why you would consider all the measurements of a single subject as 'one' (in the context of the leave-one-out cross-validation). Did you consider a single set of features (a single measurement), independent of what subject it came from, as being 'one' thing?
[EDIT]: I just discussed this problem with a colleague. Doing cross-validation on the entire dataset (independent of subject) will leak information.
If part of the measurements of a subject are included in the training set and the other measurements of that same subject are in the testing set, this will overestimate the performance! However, if you will never get measurements from new subjects, this is not a problem.
In the case that you do get measurements from new subjects (more likely, I think), it is a good idea to include all the measurements of a subject in the testing set once that subject has been selected to be in that set.
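To make this concrete, here is a minimal sketch (assuming scikit-learn and made-up toy data, so all names and numbers are illustrative) of a subject-wise split with GroupKFold, which keeps every subject's measurements together in a single fold:

```python
import numpy as np
from sklearn.model_selection import GroupKFold, cross_val_score
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X = rng.normal(size=(120, 5))            # 120 measurements, 5 features (toy data)
y = rng.integers(0, 2, size=120)         # binary class labels
subjects = np.repeat(np.arange(12), 10)  # 12 subjects, 10 measurements each

# GroupKFold never puts measurements of the same subject into both the training
# and the testing split, so the model is always tested on unseen subjects.
scores = cross_val_score(SVC(), X, y, groups=subjects, cv=GroupKFold(n_splits=4))
print(scores)
```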
Your understanding sounds good to me, with the possible exception that what you call "run" is, in my field, called either "fold" (as in 5-fold cross validation) if the test data is meant, or "surrogate model" if we're talking about the model.
Yes, the outer folds can return different hyperparameter sets and/or parameters (coefficients).
This is valid in the sense that this is allowed to happen. It is invalid in the sense that this means the optimization (done with the help of the inner folds) is not stable, so you have not actually found "the" [global] optimum.
For the overall model, you're supposed to run the inner cross validation again on the whole data set. I.e., you optimize/auto-tune your hyperparameters on the training set (now the whole data set) just the same as you did during the outer cross validation.
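A minimal sketch of this structure (scikit-learn and toy data assumed; the parameter grid and numbers are just placeholders): the inner GridSearchCV is simply part of the training as seen from the outer CV, and you can inspect which hyperparameters each outer fold ended up with:

```python
import numpy as np
from sklearn.model_selection import GridSearchCV, KFold, cross_validate
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X, y = rng.normal(size=(100, 5)), rng.integers(0, 2, size=100)   # toy data

# Inner CV: hyperparameter tuning. Outer CV: validation of the whole procedure.
inner = GridSearchCV(SVC(), {"C": [0.1, 1, 10], "gamma": [0.01, 0.1]}, cv=3)
outer = cross_validate(inner, X, y, cv=KFold(n_splits=5, shuffle=True, random_state=1),
                       return_estimator=True)

for est in outer["estimator"]:   # one surrogate model per outer fold
    print(est.best_params_)      # the selected hyperparameters may differ between folds
```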
Update: longer explanation
See also Nested cross validation for model selection and How to build the final model and tune probability threshold after nested cross-validation?
Are you saying that to get "the [global] optimum" you need to run your entire dataset on all the combinations of C's, gammas, kernels etc.?
No. In my experience the problem is not that the search space is not explored in detail (all possible combinations) but rather that our measurement of the resulting model performance is subject to uncertainty.
Many strategies coming from numerical optimization implicitly assume that there is negligible noise on the target functional, i.e. that the functional is basically a smooth, continuous function of the hyperparameters. Depending on the figure of merit you optimize and the number of cases you have, this assumption may or may not be met.
If you do have considerable noise on the estimate of the figure of merit but do not take this into account (i.e. the "select the best one" strategy you mention), your observed "optimum" is subject to noise.
In addition, the noise (variance uncertainty) on the performance estimate increases with model complexity. In this situation, a naive "select the best observed performance" strategy can also lead to a bias towards too complex models.
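A toy simulation of this effect (all numbers made up): if many equally good candidates are compared on a noisy performance estimate, the best observed value is optimistically biased:

```python
import numpy as np

rng = np.random.default_rng(0)
true_accuracy = 0.75      # every candidate hyperparameter set is assumed equally good
n_test, n_candidates = 40, 100

# Observed accuracy of each candidate, estimated on a small test set (binomial noise).
observed = rng.binomial(n_test, true_accuracy, size=n_candidates) / n_test

print("true accuracy         :", true_accuracy)
print("best observed accuracy:", observed.max())   # typically well above the true value
```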
See e.g. Cawley, G. C. & Talbot, N. L. C.: On Over-fitting in Model Selection and Subsequent Selection Bias in Performance Evaluation, Journal of Machine Learning Research, 11, 2079-2107 (2010).
How does this get incorporated into the nested cross validation procedure or the final results of the analysis?
Hastie, T.; Tibshirani, R. and Friedman, J.: The Elements of Statistical Learning: Data Mining, Inference, and Prediction. Springer Verlag, New York, 2009, say in chapter 7.10:
Often a "one-standard error" rule is used with cross-validation, in which we choose the most parsimonious model whose error is no more than one standard error above the error of the best model.
Which I find a good heuristic (I take the additional precaution of estimating both the variance uncertainty due to the limited number of cases and that due to model instability – The Elements of Statistical Learning does not discuss this in its cross validation chapter).
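A minimal sketch of that heuristic (with made-up error estimates): among the candidates, take the most parsimonious one whose cross-validation error stays within one standard error of the best:

```python
import numpy as np

# Candidate models ordered from most parsimonious (simplest) to most complex.
cv_error = np.array([0.30, 0.24, 0.21, 0.20, 0.20])   # mean CV error per candidate
cv_se    = np.array([0.03, 0.03, 0.02, 0.02, 0.02])   # standard error of that mean

best = np.argmin(cv_error)
threshold = cv_error[best] + cv_se[best]
chosen = np.flatnonzero(cv_error <= threshold)[0]      # simplest model under the threshold
print("best by raw error:", best, "| chosen by 1-SE rule:", chosen)
```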
So your understanding:
I'm confused because my understanding is that you can't just run your analysis hundreds/thousands of times with different parameters/kernels and select the best one
is correct.
However, your understanding
(and nested CV is supposed to mitigate the associated issues).
may or may not be correct:
- nested CV does not make the hyperparameter optimization any more successful, but it can provide an honest estimate of the performance that can be achieved with that particular optimization strategy.
In other words: it guards against overoptimism about the achieved performance, but it does not improve this performance.
The final model:
- The outer split of the nested CV is basically an ordinary CV for validation/verification. It splits the available data set into training and testing subsets, and then builds a so-called surrogate model using the training set.
- During this training, you happen to do another (the inner) CV, whose performance estimates you use to fix/optimize the hyperparameters. But seen from the outer CV, this is just part of the model training.
The model training on the whole data set should do just the same as the model training inside the cross validation did. Otherwise the surrogate models and their performance estimates would not be a good surrogate for the model trained on the whole data (and that is the real purpose of the surrogate models).
Thus: run the auto-tuning of hyperparameters on the whole data set just as you do during cross validation. Same hyperparameter combinations to consider, same strategy for selecting the optimum. In short: same training algorithm, just slightly different data (1/k additional cases).
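In code, the final model is then nothing more than the same auto-tuning run once on all available data (scikit-learn assumed; the data and parameter grid below are toy placeholders):

```python
import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X, y = rng.normal(size=(100, 5)), rng.integers(0, 2, size=100)   # the whole data set (toy)

param_grid = {"C": [0.1, 1, 10], "gamma": [0.01, 0.1]}   # same search space as in the outer CV
final_model = GridSearchCV(SVC(), param_grid, cv=3)       # same selection strategy as before
final_model.fit(X, y)                                     # only the data changed: all of it
print(final_model.best_params_)
```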
Best Answer
You need to define a procedure that you always follow. I see two valid options here: replace missing values either by a value computed within the subject, or by a value computed within the training set.
Edit:
Which method to follow (replace by a value computed within the subject or by a value computed within the training set) should IMHO be decided from knowledge about the application and the data; we cannot tell you more than very general guidelines here.
Why not by a value computed within the test set? That would mean that the value used to replace NAs in test subject A depends on whether subject A is tested together with subject B or subject C – which doesn't seem to be desirable or sensible behaviour to me.
You may also want to look up "imputation", which is the general term for techniques that fill in missing values.
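A minimal sketch (scikit-learn assumed, tiny made-up data) of imputation that computes the replacement value on the training set only and reuses it unchanged for the test set:

```python
import numpy as np
from sklearn.impute import SimpleImputer

X_train = np.array([[1.0, 2.0], [np.nan, 3.0], [3.0, np.nan]])
X_test  = np.array([[np.nan, 5.0]])

imputer = SimpleImputer(strategy="mean")          # e.g. replace NAs by the column mean
X_train_filled = imputer.fit_transform(X_train)   # statistics learned from training data only
X_test_filled  = imputer.transform(X_test)        # same statistics reused: no test-set leakage
print(X_test_filled)
```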
Centering and scaling (standardization): if you have "external" (scientific) knowledge that suggests that a standardization within the subjects should take place, then go ahead with that. Whether this is sensible depends on your application and data, so we cannot answer this question. For a more general discussion of centering and standardization, see e.g. Variables are often adjusted (e.g. standardised) before making a model - when is this a good idea, and when is it a bad one? and When conducting multiple regression, when should you center your predictor variables & when should you standardize them?
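If such knowledge does suggest within-subject standardization, a minimal sketch (pandas assumed, toy data) could look like this:

```python
import pandas as pd

df = pd.DataFrame({
    "subject": ["A", "A", "A", "B", "B", "B"],
    "feature": [1.0, 2.0, 3.0, 10.0, 12.0, 14.0],
})

# Each subject is centered and scaled using only its own measurements.
df["feature_std"] = df.groupby("subject")["feature"].transform(
    lambda x: (x - x.mean()) / x.std()
)
print(df)
```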
With 4 subjects you probably won't be able to compare classifiers anyway: the uncertainty due to having only 4 test cases is far too high. Can't you fix this parameter from experience with similar data?
To illustrate the problem: assume you observe 4 correct classifications out of 4 test subjects. That gives you a point estimate of 100 % correct predictions. If you look at confidence interval calculations for this, you get e.g. for the Agresti-Coull method a 95 % CI of 45 - 105 % (obviously not very precise with the small sample size); the Bayes method with a uniform prior makes it 55 - 100 %. In any case it means that even if you observe perfect test results, it is not quite clear whether you can claim that the model is actually better than guessing. As long as you do not need to fear that fixing the parameter beforehand will produce a model that is clearly worse than guessing, you cannot measure improvements in the practically important range anyway.
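For reference, a minimal sketch (statsmodels and scipy assumed) reproducing these interval estimates for 4 correct out of 4:

```python
from scipy.stats import beta
from statsmodels.stats.proportion import proportion_confint

# Agresti-Coull 95 % interval for 4 correct out of 4 (not truncated at 100 %).
print(proportion_confint(4, 4, alpha=0.05, method="agresti_coull"))  # roughly (0.45, 1.06)

# Bayes with a uniform prior: the posterior is Beta(1 + 4, 1 + 0) = Beta(5, 1).
print(beta.ppf(0.05, 5, 1))   # one-sided 95 % lower bound, roughly 0.55
```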
The situation may be less drastic if you optimize e.g. the Brier score, but with 4 subjects I'd suspect that you still do not reach the precision you need to detect the expected improvement during the optimization.
Edit: Unfortunately, while 20 subjects are far more than 4, from a classifier validation statistics point of view, 20 is still very few.
We recently concluded that if you need to stick with the frequently used proportions for characterizing your classifier, at least in our field you cannot expect useful precision in the test results with fewer than 75 - 100 test subjects (in the denominator of the proportion!). Again, you may be better off if you can switch to e.g. the Brier score and to a paired design for the classifier comparison, but I'd call it lucky if that gains you a factor of 5 in sample size.
You can find our thoughts here: Beleites, C. and Neugebauer, U. and Bocklitz, T. and Krafft, C. and Popp, J.: Sample size planning for classification models. Anal Chim Acta, 2013, 760, 25-33.
DOI: 10.1016/j.aca.2012.11.007
accepted manuscript on arXiv: 1211.1323
AFAIK, dealing with the random uncertainty on test results during classifier optimization is an unsolved problem. (If not, I'd be extremely interested in papers about the solution!)
So my recommendation would be to do a preliminary experiment/analysis at the end of which you try to estimate the random uncertainty on the comparison results. If that uncertainty does not allow you to optimize (which I'd unfortunately expect to be the outcome), report this result and argue that, in consequence, you have no choice at the moment but to fix the hyperparameters to some sensible (though not optimized) value.
If you do inner cross validation, it would be better to do it subject-wise as well: without this, you'll get overly optimistic inner results. That would not be a problem if the bias were constant. However, it usually isn't, and you have the additional problem that, due to the random uncertainty together with the optimistic bias, you may observe many models that seem to be perfect. Among these you cannot distinguish (after all, they all seem to be perfect), which can completely mess up the optimization.
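If you do run an inner optimization, a minimal sketch (scikit-learn assumed, toy data) of making the inner CV subject-wise as well:

```python
import numpy as np
from sklearn.model_selection import GridSearchCV, GroupKFold
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X, y = rng.normal(size=(120, 5)), rng.integers(0, 2, size=120)   # toy data
subjects = np.repeat(np.arange(12), 10)                          # 12 subjects, 10 measurements each

outer = GroupKFold(n_splits=4)
for train_idx, test_idx in outer.split(X, y, groups=subjects):
    # Inner, subject-wise CV: hyperparameters are tuned on the training subjects only.
    inner = GridSearchCV(SVC(), {"C": [0.1, 1, 10]}, cv=GroupKFold(n_splits=3))
    inner.fit(X[train_idx], y[train_idx], groups=subjects[train_idx])
    print(inner.best_params_, inner.score(X[test_idx], y[test_idx]))
```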
Again, with so few subjects I'd avoid this inner optimization and fix the parameter.